[GitHub] [spark] zhengruifeng commented on pull request #28349: [SPARK-30642][ML][PYSPARK] LinearSVC blockify input vectors

2020-05-05 Thread GitBox
zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-624405669 Merged to master

2020-05-05 Thread GitBox
zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-623930812 > we need to tell the user about this tradeoff in the doc above. @srowen I think there may be other implementations (LoR/LiR/KMeans/GMM/...) that will support

2020-04-28 Thread GitBox
zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-620976390 I will merge this PR this week if nobody objects. Different from the [previous one](https://github.com/apache/spark/pull/27360), this PR will not cause performance

2020-04-26 Thread GitBox
zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-619692445 I also tested on a sparse dataset: ``` import org.apache.spark.ml.classification._ import org.apache.spark.storage.StorageLevel val df =

2020-04-26 Thread GitBox
zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-619512391 The speedup is more significant than that in https://github.com/apache/spark/pull/27360. I think that is because the dataset `epsilon` has 2,000 features while a9a only

2020-04-26 Thread GitBox
zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-619511225 The main part of this PR is similar to https://github.com/apache/spark/pull/27360, while this PR will choose the original impl if `blockSize=1`
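
For readers following the thread, the `blockSize` switch mentioned above can be pictured with a minimal plain-Scala sketch. It is not the code from the PR: the object `BlockifySketch` and all of its method names are hypothetical, it handles dense rows only, and the inner loops stand in for the level-2 BLAS matrix-vector call that a blocked implementation would presumably make. It only shows the idea of stacking rows into blocks and falling back to the per-row path when `blockSize` is 1.

```scala
// Illustrative sketch only; names are hypothetical and this is not the PR's implementation.
object BlockifySketch {

  // Per-row path: one dot product per instance (the "original impl" style).
  def marginsPerRow(rows: Array[Array[Double]], w: Array[Double]): Array[Double] =
    rows.map { x =>
      var s = 0.0
      var j = 0
      while (j < w.length) { s += x(j) * w(j); j += 1 }
      s
    }

  // Stack up to blockSize rows into one row-major array, keeping the row count.
  def blockify(rows: Array[Array[Double]], blockSize: Int): Array[(Int, Array[Double])] = {
    require(blockSize >= 1, "blockSize must be positive")
    rows.grouped(blockSize).map(g => (g.length, g.flatten)).toArray
  }

  // Blocked path: one matrix-vector product per block.
  // A real implementation would hand each block to a BLAS gemv routine instead of these loops.
  def marginsBlocked(blocks: Array[(Int, Array[Double])], w: Array[Double]): Array[Double] =
    blocks.flatMap { case (numRows, values) =>
      val out = new Array[Double](numRows)
      var i = 0
      while (i < numRows) {
        val offset = i * w.length
        var s = 0.0
        var j = 0
        while (j < w.length) { s += values(offset + j) * w(j); j += 1 }
        out(i) = s
        i += 1
      }
      out
    }

  // Choose the per-row path when blockSize is 1, otherwise the blocked path.
  def margins(rows: Array[Array[Double]], w: Array[Double], blockSize: Int): Array[Double] =
    if (blockSize == 1) marginsPerRow(rows, w)
    else marginsBlocked(blockify(rows, blockSize), w)
}
```

Both paths compute the same margins; in this sketch the only difference is how many rows are processed per call, which is the knob the comment above describes.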

2020-04-26 Thread GitBox
zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-619510418 friendly ping @srowen @WeichenXu123 Using high-level BLAS on dense datasets makes SVC much faster than the existing impl, even without NativeBLAS. To avoid

2020-04-26 Thread GitBox
zhengruifeng commented on pull request #28349: URL: https://github.com/apache/spark/pull/28349#issuecomment-619509562 dataset: [epsilon_normalized.t](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html), numInstances=100,000, numFeatures=2,000 testCode: ```