[GitHub] spark pull request #23271: [SPARK-26318][SQL] Enhance function merge perform...

2018-12-10 Thread KyleLi1985
GitHub user KyleLi1985 opened a pull request: https://github.com/apache/spark/pull/23271 [SPARK-26318][SQL] Enhance function merge performance in Row ## What changes were proposed in this pull request? Enhance function merge performance in Row Like do 1 time

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-29 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r237551217 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -128,6 +128,82 @@ class RowMatrix @Since("

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-29 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r237532703 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -128,6 +128,82 @@ class RowMatrix @Since("

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-29 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r237505113 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala --- @@ -128,6 +128,69 @@ class RowMatrix @Since("

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-27 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r236927771 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala --- @@ -266,6 +266,16 @@ class RowMatrixSuite extends

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-27 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/23126#discussion_r236927721 --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java --- @@ -67,7 +66,7 @@ public void testPCA() { JavaRDD dataRDD

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-27 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Added a test case in RowMatrixSuite for this PR. The Breeze output is 6.711333870761802E-11 -3.833375461575691E-12 -3.833375461575691E-12 2.916662578525011E-12 Before

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-27 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 align JavaPCASuite expected data process behavior with PCA function fit --- - To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-26 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Ok, I will do it later

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-26 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Um, the unit tests in Spark do cover both cases. But there is a function closeToZero to handle the accuracy problem, so

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-26 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Sure, the test cases include the sparse and dense cases. Do these cases again for the new commit; we use data from http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-26 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 It would be better, update the commit

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Plus, do some more tests on real data after adding this commit; we use data from http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals and data

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 After adding this commit, we get the result for the RowMatrix computeCovariance function: For the input data 1.0,2.0,3.0,4.0,5.0 2.0,3.0,1.0,2.0,6.0 RowMatrix function

[GitHub] spark issue #23126: [SPARK-26158] [MLLIB] fix covariance accuracy problem fo...

2018-11-23 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/23126 Comparing Spark's computeCovariance function in RowMatrix for DenseVector with NumPy's cov function, we find two problems. Below is the result: 1) The Spark function computeCovarian
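For readers skimming this thread, the kind of comparison it describes can be illustrated with a small plain-Python sketch that computes a sample covariance matrix in two ways. This is only an illustration under stated assumptions: it is not Spark's computeCovariance, not the patch in the pull request, and not NumPy's cov. The function names `cov_two_pass` and `cov_one_pass` and the shifted sample data are hypothetical.

```python
def cov_two_pass(rows):
    """Sample covariance: subtract the column means first, then accumulate."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        centered = [r[j] - means[j] for j in range(d)]
        for j in range(d):
            for k in range(d):
                cov[j][k] += centered[j] * centered[k]
    return [[v / (n - 1) for v in row] for row in cov]


def cov_one_pass(rows):
    """Sample covariance from raw sums: (sum(x*y) - sum(x)*sum(y)/n) / (n - 1)."""
    n, d = len(rows), len(rows[0])
    sums = [sum(r[j] for r in rows) for j in range(d)]
    gram = [[sum(r[j] * r[k] for r in rows) for k in range(d)] for j in range(d)]
    return [[(gram[j][k] - sums[j] * sums[k] / n) / (n - 1) for k in range(d)]
            for j in range(d)]


# Hypothetical data: a small variation sitting on top of a large offset,
# where the raw-sum formula subtracts two nearly equal large numbers.
offset = 1.0e9
rows = [[offset + 1.0, offset + 2.0],
        [offset + 2.0, offset + 3.0],
        [offset + 3.0, offset + 1.0]]

print("two-pass:", cov_two_pass(rows))
print("one-pass:", cov_one_pass(rows))
```

On well-scaled data the two functions agree. With a large offset, the raw-sum version can lose digits to cancellation, while centering first avoids forming the large intermediate products.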

[GitHub] spark pull request #23126: [SPARK-26158] [MLLIB] fix covariance accuracy pro...

2018-11-23 Thread KyleLi1985
GitHub user KyleLi1985 opened a pull request: https://github.com/apache/spark/pull/23126 [SPARK-26158] [MLLIB] fix covariance accuracy problem for DenseVector ## What changes were proposed in this pull request? Enhance accuracy of the covariance logic in RowMatrix for function

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-14 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your investigation. @srowen @HyukjinKwon @mgaido91 Thanks for review. It is my pleas

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-10 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 @SparkQA retest this please

[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

2018-11-10 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/22893#discussion_r232457128 --- Diff: python/pyspark/ml/clustering.py --- @@ -88,6 +88,14 @@ def clusterSizes(self): """ return

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-09 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 @SparkQA test this please

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-09 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 It seems the related file spark/python/pyspark/ml/clustering.py has been changed in recent days. My latest local commit is still at "bfe60fc on 30 Jul". So I need to re-fork spar

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-09 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 @AmplabJenkins test this please

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-08 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 I put together the final test cases for the sparse and dense cases on realistic data to test the new commit [SparkMLlibTest.txt](https://github.com/apache/spark/files/2561442/SparkMLlibTest.txt

[GitHub] spark pull request #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmea...

2018-11-08 Thread KyleLi1985
Github user KyleLi1985 commented on a diff in the pull request: https://github.com/apache/spark/pull/22893#discussion_r231838390 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -521,19 +521,21 @@ object MLUtils extends Logging { * The bound
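The discussion in this thread refers to a precision bound on a fast distance computation. As a rough illustration of the general idea (computing a squared Euclidean distance from precomputed norms, and falling back to the direct formula when cancellation could hurt), here is a plain-Python sketch. It is not the code in MLUtils.scala and not the change under review. The names `l2_norm`, `sq_dist_direct`, and `sq_dist_with_norms`, the form of the bound, and the default `precision` are all assumptions made for this example.

```python
import math
import sys

EPSILON = sys.float_info.epsilon  # machine epsilon for Python floats


def l2_norm(v):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(x * x for x in v))


def sq_dist_direct(a, b):
    """Squared Euclidean distance computed term by term."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sq_dist_with_norms(a, norm_a, b, norm_b, precision=1e-6):
    """Squared distance via ||a||^2 + ||b||^2 - 2 a.b when that looks safe.

    (norm_a - norm_b)^2 is a cheap lower bound on the true squared distance,
    so it gives a pessimistic estimate of the relative rounding error of the
    expanded formula. If the estimate is too large, use the direct formula.
    """
    sum_sq_norm = norm_a * norm_a + norm_b * norm_b
    norm_diff = norm_a - norm_b
    error_bound = 2.0 * EPSILON * sum_sq_norm / (norm_diff * norm_diff + EPSILON)
    if error_bound < precision:
        dot = sum(x * y for x, y in zip(a, b))
        return sum_sq_norm - 2.0 * dot
    return sq_dist_direct(a, b)


# Well-separated vectors take the norm-based path.
a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 8.0]
print(sq_dist_with_norms(a, l2_norm(a), b, l2_norm(b)))

# Nearly identical large vectors trigger the fallback to the direct formula.
c, d = [1.0e8, 1.0], [1.0e8, 1.001]
print(sq_dist_with_norms(c, l2_norm(c), d, l2_norm(d)))
```

The norm-based path pays off when norms are cached and reused across many comparisons; the fallback exists because subtracting two large, nearly equal quantities can wipe out the significant digits of a small distance.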

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-03 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > OK, the Spark part doesn't seem relevant. The input might be more realistic here, yes. I was commenting that your test code doesn't show what you're testing, though I under

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-02 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > So the pull request right now doesn't reflect what you tested, but you tested the version pasted above. You're saying that the optimization just never helps the dense-dense case,

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-02 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where t

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where t

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > Hm, actually that's the best case. You're exercising the case where the code path you prefer is fast. And the case where the precision bound applies is exactly the case where t

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases ` import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.mllib.linalg

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM). > > Are all the vectors dense? I sup

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO. This part only uses F2J for the calculation, as I said in my last comment, so the performance is

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-11-01 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > I don't think BLAS matters here as these are all vector-vector operations and f2jblas is used directly (i.e. stays in the JVM). > > Are all the vectors dense? I sup

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-10-31 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > then I think you have to try with native BLAS installed, otherwise the results are not valid IMHO. OK, for a fair result, I will try

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-10-31 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 > @KyleLi1985 do you have native BLAS installed? As the code says: // For level-1 routines, we use Java implementat

[GitHub] spark issue #22893: [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic...

2018-10-31 Thread KyleLi1985
Github user KyleLi1985 commented on the issue: https://github.com/apache/spark/pull/22893 End-to-end test situation: use the code below to test ` test("kmeanproblem") { val rdd = sc .textFile("/Users/liliang/Desktop/inputdata.txt"

[GitHub] spark pull request #22893: One part of Spark MLlib Kmean Logic Performance p...

2018-10-30 Thread KyleLi1985
GitHub user KyleLi1985 opened a pull request: https://github.com/apache/spark/pull/22893 One part of Spark MLlib Kmean Logic Performance problem [SPARK-25868][MLlib] One part of Spark MLlib Kmean Logic Performance problem ## What changes were proposed in this pull request