GitHub user KyleLi1985 opened a pull request:
https://github.com/apache/spark/pull/23271
[SPARK-26318][SQL] Enhance function merge performance in Row
## What changes were proposed in this pull request?
Enhance the performance of the merge function in Row,
e.g. by doing the merge in one pass …
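Presumably the idea is to build the merged row in a single pass. A minimal sketch of that approach (an assumption for illustration; the actual patch may differ):
```scala
import org.apache.spark.sql.Row

// Merge rows by pre-sizing one values array and copying each row once,
// instead of repeatedly concatenating intermediate collections.
def mergeOnce(rows: Seq[Row]): Row = {
  val values = new Array[Any](rows.iterator.map(_.length).sum)
  var offset = 0
  rows.foreach { row =>
    var i = 0
    while (i < row.length) {
      values(offset + i) = row(i)
      i += 1
    }
    offset += row.length
  }
  Row.fromSeq(values.toSeq)
}
```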
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237551217
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("…
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237532703
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,82 @@ class RowMatrix @Since("…
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r237505113
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala ---
@@ -128,6 +128,69 @@ class RowMatrix @Since("…
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r236927771
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/distributed/RowMatrixSuite.scala ---
@@ -266,6 +266,16 @@ class RowMatrixSuite extends …
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/23126#discussion_r236927721
--- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaPCASuite.java ---
@@ -67,7 +66,7 @@ public void testPCA() {
JavaRDD dataRDD …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Added a test case in RowMatrixSuite for this PR.
The Breeze output is
6.711333870761802E-11  -3.833375461575691E-12
-3.833375461575691E-12  2.916662578525011E-12
Before …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Align JavaPCASuite's expected-data processing behavior with the PCA fit function.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Ok, I will do it later
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Um, the unit tests in Spark do indeed cover both cases. But there is a
closeToZero function to handle the accuracy problem, so …
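For context, a hedged guess at what such a helper looks like (the actual utility in RowMatrixSuite may differ): compare results entrywise within a tolerance rather than exactly, so floating-point noise does not fail the assertion.
```scala
import org.apache.spark.mllib.linalg.Matrix

// Hypothetical tolerance check: true when every entry of the difference
// between the expected and actual matrices is within eps of zero.
def closeToZero(diff: Matrix, eps: Double = 1e-8): Boolean =
  diff.toArray.forall(v => math.abs(v) < eps)
```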
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Sure, the test cases include the sparse and dense cases.
I ran these cases again for the new commit;
we use data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
It would be better; I updated the commit.
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Plus, I did some more tests on real data after adding this commit.
We use data from
http://archive.ics.uci.edu/ml/datasets/EEG+Steady-State+Visual+Evoked+Potential+Signals
and data …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
After adding this commit,
we get the following result for the RowMatrix computeCovariance function.
For the input data
1.0,2.0,3.0,4.0,5.0
2.0,3.0,1.0,2.0,6.0
the RowMatrix function …
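For reference, a minimal sketch (not taken from the PR) that reproduces this check, assuming a live SparkContext named sc:
```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a RowMatrix from the two input rows and print its covariance,
// to compare against the Breeze/NumPy reference values.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0, 4.0, 5.0),
  Vectors.dense(2.0, 3.0, 1.0, 2.0, 6.0)))
println(new RowMatrix(rows).computeCovariance())
```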
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/23126
Comparing the Spark computeCovariance function in RowMatrix for DenseVector
with NumPy's cov function,
I found two problems; below is the result:
1) The Spark function computeCovariance …
GitHub user KyleLi1985 opened a pull request:
https://github.com/apache/spark/pull/23126
[SPARK-26158][MLLIB] Fix covariance accuracy problem for DenseVector
## What changes were proposed in this pull request?
Enhance the accuracy of the covariance logic in RowMatrix for the
computeCovariance function.
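The likely idea behind the fix, sketched here with Breeze (an assumption drawn from this thread, not the patch itself): the shortcut E[xy] - E[x]E[y] cancels catastrophically when values are large relative to their spread, while centering the columns first preserves accuracy.
```scala
import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}

// Sample covariance the numerically stable way: subtract the column means
// first, then accumulate outer products of the centered rows.
def covCentered(rows: Seq[BDV[Double]]): BDM[Double] = {
  val n = rows.length
  val mean = rows.reduce(_ + _) / n.toDouble
  val acc = BDM.zeros[Double](mean.length, mean.length)
  rows.foreach { r =>
    val c = r - mean
    acc += c * c.t
  }
  acc / (n - 1).toDouble
}
```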
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> Thanks @KyleLi1985 this looks like a nice win in the end. Thanks for your
investigation.

@srowen @HyukjinKwon @mgaido91 Thanks for the review. It is my pleasure …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@SparkQA retest this please
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22893#discussion_r232457128
--- Diff: python/pyspark/ml/clustering.py ---
@@ -88,6 +88,14 @@ def clusterSizes(self):
"""
return …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@SparkQA test this please
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
It seems the related file spark/python/pyspark/ml/clustering.py has been
changed during these days. My local latest commit stays at "bfe60fc on 30
Jul", so I need to re-fork Spark …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
@AmplabJenkins test this please
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
I formed the final test case for the sparse case and dense case on realistic
data to test the new commit:
[SparkMLlibTest.txt](https://github.com/apache/spark/files/2561442/SparkMLlibTest.txt)
Github user KyleLi1985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/22893#discussion_r231838390
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -521,19 +521,21 @@ object MLUtils extends Logging {
* The bound …
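For context on the bound being discussed here, a simplified sketch of the fastSquaredDistance idea for dense arrays (the real MLUtils version also handles sparse vectors; the eps constant is a stand-in for Spark's EPSILON):
```scala
// ||a - b||^2 = ||a||^2 + ||b||^2 - 2*(a dot b) is cheap but loses precision
// when the two norms are nearly equal; fall back to the exact sum then.
def fastSquaredDistance(
    a: Array[Double], normA: Double,
    b: Array[Double], normB: Double,
    precision: Double = 1e-6): Double = {
  val eps = 1e-15
  val sumSquaredNorm = normA * normA + normB * normB
  val normDiff = normA - normB
  val bound = 2.0 * eps * sumSquaredNorm / (normDiff * normDiff + eps)
  if (bound < precision) {
    var dot = 0.0
    var i = 0
    while (i < a.length) { dot += a(i) * b(i); i += 1 }
    sumSquaredNorm - 2.0 * dot  // fast path, within the precision bound
  } else {
    var sum = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
    sum  // exact path
  }
}
```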
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> OK, the Spark part doesn't seem relevant. The input might be more
realistic here, yes. I was commenting that your test code doesn't show what
you're testing, though I under…
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> So the pull request right now doesn't reflect what you tested, but you
tested the version pasted above. You're saying that the optimization just never
helps the dense-dense case, …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> Hm, actually that's the best case. You're exercising the case where the
code path you prefer is fast. And the case where the precision bound applies is
exactly the case where the …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
Here is my test for the sparse-sparse, dense-dense, and sparse-dense cases
(the test body is truncated in this digest; see the sketch after this entry):
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg…
```
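Since the test body is cut off above, here is an assumed reconstruction of its shape: time Vectors.sqdist over the three vector-type combinations (the size and sparsity pattern are made up):
```scala
import org.apache.spark.mllib.linalg.Vectors

val n = 100000
val dense = Vectors.dense(Array.fill(n)(scala.util.Random.nextDouble()))
val sparse = Vectors.sparse(n, Array(0, n / 2, n - 1), Array(1.0, 2.0, 3.0))

// Crude timing helper, good enough to compare the three code paths.
def time(label: String)(f: => Double): Unit = {
  val t0 = System.nanoTime()
  val d = f
  println(s"$label: d=$d, ${(System.nanoTime() - t0) / 1e6} ms")
}

time("dense-dense")(Vectors.sqdist(dense, dense))
time("sparse-sparse")(Vectors.sqdist(sparse, sparse))
time("sparse-dense")(Vectors.sqdist(sparse, dense))
```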
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> I don't think BLAS matters here as these are all vector-vector operations
and f2jblas is used directly (i.e. stays in the JVM).
>
> Are all the vectors dense? I sup…
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> then I think you have to try with native BLAS installed, otherwise the
results are not valid IMHO.

This part only uses f2j to calculate, as I said in the last comment, so the
performance is …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> then I think you have to try with native BLAS installed, otherwise the
results are not valid IMHO.

OK, for a fair result, I will try …
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
> @KyleLi1985 do you have native BLAS installed?
As the code says: // For level-1 routines, we use the Java implementation …
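The quoted line refers to MLlib's BLAS wrapper, which pins level-1 routines (dot, axpy, scal) to the pure-JVM f2j implementation regardless of any native BLAS, hence the claim that native BLAS does not matter here. A minimal illustration with the netlib-java API:
```scala
import com.github.fommil.netlib.F2jBLAS

val f2j = new F2jBLAS
val x = Array(1.0, 2.0, 3.0)
val y = Array(4.0, 5.0, 6.0)
// A level-1 dot product runs entirely in the JVM.
println(f2j.ddot(x.length, x, 1, y, 1))  // prints 32.0
```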
Github user KyleLi1985 commented on the issue:
https://github.com/apache/spark/pull/22893
End-to-end test situation:
Use the code below to test (truncated here; a completed sketch follows):
```scala
test("kmeanproblem") {
  val rdd = sc
    .textFile("/Users/liliang/Desktop/inputdata.txt"…
```
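The test is truncated above; a hedged completion of what such an end-to-end run typically looks like (the parsing step, k, and iteration count are assumptions):
```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

test("kmeanproblem") {
  val rdd = sc
    .textFile("/Users/liliang/Desktop/inputdata.txt")
    .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
    .cache()
  val start = System.nanoTime()
  val model = KMeans.train(rdd, k = 10, maxIterations = 20)
  println(s"cost=${model.computeCost(rdd)}, " +
    s"took ${(System.nanoTime() - start) / 1e9} s")
}
```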
GitHub user KyleLi1985 opened a pull request:
https://github.com/apache/spark/pull/22893
[SPARK-25868][MLlib] One part of Spark MLlib KMeans logic performance problem
## What changes were proposed in this pull request?