Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-103178283
Ok, thx!
Sent from my iPhone
> On May 18, 2015, at 8:47 PM, Sean Owen wrote:
>
> We can't directly but there's an aut
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-103169503
Sure, but I'm traveling now. Would you mind closing it for me?
Sent from my iPhone
> On May 18, 2015, at 8:19 PM, Sean Owen wrote:
>
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2634#discussion_r23507379
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2634#discussion_r23495418
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2634#discussion_r23430407
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/4144#issuecomment-70969423
@mengxr
You may want to refer to the newer code in GitHub
<https://github.com/derrickburns/generalized-kmeans-clustering> instead of
the old PR
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/4144#issuecomment-70948894
@mengxr
FYI, I'm about to work on the performance of clustering millions of sparse
vectors of very high dimension particularly when using KL diver
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/4144#issuecomment-70946876
@mengxr
A nit: I'd pull the range creation on line 331 out of the inner loop.
Sent from my iPhone
> On Jan 21, 2015, at 3:37 PM,
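The micro-optimization suggested above — hoisting range creation out of an inner loop — can be sketched in plain Scala. The names below are illustrative, not the PR's actual code:

```scala
object HoistRange {
  // Before: `0 until k` allocates a fresh Range on every pass through
  // the outer loop.
  def sumBefore(points: Array[Array[Double]], k: Int): Double = {
    var total = 0.0
    for (p <- points; j <- 0 until k) total += p(j) // range rebuilt per point
    total
  }

  // After: the range is created once, outside the loop, and reused.
  def sumAfter(points: Array[Array[Double]], k: Int): Double = {
    val dims = 0 until k // hoisted out of the inner loop
    var total = 0.0
    for (p <- points; j <- dims) total += p(j)
    total
  }
}
```

Both versions compute the same result; the second simply avoids repeated allocation in a hot loop.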
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/4144#issuecomment-70946101
@mengxr
It looks like the final costs RDD is still persisted on exit from the
initialization method.
Sent from my iPhone
> On Jan 21, 2
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/4144#issuecomment-70941228
@mengxr I back-ported your port to my com.massivedatascience.clusterer
GitHub project (modulo the conversion of data to dense form). :)
---
If your project is set
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/4144#issuecomment-70940665
The conversion of vectors to dense form will only work if the dimension of
the space is small, in which case there was little need to provide vectors in
sparse form
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2634#discussion_r23198417
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2634#discussion_r23198020
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-70589276
In my application (n-gram contexts), the sparse vectors can be of extremely
high dimension. To make the problem manageable, I select the k most important
dimensions
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-70583162
@mengxr
One more thing regarding sparse vectors. Sparse vectors can become dense
under cluster creation, which, in turn, can cause the running time of
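The densification concern above can be illustrated with a small sketch (hypothetical types, not the PR's code): the mean of sparse vectors is supported on the union of the inputs' supports, so centroids trend toward dense even when every input is sparse.

```scala
object Densify {
  // A sparse vector as a map from index to nonzero value.
  type SparseVec = Map[Int, Double]

  // The mean of sparse vectors is supported on the UNION of their
  // supports: each distinct nonzero index in any input survives.
  def mean(vs: Seq[SparseVec]): SparseVec = {
    val sums = vs.flatten.groupMapReduce(_._1)(_._2)(_ + _)
    sums.map { case (i, s) => i -> s / vs.size }
  }
}
```

Averaging three vectors that are each nonzero in a single distinct coordinate yields a centroid with three nonzeros, and in general k-means centroids over high-dimensional sparse data fill in quickly.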
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-70443890
@mengxr
I have implemented several variants of Kullback-Leibler divergence in
my separate
GitHub repository
<https://github.com/derrickbu
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-70345238
I see the problem with sparse vectors and the KL divergence.
I implemented a smoothing operation to approximate KL divergence.
Sent from my iPhone
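A smoothing operation of the kind mentioned above might look roughly like the following sketch (an assumption about the approach, not the PR's actual implementation): add a small epsilon to every coordinate so zeros in q no longer make KL(p||q) infinite, then renormalize.

```scala
object SmoothedKL {
  // KL(p || q) with additive smoothing: shifting both distributions by
  // eps and renormalizing keeps the divergence finite when q has zeros.
  def kl(p: Array[Double], q: Array[Double], eps: Double = 1e-9): Double = {
    def smooth(v: Array[Double]): Array[Double] = {
      val shifted = v.map(_ + eps)
      val z = shifted.sum
      shifted.map(_ / z)
    }
    val (ps, qs) = (smooth(p), smooth(q))
    ps.zip(qs).map { case (pi, qi) => pi * math.log(pi / qi) }.sum
  }
}
```

With this smoothing, KL(p||p) is still zero, and a zero coordinate in q contributes a large but finite penalty instead of infinity.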
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2634#discussion_r22774169
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2634#discussion_r22768878
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2634#discussion_r22764827
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/FastEuclideanOps.scala
---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-69404412
Thanks for the information on the speedup that you obtained by eliminating
Breeze. I was unaware that the performance is so poor. To what do you
attribute
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-68985129
The pull request that you integrated on December 3 is redundant with this
one. Therefore, you need not worry about merging in those changes. Simply
select this
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-68804845
I've said this before, so please forgive me for being repetitive.
The new implementation is a rewrite, not a patch, so it is not possible to
parce
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-68425655
Thx for taking this on.
Sent from my iPhone
> On Dec 30, 2014, at 3:23 PM, Xiangrui Meng
wrote:
>
> @derrickburns I'm goin
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-68414436
No problem. Thx!
Sent from my iPhone
> On Dec 30, 2014, at 3:23 PM, Xiangrui Meng
wrote:
>
> @derrickburns I'm goin
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-68218597
That would be great!
On Sat, Dec 27, 2014 at 12:59 PM, Nicholas Chammas wrote:
> @mengxr <https://github.com/mengxr> Now that 1.2.0 is ou
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-60336780
@mengxr Is there interest in this pull request? Should I delete it?
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-58449199
The closure data capture problem occurs at MultiKMeans.scala:105.
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-58126789
@mengxr I restored the KMeans class and public methods on that class. I did
not mark them for deprecation.
I also re-formatted the code to follow the Spark
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-58122976
Ah, I see. Will fix.
Sent from my iPhone
> On Oct 6, 2014, at 5:55 PM, Xiangrui Meng
wrote:
>
> @derrickburns I don't know
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-58106509
@mengxr
Is there an IntelliJ or Eclipse configuration that I can use to reformat
the code according to the guidelines?
The breaking change in
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-58106085
I know exactly which closure is exceeding the size limitation. The problem
is that I cannot see how to make the closure capture less data!
On Mon, Oct 6
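The closure-size problem discussed in this thread is the classic Spark pitfall: a lambda that references a field of an enclosing object forces serialization of the entire object. The usual fix is to copy just the needed values into local vals before building the closure. A minimal non-Spark sketch of the pattern, with illustrative names:

```scala
class BigModel(val centers: Array[Array[Double]], val payload: Array[Byte]) {

  // BAD: the lambda reads `centers` through `this`, so serializing the
  // closure drags along the whole BigModel, including the large payload.
  def distFnCapturingThis: Array[Double] => Double =
    point => point.zip(centers(0)).map { case (a, b) => (a - b) * (a - b) }.sum

  // GOOD: copy only what the closure needs into a local val first;
  // the lambda then captures just that small array.
  def distFnCapturingLocal: Array[Double] => Double = {
    val c0 = centers(0) // local copy: the closure captures only this
    point => point.zip(c0).map { case (a, b) => (a - b) * (a - b) }.sum
  }
}
```

Both functions compute the same squared distance; only the second keeps the serialized closure small when shipped to executors.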
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-57894117
I ran the style tests. They pass. Is there something else in the style guide
that is not captured in the tests?
I have expended much effort to avoid
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-57714029
@mengxr
This is as expected. I need help in solving the closure data capture
problem.
Sent from my iPhone
> On Oct 2, 2014, at 2:10
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2634#issuecomment-57698333
@mengxr I created a new clean pull request. I *still need help* to
understand/fix a closure that is capturing too much data.
GitHub user derrickburns opened a pull request:
https://github.com/apache/spark/pull/2634
[SPARK-3218, SPARK-3219, SPARK-3261, SPARK-3424] [RESUBMIT] MLLIB K-Means
Clusterer
This commit introduces a general distance function trait, `PointOps`, for
the Spark K-Means clusterer
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-57696626
I will close this pull request and create another.
Github user derrickburns closed the pull request at:
https://github.com/apache/spark/pull/2419
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-57691032
Uh oh.
Sent from my iPhone
> On Oct 2, 2014, at 11:35 AM, Nicholas Chammas
wrote:
>
> @derrickburns Side note: It looks like
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-57404998
@mengxr
1. I fixed the merge issue and also remerged to capture more recent changes.
2. I did as you suggested and introduced a local variable to hold a
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-57238377
Will do
Sent from my iPhone
> On Sep 29, 2014, at 2:32 PM, Xiangrui Meng
wrote:
>
> @derrickburns Could you merge the PR cleanly
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-57023670
@mengxr I ran the tests according to your instructions. One issue remains
that I cannot resolve. *I need help with that issue*.
## Issues
First
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-56907546
@mengxr I will do as you suggest.
On Thu, Sep 25, 2014 at 4:15 PM, Xiangrui Meng
wrote:
> @derrickburns <https://github.com/derrickburns
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-56775648
I know that this may not be the most pressing issue for anyone on Spark,
but I would like to complete this work. I cannot do that without some help.
@mengxr
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-56422112
I still need some help with determining why some tests fail.
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-56285643
I deleted that file in my original pull request.
Sent from Mailbox
On Fri, Sep 19, 2014 at 4:32 PM, Nicholas Chammas
wrote:
> FYI
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-56127880
I don't understand the test failure. Can someone help me?
Sent from my iPhone
> On Sep 16, 2014, at 6:59 PM, Nicholas Chammas
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-55996597
@mengxr, can someone help me to understand the test failure?
The test "task size should be small in both training and prediction" in
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-55992851
Jenkins, retest this please.
On Wed, Sep 17, 2014 at 9:08 PM, Derrick Burns
wrote:
> Tests fixed. Please re-run.
>
> On We
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-55992818
Tests fixed. Please re-run.
On Wed, Sep 17, 2014 at 7:04 PM, Apache Spark QA
wrote:
> QA tests have finished
>
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-55972212
@mengxr per your request, here is a pull request that addresses many of the
outstanding issues with the 1.1.0 Spark K-Means clusterer.
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-55971460
To understand and evaluate this pull request, I would suggest that a
reviewer do the following:
1) Look at the `PointOps` trait and its `FastEuclideanOps
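For readers without access to the diff, the reviewing entry point described above — a pluggable distance-function trait — might look roughly like this sketch (hypothetical signatures; the actual `PointOps` and `FastEuclideanOps` in the PR differ):

```scala
// A pluggable distance function for the clusterer: implementations can
// specialize the metric (Euclidean, KL, etc.) and how centroids are formed.
trait PointOps {
  def distance(p: Array[Double], c: Array[Double]): Double
  def centroid(points: Seq[Array[Double]]): Array[Double]
}

// A straightforward Euclidean implementation of the trait.
object EuclideanOps extends PointOps {
  def distance(p: Array[Double], c: Array[Double]): Double =
    math.sqrt(p.zip(c).map { case (a, b) => (a - b) * (a - b) }.sum)

  def centroid(points: Seq[Array[Double]]): Array[Double] = {
    val dim = points.head.length
    val sums = Array.ofDim[Double](dim)
    for (p <- points; i <- 0 until dim) sums(i) += p(i)
    sums.map(_ / points.size)
  }
}
```

The point of the abstraction is that the k-means driver code depends only on the trait, so swapping in a different divergence does not touch the clustering loop.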
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-55968915
@srowen, I moved most of the line comments into code comments that I have
committed. Thx!
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-55898846
I agree that some of my comments should go in the code.
As for the Big Bang change, I understand your concern. The distance
function touches practically
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640478
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeans.scala ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640379
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640296
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -87,7 +87,7 @@ class KMeansSuite extends FunSuite with
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640286
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/clustering/KMeansSuite.scala ---
@@ -75,7 +75,7 @@ class KMeansSuite extends FunSuite with
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640161
--- Diff:
mllib/src/test/java/org/apache/spark/mllib/clustering/JavaKMeansSuite.java ---
@@ -76,15 +76,12 @@ public void runKMeansUsingConstructor
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640135
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/package.scala ---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640123
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/metrics/EuclideanOps.scala
---
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640107
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeansClusterer.scala
---
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640097
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeans.scala ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640049
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeans.scala ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640037
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/MultiKMeans.scala ---
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17640018
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala ---
@@ -1,127 +0,0 @@
-/*
- * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17639992
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala ---
@@ -1,127 +0,0 @@
-/*
- * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17639973
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansRandom.scala ---
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17639919
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansPlusPlus.scala ---
@@ -0,0 +1,198 @@
+/*
+ * Licensed to the Apache
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17639873
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansParallel.scala ---
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17639860
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansParallel.scala ---
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache
Github user derrickburns commented on a diff in the pull request:
https://github.com/apache/spark/pull/2419#discussion_r17639842
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansParallel.scala ---
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache
Github user derrickburns commented on the pull request:
https://github.com/apache/spark/pull/2419#issuecomment-55830264
Thanks @nchammas, I added the Apache license headers.
GitHub user derrickburns opened a pull request:
https://github.com/apache/spark/pull/2419
This commit addresses SPARK-3218, SPARK-3219, SPARK-3261, and SPARK-3424...
This commit introduces a general distance function (PointOps) for the
KMeans clusterer. There are no