[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread freeman-lab
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61234517 @mengxr I implemented the new parameterization (and tried to make the docs on it more intuitive), see what you think!

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61234935 [Test build #22607 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22607/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61241988 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61241985 [Test build #22607 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22607/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61319592 @freeman-lab I made some changes (see https://github.com/freeman-lab/spark/pull/1 ), which include the following: 1. discount on previous counts 2. detecting

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61354018 [Test build #22673 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22673/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61356545 [Test build #22673 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22673/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61356547 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread freeman-lab
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61356758 @mengxr great updates! LGTM. Just need to update the doc/examples in a couple of places, I think.

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61356950 [Test build #22677 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22677/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61358356 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-61358857 LGTM. Merged into master. Thanks for adding streaming k-means!

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-31 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2942

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60880665 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-29 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60880661 [Test build #22428 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22428/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19454416 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19454435 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60794676 @anantasty This PR is still in review. If you are interested in Python bindings for streaming algorithms, could you help add one for StreamingLinearRegression? Thanks!

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread anantasty
Github user anantasty commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60795596 I would certainly be interested in doing that. I just wasn't sure if it was better to do it as a separate PR/task.

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread freeman-lab
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60796301 @anantasty Agreed, should be separate, but would be very cool to have! Ping me as well, happy to provide feedback.

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490145 --- Diff: docs/mllib-clustering.md --- @@ -153,3 +153,75 @@ provided in the [Self-Contained Applications](quick-start.html#self-contained-ap section of

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490241 --- Diff: docs/mllib-clustering.md --- @@ -153,3 +153,75 @@ provided in the [Self-Contained Applications](quick-start.html#self-contained-ap section of

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490261 --- Diff: docs/mllib-clustering.md --- @@ -153,3 +153,75 @@ provided in the [Self-Contained Applications](quick-start.html#self-contained-ap section of

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490254 --- Diff: docs/mllib-clustering.md --- @@ -153,3 +153,75 @@ provided in the [Self-Contained Applications](quick-start.html#self-contained-ap section of

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490284 --- Diff: docs/mllib-clustering.md --- @@ -153,3 +153,75 @@ provided in the [Self-Contained Applications](quick-start.html#self-contained-ap section of

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490351 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490369 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490345 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490338 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeans.scala --- @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490483 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490476 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490470 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490486 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490467 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490527 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490523 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19490587 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19492147 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19492141 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19492205 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60806389 @freeman-lab I made a quick pass over the implementation. It looks great! I will check the math and the test code with someone who knows everything about streaming

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread freeman-lab
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60850276 @mengxr @coderxiang @rxin Thanks all for the feedback! I'm implementing these changes.

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60873448 Had an offline discussion with @freeman-lab . We decided to introduce the concept of `timeUnit` to describe decay. A `timeUnit` (like a second) could be either a `batch`
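
The message above is cut off, so the exact semantics of `timeUnit` are not visible here. As a rough sketch only, one way a decay could depend on the time unit is to apply the decay factor once per batch, or once per newly arrived point. The function name `discount`, the unit strings `"batches"` and `"points"`, and the per-point exponent are assumptions made for illustration, not something this thread confirms.

```python
def discount(decay_factor, time_unit, num_new_points):
    """Return the factor applied to old cluster weights for one update.

    Hypothetical helper: the unit names and the per-point rule are
    assumptions, not taken from the (truncated) message.
    """
    if not 0.0 <= decay_factor <= 1.0:
        raise ValueError("decay_factor must be in [0, 1]")
    if time_unit == "batches":
        # One unit of time passes per batch, regardless of its size.
        return decay_factor
    if time_unit == "points":
        # One unit of time passes per point, so a larger batch
        # forgets more of the past.
        return decay_factor ** num_new_points
    raise ValueError("unknown time unit: %r" % (time_unit,))
```

Under these assumptions, a batch-based unit forgets at a fixed rate per update, while a point-based unit ties forgetting to data volume.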

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60875441 [Test build #22426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22426/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60875506 [Test build #22426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22426/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60875507 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-28 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60876198 [Test build #22428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22428/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-27 Thread anantasty
Github user anantasty commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60554980 Should we create another PR for the Python bindings/example?

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-27 Thread coderxiang
Github user coderxiang commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19446715 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-27 Thread coderxiang
Github user coderxiang commented on a diff in the pull request: https://github.com/apache/spark/pull/2942#discussion_r19446734 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala --- @@ -0,0 +1,246 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-25 Thread freeman-lab
GitHub user freeman-lab opened a pull request: https://github.com/apache/spark/pull/2942 Streaming KMeans [MLLIB][SPARK-3254] This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which
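
The PR description is truncated here, so the following is only a minimal sketch of what a mini-batch k-means update with a decay factor could look like for a single cluster, not the code from the PR. It assumes the old center is weighted by its previous count scaled by the decay factor, and the new points are weighted by their count. The names `update_center`, `weight`, `batch_sum`, and `batch_count` are hypothetical.

```python
def update_center(center, weight, batch_sum, batch_count, decay):
    """Update one cluster center from one batch (illustrative sketch).

    center      -- current center, a list of floats
    weight      -- effective number of points seen so far for this cluster
    batch_sum   -- coordinate-wise sum of the batch points assigned to it
    batch_count -- number of batch points assigned to it
    decay       -- factor in [0, 1]; 1 keeps all history, 0 forgets it
    """
    discounted = weight * decay          # shrink the influence of old data
    new_weight = discounted + batch_count
    if new_weight == 0:
        # Nothing old is retained and nothing new arrived: keep the center.
        return list(center), 0.0
    new_center = [
        (c * discounted + s) / new_weight
        for c, s in zip(center, batch_sum)
    ]
    return new_center, new_weight
```

With `decay = 1.0` this reduces to a running average over all points seen so far, and with `decay = 0.0` the center becomes the mean of the latest batch alone.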

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60475562 [Test build #22209 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22209/consoleFull) for PR 2942 at commit

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60477107 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: Streaming KMeans [MLLIB][SPARK-3254]

2014-10-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2942#issuecomment-60477105 [Test build #22209 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22209/consoleFull) for PR 2942 at commit