[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825913#comment-15825913
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user thvasilo commented on the issue:

https://github.com/apache/flink/pull/757
  
Sure @sachingoel0101, feel free to split up the PRs to reduce overhead.

For additional initialization schemes, let me throw [this recent NIPS paper](https://papers.nips.cc/paper/6478-fast-and-provably-good-seedings-for-k-means)
in there, as it might be relatively easy to implement, but we can also add it 
later.


> Add Initialization schemes for K-means clustering
> -
>
> Key: FLINK-2131
> URL: https://issues.apache.org/jira/browse/FLINK-2131
> Project: Flink
> Issue Type: Task
> Components: Machine Learning Library
> Reporter: Sachin Goel
> Assignee: Sachin Goel
>
> Lloyd's algorithm [KMeans] takes initial centroids as its input. However, 
> if the user doesn't provide the initial centers, they may ask for a 
> particular initialization scheme to be followed. The most commonly used 
> are:
> 1. Random initialization: Self-explanatory
> 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> 3. kmeans||: http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> For very large data sets, or for large values of k, the kmeans|| method is 
> preferred, as it provides the same approximation guarantees as kmeans++ and 
> requires fewer passes over the input data.
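
To make the second scheme in the list above concrete, here is a minimal sketch of kmeans++ seeding in plain Scala. It is not the code from the pull request: it works on an in-memory collection rather than a Flink `DataSet`, and the names `KMeansPlusPlusSketch`, `Point`, `seed`, and `nearestSquaredDistance` are hypothetical, chosen only for illustration.

```scala
import scala.util.Random

// Hypothetical, in-memory illustration of kmeans++ seeding (not the Flink PR's code).
object KMeansPlusPlusSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Squared distance from `p` to its nearest already-chosen center.
  def nearestSquaredDistance(p: Point, centers: Seq[Point]): Double =
    centers.map(c => squaredDistance(p, c)).min

  // The first center is drawn uniformly; each later center is drawn with
  // probability proportional to its squared distance to the nearest chosen center.
  def seed(points: IndexedSeq[Point], k: Int, rng: Random): Vector[Point] = {
    require(k >= 1 && k <= points.length, "k must be between 1 and the number of points")
    var centers = Vector(points(rng.nextInt(points.length)))
    while (centers.length < k) {
      val weights = points.map(p => nearestSquaredDistance(p, centers))
      val total = weights.sum
      val next =
        if (total == 0.0) {
          // Every point coincides with a chosen center; fall back to a uniform draw.
          points(rng.nextInt(points.length))
        } else {
          val threshold = rng.nextDouble() * total
          // Walk the cumulative weights until the threshold is passed.
          val cumulative = weights.scanLeft(0.0)(_ + _).tail
          val idx = cumulative.indexWhere(_ > threshold)
          points(if (idx >= 0) idx else points.length - 1)
        }
      centers = centers :+ next
    }
    centers
  }
}
```

Each round needs a full pass over the data to recompute the weights, so k centers cost k passes; this is the overhead that kmeans|| is meant to reduce by sampling several candidates per pass.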



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825906#comment-15825906
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on the issue:

https://github.com/apache/flink/pull/757
  
@sachingoel0101 no problem been there ;) That would be good thnx!



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825896#comment-15825896
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the issue:

https://github.com/apache/flink/pull/757
  
@skonto Been a bit busy. My apologies. 
I was working on this again some time back and would like to split this 
into two PRs: one for K-means itself, another for adding initialization 
schemes. How does that sound? 
Managing everything at once is a bit of a headache because the first two 
commits are from two other contributors. 
I'll try to push a commit in the next 2-3 days. 



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825878#comment-15825878
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on the issue:

https://github.com/apache/flink/pull/757
  
@sachingoel0101 @tillrohrmann  ?



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2016-10-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556922#comment-15556922
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the issue:

https://github.com/apache/flink/pull/757
  
I'll update based on your comments in a few days. ^^

On Oct 8, 2016 06:21, "Stavros Kontopoulos" wrote:

> @sachingoel0101 @tillrohrmann any plans for this PR?




[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2016-10-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15556280#comment-15556280
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on the issue:

https://github.com/apache/flink/pull/757
  
@sachingoel0101 @tillrohrmann any plans for this PR?



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537379#comment-15537379
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81433030
  
--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala ---
@@ -0,0 +1,614 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichFilterFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.FlinkMLTools.ModuloKeyPartitioner
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BLAS, Vector}
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+import scala.util.Random
+
+
+/**
+ * Implements the KMeans algorithm which calculates cluster centroids based on set of training data
+ * points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
--- End diff --

Add a reference for kmeans|| too, e.g. Bahmani et al. 
Same for kmeans++.




[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537374#comment-15537374
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81432816
  
--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala ---
@@ -0,0 +1,614 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichFilterFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.FlinkMLTools.ModuloKeyPartitioner
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BLAS, Vector}
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+import scala.util.Random
+
+
+/**
+ * Implements the KMeans algorithm which calculates cluster centroids based on set of training data
+ * points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m_1^{(1)}, …, m_k^{(1)}` (see below), the algorithm proceeds by alternating
+ * between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means).
+ *
+ * `S_i^{(t)} = { x_p : || x_p - m_i^{(t)} ||^2 ≤ || x_p - m_j^{(t)} ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m_i^{(t+1)} = ( 1 / |S_i^{(t)}| ) \sum_{x_j \in S_i^{(t)}} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *   val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *   val kmeans = KMeans()
+ *     .setInitialCentroids(initialCentroids)
+ *     .setNumIterations(10)
+ *
+ *   kmeans.fit(trainingDS)
+ *
+ *   // getting the computed centroids
+ *   val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *   // get matching clusters for new points
+ *   val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *   val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations to recalculate the centroids of the clusters. As it
+ * is a heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * centroids of the clusters and the reassignment of the data points will be repeated till the
+ * given number
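
The assignment and update steps described in the Scaladoc quoted above can be sketched in plain Scala, independent of Flink's `DataSet` API. This is only an illustration under stated assumptions, not the implementation in the pull request: `LloydSketch`, `nearest`, `step`, and `run` are hypothetical names, and leaving an empty cluster's centroid unchanged is a choice made for this sketch.

```scala
// Hypothetical, in-memory illustration of Lloyd's iteration (not the Flink PR's code).
object LloydSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the centroid with the smallest squared distance to `p`.
  def nearest(p: Point, centroids: IndexedSeq[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))

  // Update step: each centroid becomes the mean of the points assigned to it.
  // Assumption for this sketch: a centroid with no assigned points stays where it is.
  def step(points: Seq[Point], centroids: IndexedSeq[Point]): IndexedSeq[Point] = {
    val clusters = points.groupBy(p => nearest(p, centroids))
    centroids.indices.map { i =>
      clusters.get(i) match {
        case Some(members) =>
          Array.tabulate(centroids(i).length)(d => members.map(_(d)).sum / members.length)
        case None => centroids(i)
      }
    }
  }

  // Run a fixed number of iterations, in the spirit of a NumIterations-style parameter.
  def run(points: Seq[Point], initial: IndexedSeq[Point], numIterations: Int): IndexedSeq[Point] =
    (1 to numIterations).foldLeft(initial)((cs, _) => step(points, cs))
}
```

Because the loop runs for a fixed number of iterations rather than until convergence, the result depends on the initial centroids, which is why the choice of initialization scheme matters.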

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537307#comment-15537307
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81430771
  
--- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/clustering/KMeansITSuite.scala ---
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.math
+import org.apache.flink.ml.math.DenseVector
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+class KMeansITSuite extends FlatSpec with Matchers with FlinkTestBase {
+
+  behavior of "The KMeans implementation"
+
+  def fixture = new {
+    val env = ExecutionEnvironment.getExecutionEnvironment
+    val kmeans = KMeans().
+      setInitialCentroids(ClusteringData.centroidData).
+      setNumIterations(ClusteringData.iterations)
+
+    val trainingDS = env.fromCollection(ClusteringData.trainingData)
+
+    kmeans.fit(trainingDS)
+  }
+
+  it should "cluster data points into 'K' cluster centers" in {
+    val f = fixture
+
+    val centroidsResult = f.kmeans.centroids.get.collect().apply(0)
+
+    val centroidsExpected = ClusteringData.expectedCentroids
+
+    // the sizes must match
+    centroidsResult.length should be === centroidsExpected.length
+
+    // create a lookup table for better matching
+    val expectedMap = centroidsExpected map (e => e.label->e.vector.asInstanceOf[DenseVector]) toMap
+
+    // each of the results must be in lookup table
+    centroidsResult.iterator.foreach(result => {
+      val expectedVector = expectedMap.get(result.label).get
+
+      // the type must match (not None)
+      expectedVector shouldBe a [math.DenseVector]
+
+      val expectedData = expectedVector.asInstanceOf[DenseVector].data
+      val resultData = result.vector.asInstanceOf[DenseVector].data
+
+      // match the individual values of the vector
+      expectedData zip resultData foreach {
+        case (expectedVector, entryVector) =>
+          entryVector should be(expectedVector +- 0.1)
+      }
+    })
+  }
+
+  it should "predict points to cluster centers" in {
+    val f = fixture
+
+    val vectorsWithExpectedLabels = ClusteringData.testData
+    // create a lookup table for better matching
+    val expectedMap = vectorsWithExpectedLabels map (v =>
+      v.vector.asInstanceOf[DenseVector] -> v.label
+      ) toMap
+
+    // calculate the vector to cluster mapping on the plain vectors
+    val plainVectors = vectorsWithExpectedLabels.map(v => v.vector)
+    val predictedVectors = f.kmeans.predict(f.env.fromCollection(plainVectors))
+
+    // check if all vectors were labeled correctly
+    predictedVectors.collect() foreach (result => {
+      val expectedLabel = expectedMap.get(result._1.asInstanceOf[DenseVector]).get
+      result._2 should be(expectedLabel)
+    })
+
+  }
+
+  it should "initialize k cluster centers randomly" in {
+
+    val env = ExecutionEnvironment.getExecutionEnvironment
+    val kmeans = KMeans()
+      .setNumClusters(10)
+      .setNumIterations(ClusteringData.iterations)
+      .setInitializationStrategy("random")
+
+    val trainingDS = env.fromCollection(ClusteringData.trainingData)
+    kmeans.fit(trainingDS)
+
+    println(trainingDS.mapWithBcVariable(kmeans.centroids.get) {
+      (vector, centroid) => Math.pow(ClusteringData.MinClusterDistance(vector, centroid)._1, 2)
+    }.reduce(_ + _).collect().toArray.apply(0))
+  }
+
+  it should

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537304#comment-15537304
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81430683
  
--- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/clustering/KMeansITSuite.scala ---
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.math
+import org.apache.flink.ml.math.DenseVector
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+class KMeansITSuite extends FlatSpec with Matchers with FlinkTestBase {
+
+  behavior of "The KMeans implementation"
+
+  def fixture = new {
+    val env = ExecutionEnvironment.getExecutionEnvironment
+    val kmeans = KMeans().
+      setInitialCentroids(ClusteringData.centroidData).
+      setNumIterations(ClusteringData.iterations)
+
+    val trainingDS = env.fromCollection(ClusteringData.trainingData)
+
+    kmeans.fit(trainingDS)
+  }
+
+  it should "cluster data points into 'K' cluster centers" in {
+    val f = fixture
+
+    val centroidsResult = f.kmeans.centroids.get.collect().apply(0)
+
+    val centroidsExpected = ClusteringData.expectedCentroids
+
+    // the sizes must match
+    centroidsResult.length should be === centroidsExpected.length
+
+    // create a lookup table for better matching
+    val expectedMap = centroidsExpected map (e => e.label->e.vector.asInstanceOf[DenseVector]) toMap
+
+    // each of the results must be in lookup table
+    centroidsResult.iterator.foreach(result => {
+      val expectedVector = expectedMap.get(result.label).get
+
+      // the type must match (not None)
+      expectedVector shouldBe a [math.DenseVector]
+
+      val expectedData = expectedVector.asInstanceOf[DenseVector].data
+      val resultData = result.vector.asInstanceOf[DenseVector].data
+
+      // match the individual values of the vector
+      expectedData zip resultData foreach {
+        case (expectedVector, entryVector) =>
+          entryVector should be(expectedVector +- 0.1)
+      }
+    })
+  }
+
+  it should "predict points to cluster centers" in {
+    val f = fixture
+
+    val vectorsWithExpectedLabels = ClusteringData.testData
+    // create a lookup table for better matching
+    val expectedMap = vectorsWithExpectedLabels map (v =>
+      v.vector.asInstanceOf[DenseVector] -> v.label
+      ) toMap
+
+    // calculate the vector to cluster mapping on the plain vectors
+    val plainVectors = vectorsWithExpectedLabels.map(v => v.vector)
+    val predictedVectors = f.kmeans.predict(f.env.fromCollection(plainVectors))
+
+    // check if all vectors were labeled correctly
+    predictedVectors.collect() foreach (result => {
+      val expectedLabel = expectedMap.get(result._1.asInstanceOf[DenseVector]).get
+      result._2 should be(expectedLabel)
+    })
+
+  }
+
+  it should "initialize k cluster centers randomly" in {
+
+    val env = ExecutionEnvironment.getExecutionEnvironment
+    val kmeans = KMeans()
+      .setNumClusters(10)
+      .setNumIterations(ClusteringData.iterations)
+      .setInitializationStrategy("random")
+
+    val trainingDS = env.fromCollection(ClusteringData.trainingData)
+    kmeans.fit(trainingDS)
+
+    println(trainingDS.mapWithBcVariable(kmeans.centroids.get) {
--- End diff --

assertion?
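
One way the printed cost could become an assertion is to compare it against a baseline that any reasonable clustering should beat. The sketch below is only a suggestion in plain Scala, not code from the pull request: `CostAssertionSketch`, `cost`, and `mean` are hypothetical names, and using the single-centroid cost as an upper bound is an assumption about what the test intends to verify.

```scala
// Hypothetical helper showing how a cost value could be asserted on instead of printed.
object CostAssertionSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Within-cluster sum of squares: each point contributes its squared
  // distance to the nearest centroid.
  def cost(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => centroids.map(c => squaredDistance(p, c)).min).sum

  // Component-wise mean of all points, used as a trivial one-cluster baseline.
  def mean(points: Seq[Point]): Point =
    Array.tabulate(points.head.length)(d => points.map(_(d)).sum / points.length)

  // True when the fitted centroids are no worse than a single centroid at the mean.
  def noWorseThanSingleCluster(points: Seq[Point], centroids: Seq[Point]): Boolean =
    cost(points, centroids) <= cost(points, Seq(mean(points)))
}
```

In the ScalaTest suite this could read roughly as `computedCost should be <= baselineCost`, where both values are collected from the data set; the exact bound is for the PR author to choose.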



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537293#comment-15537293
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81430196
  
--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala ---
@@ -0,0 +1,614 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichFilterFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.FlinkMLTools.ModuloKeyPartitioner
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BLAS, Vector}
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+import scala.util.Random
+
+
+/**
+ * Implements the KMeans algorithm which calculates cluster centroids based on set of training data
+ * points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m_1^{(1)}, …, m_k^{(1)}` (see below), the algorithm proceeds by alternating
+ * between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means).
+ *
+ * `S_i^{(t)} = { x_p : || x_p - m_i^{(t)} ||^2 ≤ || x_p - m_j^{(t)} ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m_i^{(t+1)} = ( 1 / |S_i^{(t)}| ) \sum_{x_j \in S_i^{(t)}} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *   val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *   val kmeans = KMeans()
+ *     .setInitialCentroids(initialCentroids)
+ *     .setNumIterations(10)
+ *
+ *   kmeans.fit(trainingDS)
+ *
+ *   // getting the computed centroids
+ *   val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *   // get matching clusters for new points
+ *   val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *   val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations to recalculate the centroids of the clusters. As it
+ * is a heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * centroids of the clusters and the reassignment of the data points will be repeated till the
+ * given number

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537288#comment-15537288
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81430070
  
--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala ---
@@ -0,0 +1,614 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichFilterFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.FlinkMLTools.ModuloKeyPartitioner
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BLAS, Vector}
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+import scala.util.Random
+
+
+/**
+ * Implements the KMeans algorithm which calculates cluster centroids based on a set of training data
+ * points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m^{(1)}_1, …, m^{(1)}_k` (see below), the algorithm proceeds by alternating
+ * between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means).
+ *
+ * `S^{(t)}_i = { x_p : || x_p - m^{(t)}_i ||^2 ≤ || x_p - m^{(t)}_j ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}_i`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m^{(t+1)}_i = ( 1 / |S^{(t)}_i| ) \sum_{x_j \in S^{(t)}_i} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *   val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *   val kmeans = KMeans()
+ *     .setInitialCentroids(initialCentroids)
+ *     .setNumIterations(10)
+ *
+ *   kmeans.fit(trainingDS)
+ *
+ *   // getting the computed centroids
+ *   val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *   // get matching clusters for new points
+ *   val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *   val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations to recalculate the centroids of the clusters. As it
+ * is a heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * centroids of the clusters and the reassignment of the data points will be repeated till the
+ * given number
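
For readers following the thread, the assignment and update steps described in the Scaladoc above can be sketched in plain Scala without Flink. This is only an illustration and not the code from the pull request: `LloydSketch`, `Point`, `sqDist`, `nearest`, `update` and `lloyd` are hypothetical names, and keeping a centroid unchanged when no point is assigned to it is an assumption of this sketch.

```scala
object LloydSketch {
  // Hypothetical point type for this sketch: a dense vector of doubles.
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the centroid with the smallest squared distance.
  // Each point gets exactly one index; ties are resolved by `minBy`.
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Update step: every centroid becomes the arithmetic mean of its assigned points.
  // Assumption of this sketch: a centroid without assigned points is kept as it is.
  def update(points: Seq[Point], centroids: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => nearest(p, centroids))
    centroids.indices.map { i =>
      groups.get(i) match {
        case Some(ps) => ps.transpose.map(_.sum / ps.size).toVector
        case None     => centroids(i)
      }
    }
  }

  // Alternate both steps for a fixed number of iterations.
  def lloyd(points: Seq[Point], init: Seq[Point], numIterations: Int): Seq[Point] =
    (1 to numIterations).foldLeft(init)((cs, _) => update(points, cs))
}
```

Calling `LloydSketch.lloyd(points, initialCentroids, 10)` mirrors the fixed iteration count of `setNumIterations(10)` in the example; there is no convergence check in this sketch.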

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537282#comment-15537282
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81429867
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537259#comment-15537259
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81429258
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537261#comment-15537261
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on the issue:

https://github.com/apache/flink/pull/757
  
any progress with this PR?


> Add Initialization schemes for K-means clustering
> -
>
> Key: FLINK-2131
> URL: https://issues.apache.org/jira/browse/FLINK-2131
> Project: Flink
>  Issue Type: Task
>  Components: Machine Learning Library
>Reporter: Sachin Goel
>Assignee: Sachin Goel
>
> The Lloyd's [KMeans] algorithm takes initial centroids as its input. However, 
> in case the user doesn't provide the initial centers, they may ask for a 
> particular initialization scheme to be followed. The most commonly used are 
> these:
> 1. Random initialization: Self-explanatory
> 2. kmeans++ initialization: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
> 3. kmeans|| : http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
> For very large data sets, or for large values of k, the kmeans|| method is 
> preferred as it provides the same approximation guarantees as kmeans++ and 
> requires fewer passes over the input data.
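
As a rough illustration of scheme 2 from the issue description, kmeans++ seeding picks the first center uniformly at random and every further center with probability proportional to its squared distance from the nearest center chosen so far. The sketch below is plain Scala on local collections, not the Flink implementation under review; `KMeansPlusPlusSketch`, `sampleIndex` and `init` are hypothetical names.

```scala
import scala.util.Random

object KMeansPlusPlusSketch {
  // Hypothetical point type for this sketch: a dense vector of doubles.
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Draw an index with probability proportional to the non-negative weights.
  def sampleIndex(weights: Seq[Double], rnd: Random): Int = {
    val target = rnd.nextDouble() * weights.sum
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    val i = cumulative.indexWhere(_ > target)
    // Fall back to the last index if rounding or all-zero weights leave no match.
    if (i >= 0) i else weights.size - 1
  }

  // kmeans++ seeding: one uniform draw, then k - 1 distance-weighted draws.
  def init(points: IndexedSeq[Point], k: Int, rnd: Random): Seq[Point] = {
    require(k >= 1 && k <= points.size, "k must be between 1 and the number of points")
    val first = points(rnd.nextInt(points.size))
    (1 until k).foldLeft(Vector(first)) { (centers, _) =>
      // Squared distance of every point to its nearest already chosen center.
      val d2 = points.map(p => centers.map(c => sqDist(p, c)).min)
      centers :+ points(sampleIndex(d2, rnd))
    }
  }
}
```

Each additional center needs one pass over the data here, which is why the issue prefers kmeans|| for large k; a distributed version would have to replace the local `map` and `sum` calls with DataSet operations.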



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537255#comment-15537255
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user skonto commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r81429036
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2016-05-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15307258#comment-15307258
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user chiwanpark commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-222599628
  
Any updates on this? If there is no update, I would like to push additional 
effort based on this.



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2016-02-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15136131#comment-15136131
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-180943918
  
Thanks for the review. I will push a fix soon. 



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134205#comment-15134205
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52017468
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134208#comment-15134208
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52017873
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134220#comment-15134220
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52018656
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,614 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichFilterFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.FlinkMLTools.ModuloKeyPartitioner
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BLAS, Vector}
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+import scala.util.Random
+
+
+/**
+ * Implements the KMeans algorithm which calculates cluster centroids based on a set of training data
+ * points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m^{(1)}_1, …, m^{(1)}_k` (see below), the algorithm proceeds
+ * by alternating between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means).
+ *
+ * `S^{(t)}_i = { x_p : || x_p - m^{(t)}_i ||^2 ≤ || x_p - m^{(t)}_j ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m^{(t+1)}_i = ( 1 / |S^{(t)}_i| ) \sum_{x_j \in S^{(t)}_i} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *   val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *   val kmeans = KMeans()
+ *     .setInitialCentroids(initialCentroids)
+ *     .setNumIterations(10)
+ *
+ *   kmeans.fit(trainingDS)
+ *
+ *   // getting the computed centroids
+ *   val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *   // get matching clusters for new points
+ *   val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *   val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations to recalculate the centroids of the clusters. As it
+ * is a heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * centroids of the clusters and the reassignment of the data points will be repeated till the
+ * given
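
For readers following the quoted Scaladoc, the assignment and update steps it describes can be sketched in plain Scala. This is only an illustration on local collections, not the code from the pull request: `LloydSketch`, `Point`, `sqDist`, `nearest`, `update` and `lloyd` are hypothetical names, points are plain `Vector[Double]` rather than Flink's `org.apache.flink.ml.math.Vector`, and keeping an empty cluster's centroid unchanged is an assumption of the sketch.

```scala
object LloydSketch {
  // Hypothetical local stand-in for a data point; not Flink's Vector type.
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Assignment step: index of the centroid nearest to p.
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Update step: each centroid becomes the mean of the points assigned to it.
  // A centroid with no assigned points is kept unchanged (assumption of this sketch).
  def update(points: Seq[Point], centroids: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => nearest(p, centroids))
    centroids.indices.map { i =>
      groups.get(i) match {
        case Some(members) =>
          // Transpose to per-dimension columns, then average each column.
          members.transpose.map(_.sum / members.size).toVector
        case None => centroids(i)
      }
    }
  }

  // Run a fixed number of iterations, mirroring the role of NumIterations.
  def lloyd(points: Seq[Point], init: Seq[Point], numIterations: Int): Seq[Point] =
    (1 to numIterations).foldLeft(init)((current, _) => update(points, current))
}
```

Calling `LloydSketch.lloyd(points, initialCentroids, 10)` returns the centroids after ten rounds, and `LloydSketch.nearest(p, centroids)` then plays the role that `predict` plays in the quoted example. A distributed version would replace the local `groupBy` with a keyed aggregation over the data set.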

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134221#comment-15134221
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52018897
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134225#comment-15134225
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52019088
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134228#comment-15134228
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52019259
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134233#comment-15134233
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52019827
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134198#comment-15134198
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52017071
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134196#comment-15134196
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52017027
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134236#comment-15134236
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52020220
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---
@@ -0,0 +1,614 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.flink.ml.clustering
+
+import org.apache.flink.api.common.functions.RichFilterFunction
+import org.apache.flink.api.java.functions.FunctionAnnotation.ForwardedFields
+import org.apache.flink.api.scala.{DataSet, _}
+import org.apache.flink.configuration.Configuration
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.FlinkMLTools.ModuloKeyPartitioner
+import org.apache.flink.ml.common.{LabeledVector, _}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BLAS, Vector}
+import org.apache.flink.ml.metrics.distances.EuclideanDistanceMetric
+import org.apache.flink.ml.pipeline._
+
+import scala.collection.JavaConverters._
+import scala.util.Random
+
+
+/**
+ * Implements the KMeans algorithm which calculates cluster centroids based on a set of training data
+ * points and a set of k initial centroids.
+ *
+ * [[KMeans]] is a [[Predictor]] which needs to be trained on a set of data points and can then be
+ * used to assign new points to the learned cluster centroids.
+ *
+ * The KMeans algorithm works as described on Wikipedia
+ * (http://en.wikipedia.org/wiki/K-means_clustering):
+ *
+ * Given an initial set of k means `m_1^{(1)}, …, m_k^{(1)}` (see below), the algorithm proceeds by
+ * alternating between two steps:
+ *
+ * ===Assignment step:===
+ *
+ * Assign each observation to the cluster whose mean yields the least within-cluster sum of
+ * squares (WCSS). Since the sum of squares is the squared Euclidean distance, this is intuitively
+ * the "nearest" mean. (Mathematically, this means partitioning the observations according to the
+ * Voronoi diagram generated by the means).
+ *
+ * `S_i^{(t)} = { x_p : || x_p - m_i^{(t)} ||^2 ≤ || x_p - m_j^{(t)} ||^2 \forall j, 1 ≤ j ≤ k }`,
+ * where each `x_p` is assigned to exactly one `S^{(t)}`, even if it could be assigned to two or
+ * more of them.
+ *
+ * ===Update step:===
+ *
+ * Calculate the new means to be the centroids of the observations in the new clusters.
+ *
+ * `m_i^{(t+1)} = ( 1 / |S_i^{(t)}| ) \sum_{x_j \in S_i^{(t)}} x_j`
+ *
+ * Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster
+ * sum of squares (WCSS) objective.
+ *
+ * @example
+ * {{{
+ *   val trainingDS: DataSet[Vector] = env.fromCollection(Clustering.trainingData)
+ *   val initialCentroids: DataSet[LabeledVector] = env.fromCollection(Clustering.initCentroids)
+ *
+ *   val kmeans = KMeans()
+ *     .setInitialCentroids(initialCentroids)
+ *     .setNumIterations(10)
+ *
+ *   kmeans.fit(trainingDS)
+ *
+ *   // getting the computed centroids
+ *   val centroidsResult = kmeans.centroids.get.collect()
+ *
+ *   // get matching clusters for new points
+ *   val testDS: DataSet[Vector] = env.fromCollection(Clustering.testData)
+ *   val clusters: DataSet[LabeledVector] = kmeans.predict(testDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[org.apache.flink.ml.clustering.KMeans.NumIterations]]:
+ * Defines the number of iterations to recalculate the centroids of the clusters. As it
+ * is a heuristic algorithm, there is no guarantee that it will converge to the global optimum. The
+ * centroids of the clusters and the reassignment of the data points will be repeated till the
+ * given

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15134239#comment-15134239
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/757#discussion_r52020543
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/clustering/KMeans.scala
 ---

[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-11-12 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002297#comment-15002297
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-156148237
  
Ping.



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-07-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629545#comment-14629545
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-121924666
  
Okay. @tillrohrmann, can you review this?



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-07-16 Thread ASF GitHub Bot (JIRA)

[
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629575#comment-14629575
]

ASF GitHub Bot commented on FLINK-2131:
---

Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-121928501

Will do, once I have some free time. Currently I have to finish the high
availability support. I'll try to hurry up.

On Thu, Jul 16, 2015 at 12:50 PM, Sachin Goel notificati...@github.com
wrote:

Okay. @tillrohrmann https://github.com/tillrohrmann, can you review
this?


[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-07-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627996#comment-14627996
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-121605233
  
@thvasilo I've incorporated different initialization strategies in the 
KMeans algorithm itself. Please review.



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-07-15 Thread ASF GitHub Bot (JIRA)

[
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628551#comment-14628551
]

ASF GitHub Bot commented on FLINK-2131:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-121713405

Hello Sachin,

I'm currently on vacation until August, so someone else needs to do the
reviews until then.

Regards,
Theodore

--
Sent from a mobile device. May contain autocorrect errors.
On Jul 15, 2015 3:50 PM, Sachin Goel notificati...@github.com wrote:

@thvasilo https://github.com/thvasilo I've incorporated different
initialization strategies in the KMeans algorithm itself. Please review.


[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

2015-07-01 Thread ASF GitHub Bot (JIRA)

[
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610503#comment-14610503
]

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117722491

@thvasilo, right now there aren't other features in the library which
need sampling, so perhaps it isn't a good idea to file a separate feature request.
But if the need arises, I'll certainly be willing to write general sampling code.


[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607912#comment-14607912
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117049680
  
Further, the probability distribution doesn't need to be scaled down to 
[0, 1]. We just take care of that while building the cumulative 
distribution.
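
A minimal sketch of why no scaling is needed, assuming non-negative weights with a positive total. The object and method names are illustrative and are not taken from the PR. Instead of dividing every weight by the total, the random draw is stretched to the total:

```scala
import scala.util.Random

// Hypothetical helper, not the PR's code.
object UnnormalizedSamplerSketch {
  /** Picks an index with probability proportional to its non-negative weight. */
  def pickWeighted(weights: Seq[Double], rng: Random): Int = {
    require(weights.nonEmpty, "weights must not be empty")
    // Running sums of the raw weights; the last entry is the total weight.
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    // Draw u in [0, total) rather than normalizing the weights to [0, 1].
    val u = rng.nextDouble() * cumulative.last
    val idx = cumulative.indexWhere(u < _)
    // Fall back to the last index if round-off pushes u past the total.
    if (idx >= 0) idx else weights.length - 1
  }
}
```

For example, `UnnormalizedSamplerSketch.pickWeighted(Seq(2.0, 3.0, 5.0), new Random())` should return index 2 about half of the time, just as the normalized weights 0.2, 0.3 and 0.5 would.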



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607946#comment-14607946
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117054526
  
OK, thanks for the explanation. I will look at this PR this week hopefully.



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607916#comment-14607916
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117051023
  
Sorry about the formatting though. I'll fix it. I haven't worked on this in 
a while. 
I'll incorporate your suggestions from the previous PR.



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607895#comment-14607895
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117042891
  
Hello Sachin, could you explain what the discrete sampler does?



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

[
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607907#comment-14607907
]

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117047575

Hi @thvasilo, thanks for taking the time to go through it.
Consider for example a probability distribution P(X_0) = 0.2, P(X_1) = 0.3,
P(X_2) = 0.5.
To sample an element out of X_0, X_1 and X_2, we can generate a random
number, but we need to map intervals of real numbers to the values X_0, X_1 and
X_2. This is what the discreteSampler does.
It forms a cumulative distribution as [0.2, 0.5, 1.0] and then, if the
generated random number is in [0, 0.2), we pick X_0; if it is in [0.2, 0.5), we
pick X_1; and if it is in [0.5, 1.0), we pick X_2.
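
The interval mapping described here can be sketched as follows. This is not the PR's discreteSampler. The names `DiscreteSamplerSketch`, `cumulative` and `pick` are made up for illustration, and the sketch assumes the probabilities already sum to 1:

```scala
// Hypothetical illustration of sampling via a cumulative distribution.
object DiscreteSamplerSketch {
  /** Turns probabilities such as (0.2, 0.3, 0.5) into running sums (0.2, 0.5, 1.0). */
  def cumulative(probs: Seq[Double]): Vector[Double] =
    probs.scanLeft(0.0)(_ + _).tail.toVector

  /** Maps u in [0, 1) to the index of the first interval whose upper bound exceeds u. */
  def pick(cdf: Seq[Double], u: Double): Int = {
    val idx = cdf.indexWhere(u < _)
    // Guard against floating-point round-off in the last running sum.
    if (idx >= 0) idx else cdf.length - 1
  }
}
```

With `val cdf = DiscreteSamplerSketch.cumulative(Seq(0.2, 0.3, 0.5))`, a draw of 0.1 maps to index 0 (X_0), 0.2 maps to index 1 (X_1), and 0.75 maps to index 2 (X_2). In practice `u` would come from something like `scala.util.Random.nextDouble()`.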


[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607954#comment-14607954
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117055846
  
Okay. I'll update it today itself with a few trivial fixes.



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering

[
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608434#comment-14608434
]

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 commented on the pull request:

https://github.com/apache/flink/pull/757#issuecomment-117220314

Hey @thvasilo, I'm going to break up this PR further. The motivation is
that the sampling code should be available as a general feature. Given a
probability distribution over the data, a user should be able to sample as many
points as they want.

The Sampler will take as input the DataSet, the number of samples required, and
a function which determines the relative probability of a particular element
being picked, along with a flag specifying whether the elements should be
sampled with or without replacement.
Let me know your thoughts. I'll work out a version in the meantime. If this
is desirable, I will file a JIRA ticket and open a separate PR.
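
A rough sketch of what such a sampler interface might look like. To keep it self-contained it works on a local `Seq` rather than a Flink DataSet, which is an assumption of this sketch. The name `WeightedSamplerSketch` and its signature are hypothetical and do not come from the PR:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical local stand-in for the proposed general-purpose Sampler.
object WeightedSamplerSketch {
  /** Draws numSamples elements, each picked with probability proportional to weight(x). */
  def sample[T](data: Seq[T], numSamples: Int, weight: T => Double,
                withReplacement: Boolean, rng: Random): Seq[T] = {
    require(numSamples >= 0, "numSamples must be non-negative")
    require(numSamples == 0 || data.nonEmpty, "cannot sample from empty data")
    require(withReplacement || numSamples <= data.length,
      "cannot draw more distinct elements than the data contains")
    val pool = ArrayBuffer.empty[(T, Double)]
    pool ++= data.map(x => (x, weight(x)))
    val result = ArrayBuffer.empty[T]
    while (result.length < numSamples) {
      val total = pool.iterator.map(_._2).sum
      // Walk the pool, subtracting weights until the draw falls inside an element's interval.
      var u = rng.nextDouble() * total
      var i = 0
      while (i < pool.length - 1 && u >= pool(i)._2) { u -= pool(i)._2; i += 1 }
      result += pool(i)._1
      // Without replacement, a picked element leaves the pool so it cannot be drawn again.
      if (!withReplacement) pool.remove(i)
    }
    result.toList
  }
}
```

A call such as `WeightedSamplerSketch.sample(Seq("a", "b", "c"), 2, (s: String) => 1.0, withReplacement = false, rng = new Random())` would return two distinct elements chosen uniformly. A distributed version over a DataSet would need a different strategy, since this sketch rescans the whole pool for every draw.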


[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608437#comment-14608437
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

GitHub user sachingoel0101 reopened a pull request:

https://github.com/apache/flink/pull/757

[FLINK-2131][ml]: Initialization schemes for k-means clustering

This adds the two most common initialization strategies for the k-means 
clustering algorithm, namely random initialization and kmeans++ initialization.
Further details are at https://issues.apache.org/jira/browse/FLINK-2131
[Edit]: Work on kmeans|| has been started and just needs to be finalized.
[Edit]: kmeans|| implementation finished. 
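
For readers unfamiliar with kmeans++, here is a minimal local sketch of the seeding idea from the paper linked in the issue: the first center is chosen uniformly at random, and each further center is chosen with probability proportional to its squared distance from the nearest center picked so far. This is an illustration on in-memory arrays, not the code in this pull request. The object name `KMeansPlusPlusSketch` and its methods are hypothetical:

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

// Hypothetical in-memory illustration of kmeans++ seeding.
object KMeansPlusPlusSketch {
  type Point = Array[Double]

  def squaredDistance(a: Point, b: Point): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
    sum
  }

  def init(points: IndexedSeq[Point], k: Int, rng: Random): IndexedSeq[Point] = {
    require(k >= 1 && k <= points.length, "need 1 <= k <= number of points")
    // The first center is a uniformly random data point.
    val centers = ArrayBuffer(points(rng.nextInt(points.length)))
    // minDist(i) is the squared distance from point i to its nearest chosen center.
    val minDist = points.map(p => squaredDistance(p, centers.head)).toArray
    while (centers.length < k) {
      val total = minDist.sum
      val next =
        if (total == 0.0) rng.nextInt(points.length) // every point coincides with a center
        else {
          // Sample an index with probability minDist(i) / total.
          var u = rng.nextDouble() * total
          var i = 0
          while (i < points.length - 1 && u >= minDist(i)) { u -= minDist(i); i += 1 }
          i
        }
      centers += points(next)
      // Only the newly added center can shrink a point's nearest-center distance.
      for (i <- points.indices)
        minDist(i) = math.min(minDist(i), squaredDistance(points(i), points(next)))
    }
    centers.toVector
  }
}
```

Each round makes one pass over the data, so seeding k centers takes k passes. Reducing that number of passes is the motivation the issue gives for kmeans||.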

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sachingoel0101/flink clustering_initializations

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/757.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #757


commit dc2de88bf5e3148bb116cad607fc3c61d9dceac6
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-02T06:44:30Z

Random and kmeans++ initialization methods added

commit 4a39a19c1425259c71ac6d922b4d9a9f2e7d1c6e
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-02T15:42:58Z

Merge https://github.com/apache/flink into clustering_initializations

commit cdbb3a0801d364935d455798c695f4615ae74e76
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-02T19:49:24Z

Merge https://github.com/apache/flink into clustering_initializations

commit 7496e21462e4efc0813450971ae6cbc94d2b2c15
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-02T22:41:20Z

Initialization costs of random and kmeans++ added

commit 8033c87b71686bd3955281db12583592549406cb
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-05T21:54:10Z

Merge https://github.com/apache/flink into clustering_initializations

commit 29ed1d3fb31aa038d6ed1a5bf16d58f19565cdf8
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-05T22:52:02Z

Removed cost parameter from Algorithm itself. Leaving it to the user for 
now. Also added support for weighted input data sets

commit 5286c3c21d5019f6ba8ab67c2074570087bc1b3a
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-06T05:04:55Z

An initial draft of kmeans-par method

commit f3bfad4fc0c6576af14f1e981f8e778445856355
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-08T10:36:32Z

All three initialization schemes implemented and tested

commit 8496b8fd627ade8dbe7b92949d35d3cce704f1cc
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-08T10:36:58Z

Merge https://github.com/apache/flink into clustering_initializations

commit 3765a3e6a77a8bdbac21d03be1c43263925b1495
Author: Sachin Goel sachingoel0...@gmail.com
Date:   2015-06-30T08:57:41Z

Merge remote-tracking branch 'upstream/master' into 
clustering_initializations





[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering


[ 
https://issues.apache.org/jira/browse/FLINK-2131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608435#comment-14608435
 ] 

ASF GitHub Bot commented on FLINK-2131:
---

Github user sachingoel0101 closed the pull request at:

https://github.com/apache/flink/pull/757



[jira] [Commented] (FLINK-2131) Add Initialization schemes for K-means clustering