[GitHub] spark pull request: [SPARK-6937][MLLIB] Fixed bug in PICExample in...

2015-04-15 Thread javadba
GitHub user javadba opened a pull request:

https://github.com/apache/spark/pull/5531

[SPARK-6937][MLLIB] Fixed bug in PICExample in which the radius was not 
being accepted on c...

 Tiny bug in PowerIterationClusteringExample in which the radius was not 
accepted from the command line
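
For context, here is a minimal sketch (not the PR's actual patch) of how the
radius option would typically be wired through scopt so that it is picked up
from the command line. The Params case class with an outerRadius field is the
one defined in the example; the short flag 'r' is an assumption:

    // Hypothetical wiring; the real fix is the patch referenced below.
    opt[Double]('r', "r")
      .text(s"radius of outermost circle, default: ${defaultParams.outerRadius}")
      .action((x, c) => c.copy(outerRadius = x))  // copy into outerRadius, not numPoints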

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Huawei-Spark/spark picsub

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5531.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5531


commit 2aab8cf0dbd6db71078080bbc0491f166aecb741
Author: sboeschhuawei <stephen.boe...@huawei.com>
Date:   2015-04-15T16:56:48Z

Fixed bug in PICExample in which the radius was not being accepted on the 
command line




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24521887
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K concentric circles
+ * with a total of n sampled points (total here means across ALL of the circles).
+ * The output should be K clusters - each cluster containing precisely the points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are proportionally more points
+ *       within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example mllib.PowerIterationClusteringExample -k 3 --n 30 --numIterations 15
+ *
+ * Cluster assignments: 1 -> [0,1,2,3,4],2 -> [5,6,7,8,9,10,11,12,13,14],
+ * 0 -> [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  def main(args: Array[String]) {
+    val defaultParams = Params()
+
+    val parser = new OptionParser[Params]("PIC Circles") {
+      head("PowerIterationClusteringExample: an example PIC app using concentric circles.")
+      opt[Int]('k', "k")
+          .text(s"number of circles (/clusters), default: ${defaultParams.k}")
+          .action((x, c) => c.copy(k = x))
+      opt[Int]('n', "n")
+          .text(s"number of points, default: ${defaultParams.numPoints}")
+          .action((x, c) => c.copy(numPoints = x))
+      opt[Int]("numIterations")
+          .text(s"number of iterations, default: ${defaultParams.numIterations}")
+          .action((x, c) => c.copy(numIterations = x))
+    }
+
+    parser.parse(args, defaultParams).map { params =>
+      run(params)
+    }.getOrElse {
+      sys.exit(1)
+    }
+  }
+
+  def run(params: Params) {
+    val conf = new SparkConf()
+        .setMaster("local")
+        .setAppName(s"PowerIterationClustering with $params")
+    val sc = new SparkContext(conf)
+
+    Logger.getRootLogger.setLevel(Level.WARN)
+
+    val circlesRdd = generateCirclesRdd(sc, params.k, params.numPoints, params.outerRadius)
+    val model = new PowerIterationClustering()
+        .setK(params.k)
+        .setMaxIterations(params.numIterations)
+        .run(circlesRdd)
+
+    val clusters = model.assignments.collect.groupBy(_._2).mapValues(_.map(_._1))
+    val assignments = clusters.toList.sortBy { case (k, v) => v.length }
+    val assignmentsStr = assignments
+        .map { case (k, v) => s"$k -> ${v.sorted.mkString("[", ",", "]")}" }.mkString(",")
--- End diff --

ok
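
For readers skimming the thread, a Spark-free sketch of the assignment-formatting
logic quoted above (plain Scala collections stand in for the RDD; the sample
pairs are made up for illustration):

    // (pointId, clusterId) pairs in place of model.assignments
    val pairs = Seq((0L, 1), (1L, 1), (5L, 2), (6L, 2), (7L, 2))
    val clusters = pairs.groupBy(_._2).mapValues(_.map(_._1))
    val byClusterSize = clusters.toList.sortBy { case (_, points) => points.length }
    val assignmentsStr = byClusterSize
      .map { case (k, v) => s"$k -> ${v.sorted.mkString("[", ",", "]")}" }
      .mkString(",")
    println(s"Cluster assignments: $assignmentsStr")
    // prints: Cluster assignments: 1 -> [0,1],2 -> [5,6,7]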



[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24521948
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example mllib.PowerIterationClusteringExample
+ * -k 3 --n 30 --numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ * 0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+  .text(snumber of circles (/clusters), default: 
${defaultParams.k})
--- End diff --

FYI, I have created a YouTrack issue for IntelliJ for this formatting limitation:

https://youtrack.jetbrains.com/issue/IDEA-136404






[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24519073
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
--- End diff --

OK
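
As a quick pointer for how that parameter is consumed, this is the model setup
from the run method quoted earlier in the thread (params and circlesRdd stand
for the parsed options and the similarity RDD produced by generateCirclesRdd):

    // numIterations bounds the power iterations via setMaxIterations
    val model = new PowerIterationClustering()
      .setK(params.k)
      .setMaxIterations(params.numIterations)
      .run(circlesRdd)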





[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24519062
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
--- End diff --

oh yea.. added configuration param

  opt[Int]('r', "r")
    .text(s"radius of outermost circle, default: ${defaultParams.outerRadius}")
    .action((x, c) => c.copy(numPoints = x))






[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24518027
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example mllib.PowerIterationClusteringExample
+ * -k 3 --n 30 --numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ * 0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+  .text(snumber of circles (/clusters), default: 
${defaultParams.k})
+  .action((x, c) = c.copy(k = x))
+  opt[Int]('n', n)
+  .text(snumber of points, default: ${defaultParams.numPoints})
+  .action((x, c) = c.copy(numPoints = x))
+  opt[Int](numIterations)
+  .text(snumber of iterations, default: 
${defaultParams.numIterations})
+  .action((x, c) = c.copy(numIterations = x))
+}
+
+parser.parse(args, defaultParams).map { params =
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  def run(params: Params) {
+val conf = new SparkConf()
+.setMaster(local)
+.setAppName(sPowerIterationClustering with $params)
+val sc = new SparkContext(conf)
+
+Logger.getRootLogger.setLevel(Level.WARN)
+
+val circlesRdd = generateCirclesRdd(sc, params.k, params.numPoints, 
params.outerRadius)
+val model = new PowerIterationClustering()
+.setK(params.k)
+.setMaxIterations(params.numIterations)
+.run(circlesRdd)
+
+val clusters = 
model.assignments.collect.groupBy(_._2).mapValues(_.map(_._1))
+val assignments = clusters.toList.sortBy { case (k, v) = v.length}
+val assignmentsStr = assignments
+.map { case (k, v) = s$k - ${v.sorted.mkString([, ,, 
])}}.mkString(,)
+println(sCluster assignments: $assignmentsStr)
+
+sc.stop()
+  }
+
+  def generateCirclesRdd(sc: SparkContext,
+  nCircles: Int = 3,
+  nPoints: Int = 30,
+  outerRadius: Double): RDD[(Long, Long, Double)] = {
+
+val radii = for (cx - 0 until nCircles) yield outerRadius / 
(nCircles-cx)
+val groupSizes = for (cx - 0 until nCircles) yield (cx + 1) * nPoints
+var ix = 0
+val points = for (cx - 0 until nCircles

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24517979
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example mllib.PowerIterationClusteringExample
+ * -k 3 --n 30 --numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ * 0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+  .text(snumber of circles (/clusters), default: 
${defaultParams.k})
+  .action((x, c) = c.copy(k = x))
+  opt[Int]('n', n)
+  .text(snumber of points, default: ${defaultParams.numPoints})
+  .action((x, c) = c.copy(numPoints = x))
+  opt[Int](numIterations)
+  .text(snumber of iterations, default: 
${defaultParams.numIterations})
+  .action((x, c) = c.copy(numIterations = x))
+}
+
+parser.parse(args, defaultParams).map { params =
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  def run(params: Params) {
+val conf = new SparkConf()
+.setMaster(local)
+.setAppName(sPowerIterationClustering with $params)
+val sc = new SparkContext(conf)
+
+Logger.getRootLogger.setLevel(Level.WARN)
+
+val circlesRdd = generateCirclesRdd(sc, params.k, params.numPoints, 
params.outerRadius)
+val model = new PowerIterationClustering()
+.setK(params.k)
+.setMaxIterations(params.numIterations)
+.run(circlesRdd)
+
+val clusters = 
model.assignments.collect.groupBy(_._2).mapValues(_.map(_._1))
+val assignments = clusters.toList.sortBy { case (k, v) = v.length}
+val assignmentsStr = assignments
+.map { case (k, v) = s$k - ${v.sorted.mkString([, ,, 
])}}.mkString(,)
+println(sCluster assignments: $assignmentsStr)
+
+sc.stop()
+  }
+
+  def generateCirclesRdd(sc: SparkContext,
+  nCircles: Int = 3,
+  nPoints: Int = 30,
+  outerRadius: Double): RDD[(Long, Long, Double)] = {
+
+val radii = for (cx - 0 until nCircles) yield outerRadius / 
(nCircles-cx)
+val groupSizes = for (cx - 0 until nCircles) yield (cx + 1) * nPoints
+var ix = 0
+val points = for (cx - 0 until nCircles;
--- End

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24518800
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
--- End diff --

OK





[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24518861
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
--- End diff --

OK





[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-11 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24518525
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example mllib.PowerIterationClusteringExample
+ * -k 3 --n 30 --numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ * 0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+  .text(snumber of circles (/clusters), default: 
${defaultParams.k})
+  .action((x, c) = c.copy(k = x))
+  opt[Int]('n', n)
+  .text(snumber of points, default: ${defaultParams.numPoints})
+  .action((x, c) = c.copy(numPoints = x))
+  opt[Int](numIterations)
+  .text(snumber of iterations, default: 
${defaultParams.numIterations})
+  .action((x, c) = c.copy(numIterations = x))
+}
+
+parser.parse(args, defaultParams).map { params =
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  def run(params: Params) {
+val conf = new SparkConf()
+.setMaster(local)
+.setAppName(sPowerIterationClustering with $params)
+val sc = new SparkContext(conf)
+
+Logger.getRootLogger.setLevel(Level.WARN)
+
+val circlesRdd = generateCirclesRdd(sc, params.k, params.numPoints, 
params.outerRadius)
+val model = new PowerIterationClustering()
+.setK(params.k)
+.setMaxIterations(params.numIterations)
+.run(circlesRdd)
+
+val clusters = 
model.assignments.collect.groupBy(_._2).mapValues(_.map(_._1))
+val assignments = clusters.toList.sortBy { case (k, v) = v.length}
+val assignmentsStr = assignments
+.map { case (k, v) = s$k - ${v.sorted.mkString([, ,, 
])}}.mkString(,)
+println(sCluster assignments: $assignmentsStr)
+
+sc.stop()
+  }
+
+  def generateCirclesRdd(sc: SparkContext,
+  nCircles: Int = 3,
+  nPoints: Int = 30,
+  outerRadius: Double): RDD[(Long, Long, Double)] = {
+
+val radii = for (cx - 0 until nCircles) yield outerRadius / 
(nCircles-cx)
+val groupSizes = for (cx - 0 until nCircles) yield (cx + 1) * nPoints
+var ix = 0
+val points = for (cx - 0 until nCircles

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-10 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24471635
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example mllib.PowerIterationClusteringExample
+ * -k 3 --n 30 --numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ * 0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+  .text(snumber of circles (/clusters), default: 
${defaultParams.k})
+  .action((x, c) = c.copy(k = x))
+  opt[Int]('n', n)
+  .text(snumber of points, default: ${defaultParams.numPoints})
--- End diff --

ok





[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-10 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24471616
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example mllib.PowerIterationClusteringExample
+ * -k 3 --n 30 --numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ * 0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+  .text(snumber of circles (/clusters), default: 
${defaultParams.k})
--- End diff --

err yes.. back to the drawing board for the IJ auto formatting. 





[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-10 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24471723
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Number of sampled points on innermost circle.. There are 
proportionally more points
+ *  within the outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example mllib.PowerIterationClusteringExample
+ * -k 3 --n 30 --numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ * 0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+  .text(snumber of circles (/clusters), default: 
${defaultParams.k})
+  .action((x, c) = c.copy(k = x))
+  opt[Int]('n', n)
+  .text(snumber of points, default: ${defaultParams.numPoints})
+  .action((x, c) = c.copy(numPoints = x))
+  opt[Int](numIterations)
+  .text(snumber of iterations, default: 
${defaultParams.numIterations})
+  .action((x, c) = c.copy(numIterations = x))
+}
+
+parser.parse(args, defaultParams).map { params =
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  def run(params: Params) {
+val conf = new SparkConf()
+.setMaster(local)
+.setAppName(sPowerIterationClustering with $params)
+val sc = new SparkContext(conf)
+
+Logger.getRootLogger.setLevel(Level.WARN)
+
+val circlesRdd = generateCirclesRdd(sc, params.k, params.numPoints, 
params.outerRadius)
+val model = new PowerIterationClustering()
+.setK(params.k)
+.setMaxIterations(params.numIterations)
+.run(circlesRdd)
+
+val clusters = 
model.assignments.collect.groupBy(_._2).mapValues(_.map(_._1))
+val assignments = clusters.toList.sortBy { case (k, v) = v.length}
+val assignmentsStr = assignments
+.map { case (k, v) = s$k - ${v.sorted.mkString([, ,, 
])}}.mkString(,)
+println(sCluster assignments: $assignmentsStr)
+
+sc.stop()
+  }
+
+  def generateCirclesRdd(sc: SparkContext,
+  nCircles: Int = 3,
+  nPoints: Int = 30,
+  outerRadius: Double): RDD[(Long, Long, Double)] = {
+
+val radii = for (cx - 0 until nCircles) yield outerRadius / 
(nCircles-cx)
--- End diff --

OK
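
A quick worked example of the radii formula quoted above (nCircles = 3,
outerRadius = 3.0), just to make the progression concrete:

    val radii = for (cx <- 0 until 3) yield 3.0 / (3 - cx)
    // radii == Vector(1.0, 1.5, 3.0), innermost to outermost circle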



[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-10 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24394701
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+import scopt.OptionParser
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example 
org.apache.spark.examples.mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Total number of sampled points. There are proportionally more 
points within the
+ *  outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example 
org.apache.spark.examples.mllib.PowerIterationClusteringExample -k 3 --n 30 
--numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ *0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  case class Params(
+  input: String = null,
+  k: Int = 3,
+  numPoints: Int = 30,
+  numIterations: Int = 10,
+  outerRadius: Double = 3.0
+  ) extends AbstractParams[Params]
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+.text(snumber of circles (/clusters), default: 
${defaultParams.k})
+.action((x, c) = c.copy(k = x))
+  opt[Int]('n', n)
+.text(snumber of points, default: ${defaultParams.numPoints})
+.action((x, c) = c.copy(numPoints = x))
+  opt[Int](numIterations)
+.text(snumber of iterations, default: 
${defaultParams.numIterations})
+.action((x, c) = c.copy(numIterations = x))
+}
+
+parser.parse(args, defaultParams).map { params =
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  def generateCircle(n: Int, r: Double): Array[(Double, Double)] = {
+val pi2 = 2 * math.Pi
+(0.0 until pi2 by pi2 / n).map { x =
+  (r * math.cos(x), r * math.sin(x))
+}.toArray
+  }
+
+  def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, 
nTotalPoints: Int = 30,
+ outerRadius: Double):
+  RDD[(Long, Long, Double)] = {
+// The circles are generated as follows:
+// The Radii are equal to the largestRadius/(C - circleIndex)
+//   where  C=Number of circles
+//  and the circleIndex is 0 for the innermost and (nCircles-1) 
for the outermost circle
+// The number of points in each circle (and thus in each final 
cluster) is:
+// x, 2x, .., nCircles*x
+// Where x is found from  x = N * C(C+1)/2
+//  The # points in the LAST circle is adjusted downwards so that the 
total sum is equal
+//  to the nTotalPoints
+
+val smallestRad = math.ceil(nTotalPoints / (nCircles * (nCircles + 1) 
/ 2.0
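
The comment block quoted above describes how points are spread across the
circles; as a hedged aside (the helper name and shape are illustrative, not the
PR's final code), the described group sizes work out like this:

    // Smallest circle gets x points, the next 2x, ..., the outermost C*x,
    // with x = ceil(N / (C*(C+1)/2)) and the last group trimmed so the
    // total equals nTotalPoints.
    def groupSizes(nCircles: Int, nTotalPoints: Int): Seq[Int] = {
      val x = math.ceil(nTotalPoints / (nCircles * (nCircles + 1) / 2.0)).toInt
      val raw = (1 to nCircles).map(_ * x)
      raw.init :+ (nTotalPoints - raw.init.sum)
    }
    // groupSizes(3, 30) == Vector(5, 10, 15)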

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24392886
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+import scopt.OptionParser
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example 
org.apache.spark.examples.mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Total number of sampled points. There are proportionally more 
points within the
+ *  outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example 
org.apache.spark.examples.mllib.PowerIterationClusteringExample -k 3 --n 30 
--numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ *0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  case class Params(
+  input: String = null,
+  k: Int = 3,
+  numPoints: Int = 30,
+  numIterations: Int = 10,
+  outerRadius: Double = 3.0
+  ) extends AbstractParams[Params]
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+.text(snumber of circles (/clusters), default: 
${defaultParams.k})
+.action((x, c) = c.copy(k = x))
+  opt[Int]('n', n)
+.text(snumber of points, default: ${defaultParams.numPoints})
+.action((x, c) = c.copy(numPoints = x))
+  opt[Int](numIterations)
+.text(snumber of iterations, default: 
${defaultParams.numIterations})
+.action((x, c) = c.copy(numIterations = x))
+}
+
+parser.parse(args, defaultParams).map { params =
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  def generateCircle(n: Int, r: Double): Array[(Double, Double)] = {
+val pi2 = 2 * math.Pi
+(0.0 until pi2 by pi2 / n).map { x =
+  (r * math.cos(x), r * math.sin(x))
+}.toArray
+  }
+
+  def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, 
nTotalPoints: Int = 30,
+ outerRadius: Double):
+  RDD[(Long, Long, Double)] = {
+// The circles are generated as follows:
+// The Radii are equal to the largestRadius/(C - circleIndex)
+//   where  C=Number of circles
+//  and the circleIndex is 0 for the innermost and (nCircles-1) 
for the outermost circle
+// The number of points in each circle (and thus in each final 
cluster) is:
+// x, 2x, .., nCircles*x
+// Where x is found from  x = N * C(C+1)/2
+//  The # points in the LAST circle is adjusted downwards so that the 
total sum is equal
+//  to the nTotalPoints
+
+val smallestRad = math.ceil(nTotalPoints / (nCircles * (nCircles + 1) 
/ 2.0

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393348
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+import scopt.OptionParser
+
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example 
org.apache.spark.examples.mllib.PowerIterationClusteringExample [options]
+ *
+ * Where options include:
+ *   k:  Number of circles/ clusters
+ *   n:  Total number of sampled points. There are proportionally more 
points within the
+ *  outer/larger circles
+ *   numIterations:   Number of Power Iterations
+ *   outerRadius:  radius of the outermost of the concentric circles
+ * }}}
+ *
+ * Here is a sample run and output:
+ *
+ * ./bin/run-example 
org.apache.spark.examples.mllib.PowerIterationClusteringExample -k 3 --n 30 
--numIterations 15
+ *
+ * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14],
+ *0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
+ *
+ *
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object PowerIterationClusteringExample {
+
+  case class Params(
+  input: String = null,
+  k: Int = 3,
+  numPoints: Int = 30,
+  numIterations: Int = 10,
+  outerRadius: Double = 3.0
+  ) extends AbstractParams[Params]
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params](PIC Circles) {
+  head(PowerIterationClusteringExample: an example PIC app using 
concentric circles.)
+  opt[Int]('k', k)
+.text(snumber of circles (/clusters), default: 
${defaultParams.k})
+.action((x, c) = c.copy(k = x))
+  opt[Int]('n', n)
+.text(snumber of points, default: ${defaultParams.numPoints})
+.action((x, c) = c.copy(numPoints = x))
+  opt[Int](numIterations)
+.text(snumber of iterations, default: 
${defaultParams.numIterations})
+.action((x, c) = c.copy(numIterations = x))
+}
+
+parser.parse(args, defaultParams).map { params =
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  def generateCircle(n: Int, r: Double): Array[(Double, Double)] = {
+val pi2 = 2 * math.Pi
+(0.0 until pi2 by pi2 / n).map { x =
+  (r * math.cos(x), r * math.sin(x))
+}.toArray
+  }
+
+  def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, 
nTotalPoints: Int = 30,
--- End diff --

is there a code style xml for apache spark - since IJ sets the indentation 
to like 40





[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24394006
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+  def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, 
nTotalPoints: Int = 30,
+ outerRadius: Double):
+  RDD[(Long, Long, Double)] = {
+// The circles are generated as follows:
+// The Radii are equal to the largestRadius/(C - circleIndex)
+//   where  C=Number of circles
+//  and the circleIndex is 0 for the innermost and (nCircles-1) 
for the outermost circle
+// The number of points in each circle (and thus in each final 
cluster) is:
+// x, 2x, .., nCircles*x
+// Where x satisfies N = x * C(C+1)/2, i.e. x = N / (C(C+1)/2)
+//  The # points in the LAST circle is adjusted downwards so that the 
total sum is equal
+//  to the nTotalPoints
+
+val smallestRad = math.ceil(nTotalPoints / (nCircles * (nCircles + 1) 
/ 2.0

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393968
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393995
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24394002
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393130
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393254
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+import scopt.OptionParser
--- End diff --

I manually fixed the order. Maybe I ran optimize imports again: apparently 
the default behavior in IntelliJ is completely different from the Spark 
requirements. Will try that plugin.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393498
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+package org.apache.spark.examples.mllib
+
+import org.apache.log4j.{Level, Logger}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+import scopt.OptionParser
--- End diff --

That plugin is apparently for IntelliJ 13 and says it should not be used, so 
that is a bit confusing. Anyway, I did some more manual rearranging.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393834
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+  def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, 
nTotalPoints: Int = 30,
--- End diff --

This is a problem: it is not reasonable to fix these indentations by hand. 
I am doing it this time, but have posted a Stack Overflow question about it:


http://stackoverflow.com/questions/28426355/intellij-code-style-setting-for-wrapping-on-multi-line-function-arguments
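
For reference, the wrapping being asked for looks roughly like the sketch 
below. This is my reading of the usual Spark Scala convention (declaration 
parameters that overflow take a four-space continuation indent, the body 
stays at two spaces); the body is elided, so this is a formatting 
illustration rather than the PR's code:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WrappingSketch {
  // Formatting illustration only: overflowing parameters take a
  // four-space continuation indent; the body keeps a two-space indent.
  def generateCirclesRdd(
      sc: SparkContext,
      nCircles: Int = 3,
      nTotalPoints: Int = 30,
      outerRadius: Double): RDD[(Long, Long, Double)] = {
    ???  // body elided; only the declaration layout matters here
  }
}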



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393018
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393918
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@

[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24394635
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+  def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, 
nTotalPoints: Int = 30,
+ outerRadius: Double):
+  RDD[(Long, Long, Double)] = {
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393289
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+/**
+ * An example Power Iteration Clustering app.  Takes an input of K 
concentric circles
+ * with a total of n sampled points (total here means across ALL of the 
circles).
+ * The output should be K clusters - each cluster containing precisely the 
points associated
+ * with each of the input circles.
+ *
+ * Run with
+ * {{{
+ * ./bin/run-example 
org.apache.spark.examples.mllib.PowerIterationClusteringExample [options]
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4495#discussion_r24393317
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala
 ---
@@ -0,0 +1,176 @@
+  def generateCircle(n: Int, r: Double): Array[(Double, Double)] = {
--- End diff --

Not anymore; it has been removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...

2015-02-09 Thread javadba
GitHub user javadba opened a pull request:

https://github.com/apache/spark/pull/4495

[SPARK-5503][MLLIB] Example code for Power Iteration Clustering



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Huawei-Spark/spark picexamples

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4495.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4495


commit 5864d4ada20f7b087e56741d0f39a36844f3a073
Author: sboeschhuawei stephen.boe...@huawei.com
Date:   2015-02-03T21:11:08Z

placeholder for pic examples

commit c50913012654372da9b0b6a4d1b7e4e410f9afda
Author: sboeschhuawei stephen.boe...@huawei.com
Date:   2015-02-03T21:14:36Z

placeholder for pic examples

commit 03e8de460da07a6451bd20493dc4653c20cbe952
Author: sboeschhuawei stephen.boe...@huawei.com
Date:   2015-02-10T00:57:10Z

Added PICExample

commit efeec458cb08b39bee4f026a2e5233f3bfc2d0c2
Author: sboeschhuawei stephen.boe...@huawei.com
Date:   2015-02-10T06:24:48Z

Update to PICExample from Xiangrui's comments




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23803786
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
--- End diff --

OK. This is actually a mistake/bug: the default was intended to be 1. Will 
set to 1.
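
For readers who want to see what the scaladoc above is describing, here is a 
minimal, self-contained sketch of the power-iteration step on a dense 
affinity matrix, using plain Breeze and no Spark. The names, the L1 
normalization, and the stopping rule are illustrative assumptions on my 
part, not the implementation in this PR:

import breeze.linalg.{sum, DenseMatrix => BDM, DenseVector => BDV}

object PowerIterationSketch {
  // Repeatedly multiply a (normalized) affinity matrix W into a vector v,
  // re-normalizing at each step, until the change drops below `tol` or
  // maxIter is reached. The resulting pseudo-eigenvector is what a k-means
  // pass would then cluster.
  def pseudoEigenvector(w: BDM[Double], maxIter: Int = 20, tol: Double = 1e-11): BDV[Double] = {
    var v = BDV.fill(w.cols)(1.0 / w.cols)
    var delta = Double.MaxValue
    var iter = 0
    while (iter < maxIter && delta > tol) {
      val next = w * v
      val scaled = next / sum(next.map(math.abs))   // L1-normalize
      delta = sum((scaled - v).map(math.abs))       // how far the vector moved
      v = scaled
      iter += 1
    }
    v
  }
}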


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23808251
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala
 ---
@@ -0,0 +1,317 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx.{EdgeRDD, Edge, Graph}
+import 
org.apache.spark.mllib.clustering.PowerIterationClustering.{LabeledPoint, 
Points, IndexedVector}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.rdd.RDD
+import org.scalatest.FunSuite
+
+import scala.util.Random
+
+class PowerIterationClusteringSuite extends FunSuite with 
MLlibTestSparkContext {
+
+  val logger = Logger.getLogger(getClass.getName)
+
+  import org.apache.spark.mllib.clustering.PowerIterationClusteringSuite._
+
+  test("concentricCirclesTest") {
+concentricCirclesTest()
+  }
+
+  def concentricCirclesTest() = {
+val sigma = 1.0
+val nIterations = 10
+
+val circleSpecs = Seq(
+  // Best results for 30 points
+  CircleSpec(Point(0.0, 0.0), 0.03, 0.1, 3),
+  CircleSpec(Point(0.0, 0.0), 0.3, 0.03, 12),
+  CircleSpec(Point(0.0, 0.0), 1.0, 0.01, 15)
+  // Add following to get 100 points
+  , CircleSpec(Point(0.0, 0.0), 1.5, 0.005, 30),
+  CircleSpec(Point(0.0, 0.0), 2.0, 0.002, 40)
+)
+
+val nClusters = circleSpecs.size
+val cdata = createConcentricCirclesData(circleSpecs)
+val vertices = new Random().shuffle(cdata.map { p =>
+  (p.label, new BDV(Array(p.x, p.y)))
+})
+
+val nVertices = vertices.length
+val G = createGaussianAffinityMatrix(sc, vertices)
+val (ccenters, estCollected) = PIC.run(sc, G, nClusters, nIterations)
+logger.info(s"Cluster centers: ${ccenters.mkString(",")} " +
+  s"\nEstimates: ${estCollected.mkString("[", ",", "]")}")
+assert(ccenters.size == circleSpecs.length, "Did not get correct number of centers")
+
+  }
+
+}
+
+object PowerIterationClusteringSuite {
+  val logger = Logger.getLogger(getClass.getName)
+  val A = Array
+  val PIC = PowerIterationClustering
+
+  // Default sigma for Gaussian Distance calculations
+  val defaultSigma = 1.0
+
+  // Default minimum affinity between points - lower than this it is 
considered
+  // zero and no edge will be created
+  val defaultMinAffinity = 1e-11
+
+  def pdoub(d: Double) = f"$d%1.6f"
+
+  case class Point(label: Long, x: Double, y: Double) {
+def this(x: Double, y: Double) = this(-1L, x, y)
+
+override def toString() = s"($label, (${pdoub(x)},${pdoub(y)}))"
+  }
+
+  object Point {
+def apply(x: Double, y: Double) = new Point(-1L, x, y)
+  }
+
+  case class CircleSpec(center: Point, radius: Double, noiseToRadiusRatio: 
Double,
+nPoints: Int, uniformDistOnCircle: Boolean = true)
+
+  def createConcentricCirclesData(circleSpecs: Seq[CircleSpec]) = {
+import org.apache.spark.mllib.random.StandardNormalGenerator
+val normalGen = new StandardNormalGenerator
+var idStart = 0
+val circles = for (csp <- circleSpecs) yield {
+  idStart += 1000
+  val circlePoints = for (thetax <- 0 until csp.nPoints) yield {
+val theta = thetax * 2 * Math.PI / csp.nPoints
+val (x, y) = (csp.radius * Math.cos(theta)
+  * (1 + normalGen.nextValue * csp.noiseToRadiusRatio),
+  csp.radius * Math.sin(theta) * (1 + normalGen.nextValue * 
csp.noiseToRadiusRatio))
+(Point(idStart + thetax, x, y))
+  }
+  circlePoints
+}
+val points = circles.flatten.sortBy(_.label)
+logger.info

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23807937
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2))))
+vectRdd.cache()
+val model = KMeans.train(vectRdd.map {
+  _._2
+}, nClusters, nRuns)
+vectRdd.unpersist()
+if (logger.isDebugEnabled) {
+  logger.debug(s"Eigenvalue = $lambda EigenVector: ${localVt.mkString(",")}")
+}
+val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+if (logger.isDebugEnabled) {
+  logger.debug(s"lambda=$lambda  eigen=${localVt.mkString(",")}")
+}
+val ccs = (0 until 
model.clusterCenters.length).zip(model.clusterCenters)
+if (logger.isDebugEnabled) {
+  logger.debug(s"Kmeans model cluster centers: ${ccs.mkString(",")}")
+}
+val estCollected = estimates.collect.sortBy(_._1._1)
+if (logger.isDebugEnabled) {
+  val clusters = estCollected.map(_._2)
+  val counts = estCollected.groupBy(_._2).mapValues {
+_.length
+  }
+  logger.debug(sCluster counts: Counts: ${counts.mkString(,)}
++ s\nCluster Estimates: ${estCollected.mkString

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23807026
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
--- End diff --

Done
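
For context, a hedged sketch of how such a sparse affinity graph could be assembled with GraphX; gaussianAffinityGraph, the point encoding, and the sigma value are illustrative assumptions, not code from this PR:

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Edge, Graph}

    // Illustrative helper (not from the PR): build a Gaussian affinity graph
    // from labeled 2-D points. sigma controls the width of the Gaussian kernel.
    def gaussianAffinityGraph(sc: SparkContext,
                              points: Seq[(Long, Array[Double])],
                              sigma: Double = 1.0): Graph[Double, Double] = {
      val edges = for {
        (i, pi) <- points
        (j, pj) <- points
        if i < j                               // one edge per unordered pair
      } yield {
        val dist2 = pi.zip(pj).map { case (a, b) => (a - b) * (a - b) }.sum
        Edge(i, j, math.exp(-dist2 / (2.0 * sigma * sigma)))
      }
      Graph.fromEdges(sc.parallelize(edges), defaultValue = 0.0)
    }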


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23801499
  
--- Diff: docs/mllib-clustering-pic.md ---
@@ -0,0 +1,30 @@
+---
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23802807
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
--- End diff --

OK
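
For the "normalized" part of that sentence, a small local sketch of row-normalizing an affinity matrix so each row sums to one; this is illustrative only and not the distributed code under review:

    import breeze.linalg.DenseMatrix

    // Illustrative only: W = D^-1 * A, i.e. divide each row of the affinity
    // matrix by its row sum. A tiny epsilon guards all-zero rows, mirroring
    // the divide-by-zero default in the code above.
    def rowNormalize(a: DenseMatrix[Double], eps: Double = 1e-15): DenseMatrix[Double] = {
      val w = a.copy
      for (i <- 0 until w.rows) {
        var rowSum = 0.0
        for (j <- 0 until w.cols) rowSum += w(i, j)
        val denom = if (rowSum == 0.0) eps else rowSum
        for (j <- 0 until w.cols) w(i, j) = w(i, j) / denom
      }
      w
    }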


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23808004
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v = (v._1, 
Vectors.dense(v._2
+vectRdd.cache()
+val model = KMeans.train(vectRdd.map {
+  _._2
+}, nClusters, nRuns)
+vectRdd.unpersist()
+if (logger.isDebugEnabled) {
+  logger.debug(sEigenvalue = $lambda EigenVector: 
${localVt.mkString(,)})
+}
+val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+if (logger.isDebugEnabled) {
+  logger.debug(slambda=$lambda  eigen=${localVt.mkString(,)})
+}
+val ccs = (0 until 
model.clusterCenters.length).zip(model.clusterCenters)
+if (logger.isDebugEnabled) {
+  logger.debug(sKmeans model cluster centers: ${ccs.mkString(,)})
+}
+val estCollected = estimates.collect.sortBy(_._1._1)
+if (logger.isDebugEnabled) {
+  val clusters = estCollected.map(_._2)
+  val counts = estCollected.groupBy(_._2).mapValues {
+_.length
+  }
+  logger.debug(sCluster counts: Counts: ${counts.mkString(,)}
++ s\nCluster Estimates: ${estCollected.mkString

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23807925
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v = (v._1, 
Vectors.dense(v._2
+vectRdd.cache()
+val model = KMeans.train(vectRdd.map {
+  _._2
+}, nClusters, nRuns)
+vectRdd.unpersist()
+if (logger.isDebugEnabled) {
+  logger.debug(sEigenvalue = $lambda EigenVector: 
${localVt.mkString(,)})
+}
+val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+if (logger.isDebugEnabled) {
+  logger.debug(slambda=$lambda  eigen=${localVt.mkString(,)})
+}
+val ccs = (0 until 
model.clusterCenters.length).zip(model.clusterCenters)
+if (logger.isDebugEnabled) {
+  logger.debug(sKmeans model cluster centers: ${ccs.mkString(,)})
+}
+val estCollected = estimates.collect.sortBy(_._1._1)
+if (logger.isDebugEnabled) {
+  val clusters = estCollected.map(_._2)
+  val counts = estCollected.groupBy(_._2).mapValues {
+_.length
+  }
+  logger.debug(sCluster counts: Counts: ${counts.mkString(,)}
++ s\nCluster Estimates: ${estCollected.mkString

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23807487
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v = (v._1, 
Vectors.dense(v._2
+vectRdd.cache()
--- End diff --

True, it was sorted for another purpose; will remove it.
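
Related to the TODO in the quoted code, a hedged sketch of the direction that avoids the driver-side collect and re-parallelize, assuming vt carries one pseudo-eigenvector component per vertex id; clusterEigenComponents is an illustrative name, not code from this PR:

    import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // Illustrative: cluster the pseudo-eigenvector components while keeping
    // the data distributed; vectRdd stays cached for the training pass.
    def clusterEigenComponents(vt: RDD[(Long, Double)],
                               nClusters: Int,
                               maxIterations: Int): RDD[((Long, Vector), Int)] = {
      val vectRdd = vt.mapValues(x => Vectors.dense(x)).cache()
      val model = KMeans.train(vectRdd.values, nClusters, maxIterations)
      // predict() on each local Vector avoids zipping against a second RDD pass
      vectRdd.map { case (id, v) => ((id, v), model.predict(v)) }
    }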


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23807503
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v = (v._1, 
Vectors.dense(v._2
+vectRdd.cache()
+val model = KMeans.train(vectRdd.map {
+  _._2
+}, nClusters, nRuns)
+vectRdd.unpersist()
+if (logger.isDebugEnabled) {
+  logger.debug(sEigenvalue = $lambda EigenVector: 
${localVt.mkString(,)})
+}
+val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+if (logger.isDebugEnabled) {
+  logger.debug(slambda=$lambda  eigen=${localVt.mkString(,)})
+}
+val ccs = (0 until 
model.clusterCenters.length).zip(model.clusterCenters)
+if (logger.isDebugEnabled) {
+  logger.debug(sKmeans model cluster centers: ${ccs.mkString(,)})
+}
+val estCollected = estimates.collect.sortBy(_._1._1)
+if (logger.isDebugEnabled) {
+  val clusters = estCollected.map(_._2)
+  val counts = estCollected.groupBy(_._2).mapValues {
+_.length
+  }
+  logger.debug(sCluster counts: Counts: ${counts.mkString(,)}
++ s\nCluster Estimates: ${estCollected.mkString

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23807540
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v = (v._1, 
Vectors.dense(v._2
+vectRdd.cache()
+val model = KMeans.train(vectRdd.map {
+  _._2
+}, nClusters, nRuns)
+vectRdd.unpersist()
+if (logger.isDebugEnabled) {
+  logger.debug(sEigenvalue = $lambda EigenVector: 
${localVt.mkString(,)})
+}
+val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+if (logger.isDebugEnabled) {
+  logger.debug(slambda=$lambda  eigen=${localVt.mkString(,)})
+}
+val ccs = (0 until 
model.clusterCenters.length).zip(model.clusterCenters)
+if (logger.isDebugEnabled) {
+  logger.debug(sKmeans model cluster centers: ${ccs.mkString(,)})
+}
+val estCollected = estimates.collect.sortBy(_._1._1)
+if (logger.isDebugEnabled) {
+  val clusters = estCollected.map(_._2)
+  val counts = estCollected.groupBy(_._2).mapValues {
+_.length
+  }
+  logger.debug(sCluster counts: Counts: ${counts.mkString(,)}
++ s\nCluster Estimates: ${estCollected.mkString

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23802566
  
--- Diff: docs/mllib-clustering-pic.md ---
@@ -0,0 +1,30 @@
+---
+layout: global
+title: Clustering - MLlib
+displayTitle: a href=mllib-guide.htmlMLlib/a - Power Iteration 
Clustering
+---
+
+* Table of contents
+{:toc}
+
+
+## Power Iteration Clustering
+
+Power iteration clustering is a scalable and efficient algorithm for 
clustering points given pointwise mutual affinity values.  Internally the 
algorithm:
+
+* computes the Gaussian distance between all pairs of points and 
represents these distances in an Affinity Matrix
--- End diff --

OK
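
As background for the algorithm summary in the doc above, a small local sketch of the power-iteration step itself on a row-normalized affinity matrix W (plain Breeze; illustrative, not the distributed GraphX implementation under review):

    import breeze.linalg.{DenseMatrix, DenseVector, sum}

    // Illustrative only: repeatedly multiply by W and renormalize; the result
    // converges toward the dominant pseudo-eigenvector used for clustering.
    def powerIterate(w: DenseMatrix[Double], iterations: Int): DenseVector[Double] = {
      val n = w.rows
      var v = DenseVector.fill(n)(1.0 / n)      // start from a uniform vector
      for (_ <- 0 until iterations) {
        val next = w * v
        v = next / sum(next.map(math.abs))      // keep the vector bounded
      }
      v
    }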


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23802606
  
--- Diff: mllib/pom.xml ---
@@ -103,6 +108,13 @@
   typetest-jar/type
   scopetest/scope
 /dependency
+!--dependency
--- End diff --

Done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23807449
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
--- End diff --

Yes, got it: use rdd.sparkContext. Re: changing the input type from Graph to RDD[(Long, Long, Double)]: OK, that gives more flexibility if we choose to support different backends (e.g. not GraphX) for this implementation.
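
A hedged sketch of what that more flexible signature could look like; affinityGraphFromTriples is an illustrative name and not code from this PR:

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.rdd.RDD

    // Illustrative: accept (srcId, dstId, affinity) triples and build the GraphX
    // representation internally; the RDD's own sparkContext replaces the sc parameter.
    def affinityGraphFromTriples(similarities: RDD[(Long, Long, Double)]): Graph[Double, Double] = {
      val edges = similarities.map { case (src, dst, w) => Edge(src, dst, w) }
      Graph.fromEdges(edges, defaultValue = 0.0)
    }

A non-GraphX backend could consume the same triple RDD directly, which is the flexibility noted above.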


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23802691
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
--- End diff --

Done. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23807660
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v = (v._1, 
Vectors.dense(v._2
+vectRdd.cache()
+val model = KMeans.train(vectRdd.map {
+  _._2
+}, nClusters, nRuns)
+vectRdd.unpersist()
+if (logger.isDebugEnabled) {
+  logger.debug(sEigenvalue = $lambda EigenVector: 
${localVt.mkString(,)})
+}
+val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+if (logger.isDebugEnabled) {
+  logger.debug(slambda=$lambda  eigen=${localVt.mkString(,)})
+}
+val ccs = (0 until 
model.clusterCenters.length).zip(model.clusterCenters)
+if (logger.isDebugEnabled) {
+  logger.debug(sKmeans model cluster centers: ${ccs.mkString(,)})
+}
+val estCollected = estimates.collect.sortBy(_._1._1)
+if (logger.isDebugEnabled) {
+  val clusters = estCollected.map(_._2)
+  val counts = estCollected.groupBy(_._2).mapValues {
+_.length
+  }
+  logger.debug(sCluster counts: Counts: ${counts.mkString(,)}
++ s\nCluster Estimates: ${estCollected.mkString

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23802882
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
--- End diff --

OK working on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23802897
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23818389
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2.toArray))))
+vectRdd.cache()
+val model = KMeans.train(vectRdd.map {
+  _._2
+}, nClusters, nRuns)
+vectRdd.unpersist()
+if (logger.isDebugEnabled) {
+  logger.debug(s"Eigenvalue = $lambda EigenVector: ${localVt.mkString(",")}")
+}
+val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+if (logger.isDebugEnabled) {
+  logger.debug(s"lambda=$lambda  eigen=${localVt.mkString(",")}")
+}
+val ccs = (0 until model.clusterCenters.length).zip(model.clusterCenters)
+if (logger.isDebugEnabled) {
+  logger.debug(s"Kmeans model cluster centers: ${ccs.mkString(",")}")
+}
+val estCollected = estimates.collect.sortBy(_._1._1)
+if (logger.isDebugEnabled) {
+  val clusters = estCollected.map(_._2)
+  val counts = estCollected.groupBy(_._2).mapValues {
+    _.length
+  }
+  logger.debug(s"Cluster counts: Counts: ${counts.mkString(",")}"
+    + s"\nCluster Estimates: ${estCollected.mkString

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23819495
  
--- Diff: data/mllib/pic_data.txt ---
@@ -0,0 +1,299 @@
+1000   0.000.0001250.380.012684
0.0006380.0510910.0001510.0442080.004264
0.0006170.0077460.0365690.0018130.000305
0.0031710.0041140.0005300.0168000.003396
0.0175660.0347560.180.0510960.22
0.0017490.0002100.0060650.0069690.016719
0.0060280.0033780.0032000.0250720.000291
0.0001160.0016330.280.0113050.19
0.0103590.0065330.0475930.0274110.59
0.0175580.0005180.0009460.0442120.94
0.0054040.0267620.0099410.0038010.27
0.0001610.0009010.190.0005180.034732
0.590.0001260.0009700.0118140.005997
0.0002050.0018320.0087920.0363180.000149
0.0327810.0106920.0005300.0105570.016641
0.0081800.0016060.920.0074450.026718
0.0274570.0009570.0059010.0003140.000162
0.0008560.0047760.0081140.0036930.38
0.0249650.0442560.0071800.220.010297
0.0009940.0442550.0017250.0165410.003658
0.000288
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23811947
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  G: Graph[Double, Double],
+  nClusters: Int,
+  nIterations: Int = defaultIterations,
+  nRuns: Int = defaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+// TODO: avoid local collect and then sc.parallelize.
+val localVt = vt.collect.sortBy(_._1)
+val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2.toArray))))
+vectRdd.cache()
+val model = KMeans.train(vectRdd.map {
+  _._2
+}, nClusters, nRuns)
+vectRdd.unpersist()
+if (logger.isDebugEnabled) {
+  logger.debug(s"Eigenvalue = $lambda EigenVector: ${localVt.mkString(",")}")
+}
+val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+if (logger.isDebugEnabled) {
+  logger.debug(s"lambda=$lambda  eigen=${localVt.mkString(",")}")
+}
+val ccs = (0 until model.clusterCenters.length).zip(model.clusterCenters)
+if (logger.isDebugEnabled) {
+  logger.debug(s"Kmeans model cluster centers: ${ccs.mkString(",")}")
+}
+val estCollected = estimates.collect.sortBy(_._1._1)
+if (logger.isDebugEnabled) {
+  val clusters = estCollected.map(_._2)
+  val counts = estCollected.groupBy(_._2).mapValues {
+    _.length
+  }
+  logger.debug(s"Cluster counts: Counts: ${counts.mkString(",")}"
+    + s"\nCluster Estimates: ${estCollected.mkString

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-29 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23818325
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala
 ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param G   Affinity Matrix in a Sparse Graph structure
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
--- End diff --

OK, I did not consider THAT large a scale - where even a single row or column does not fit into a single Worker's memory.  I am looking into it now to see if there are any obvious hindrances.
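
For reference, one way to avoid ever materializing a full row on a single worker is to keep the affinity matrix purely as an edge RDD / GraphX graph. The following is only a minimal, self-contained sketch under that assumption (names such as DistributedAffinitySketch are illustrative, not code from this PR):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph}

    object DistributedAffinitySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pic-sketch").setMaster("local[2]"))

        // Hypothetical nonzero affinity entries (i, j, w); each partition holds only
        // a slice of them, so no single row of the matrix has to fit on one worker.
        val affinities = sc.parallelize(Seq((1L, 2L, 0.9), (2L, 3L, 0.4), (1L, 3L, 0.1)))

        val graph = Graph.fromEdges(affinities.map { case (i, j, w) => Edge(i, j, w) }, 0.0)

        // Per-vertex sum of outgoing edge weights via aggregateMessages - still fully distributed.
        val rowSums = graph.aggregateMessages[Double](ctx => ctx.sendToSrc(ctx.attr), _ + _)
        rowSums.collect().foreach { case (v, s) => println(s"vertex $v rowSum $s") }

        sc.stop()
      }
    }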


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/4254#issuecomment-71947549
  
I have moved out the Gaussian / Affinity matrix calculations.  It is not yet clear where their new home is, or whether they will have one. Presently the test cases rely upon them, so they now live in the testing suite.  The core PowerIterationClustering class is still working, now with a Graph input.
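
For context, the pairwise calculation in question is the Gaussian (RBF) kernel; a minimal self-contained sketch of it, with illustrative names rather than the PR's actual helpers, looks like this:

    import breeze.linalg.{DenseVector => BDV}

    object GaussianSimilaritySketch {
      // Gaussian similarity between two points for a given sigma.
      def gaussianSimilarity(x: BDV[Double], y: BDV[Double], sigma: Double): Double = {
        val diff = x - y
        val dist2 = diff dot diff                      // squared Euclidean distance
        math.exp(-dist2 / (2.0 * sigma * sigma))
      }

      def main(args: Array[String]): Unit = {
        val a = new BDV(Array(0.0, 0.0))
        val b = new BDV(Array(1.0, 1.0))
        println(gaussianSimilarity(a, b, sigma = 1.0))  // exp(-1.0), roughly 0.368
      }
    }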


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23728755
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PIClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  private[mllib] val DefaultMinNormChange: Double = 1e-11
+
+  // Default σ for Gaussian Distance calculations
+  private[mllib] val DefaultSigma = 1.0
+
+  // Default number of iterations for PIC loop
+  private[mllib] val DefaultIterations: Int = 20
+
+  // Default minimum affinity between points - lower than this it is 
considered
+  // zero and no edge will be created
+  private[mllib] val DefaultMinAffinity = 1e-11
+
+  // Do not allow divide by zero: change to this value instead
+  val DefaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val DefaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc  Spark Context
+   * @param points  Input Points in format of [(VertexId,(x,y)]
+   *where VertexId is a Long
+   * @param nClusters  Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *that calculates primary PseudoEigenvector and 
Eigenvalue
+   * @param sigma   Sigma for Gaussian distribution calculation according to
+   *[1/2 * sqrt(pi*sigma)] * exp(-(x-y)**2 / (2*sigma**2))
+   * @param minAffinity  Minimum Affinity between two Points in the input 
dataset: below
+   * this threshold the affinity will be considered 
close to zero and
+   * no Edge will be created between those Points in 
the sparse matrix
+   * @param nRuns  Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id,Cluster Center)],
+   * Seq[(VertexId, ClusterID Membership)]
+   */
+  def run(sc: SparkContext,
+  points: Points,
+  nClusters: Int,
+  nIterations: Int = DefaultIterations,
+  sigma: Double = DefaultSigma,
+  minAffinity: Double = DefaultMinAffinity,
+  nRuns: Int = DefaultKMeansRuns)
+  : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+val vidsRdd = sc.parallelize(points.map(_._1).sorted)
+val nVertices = points.length
+
+val (wRdd, rowSums) = createNormalizedAffinityMatrix(sc, points, sigma)
+val initialVt = createInitialVector(sc, points.map(_._1), rowSums)
+if (logger.isDebugEnabled) {
+  logger.debug(s"Vt(0)=${
+printVector(new BDV(initialVt.map {
+  _._2
+}.toArray

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23728797
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PIClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  private[mllib] val DefaultMinNormChange: Double = 1e-11
--- End diff --

Typical Scala style is to use CamelCase for constants. Would you please explain briefly?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/4254#issuecomment-71929175
  

RE:
   Is it possible to do Gaussian similarity in another PR? It should be 
part of the feature transformation but not within PIC. It would be easier for 
code review if the PR is minimal.

This is a big change late in the game.  We could do the following more easily: change the signature of PIC to allow the input to be the Affinity Matrix (user defined).  Then we retain the code for the Gaussian similarity calculation, which can be used to build that input.  Our tests, documentation, and verification all depend on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23728909
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala 
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.graphx._
+import org.apache.spark.{SparkConf, SparkContext}
+import org.scalatest.FunSuite
+
+import scala.util.Random
+
+class PIClusteringSuite extends FunSuite with LocalSparkContext {
+
+  val logger = Logger.getLogger(getClass.getName)
+
+  import org.apache.spark.mllib.clustering.PIClusteringSuite._
+
+  val PIC = PIClustering
+  val A = Array
+
+  test("concentricCirclesTest") {
+concentricCirclesTest()
+  }
+
+
+  def concentricCirclesTest() = {
+val sigma = 1.0
+val nIterations = 10
+
+val circleSpecs = Seq(
+  // Best results for 30 points
+  CircleSpec(Point(0.0, 0.0), 0.03, 0.1, 3),
+  CircleSpec(Point(0.0, 0.0), 0.3, 0.03, 12),
+  CircleSpec(Point(0.0, 0.0), 1.0, 0.01, 15)
+  // Add following to get 100 points
+  , CircleSpec(Point(0.0, 0.0), 1.5, 0.005, 30),
+  CircleSpec(Point(0.0, 0.0), 2.0, 0.002, 40)
+)
+
+val nClusters = circleSpecs.size
+val cdata = createConcentricCirclesData(circleSpecs)
+withSpark { sc =>
+  val vertices = new Random().shuffle(cdata.map { p =>
+(p.label, new BDV(Array(p.x, p.y)))
+  })
+
+  val nVertices = vertices.length
+  val (ccenters, estCollected) = PIC.run(sc, vertices, nClusters, nIterations)
+  logger.info(s"Cluster centers: ${ccenters.mkString(",")} " +
+s"\nEstimates: ${estCollected.mkString("[", ",", "]")}")
+  assert(ccenters.size == circleSpecs.length, "Did not get correct number of centers")
+
+}
+  }
+
+}
+
+object PIClusteringSuite {
+  val logger = Logger.getLogger(getClass.getName)
+  val A = Array
+
+  def pdoub(d: Double) = f"$d%1.6f"
+
+  case class Point(label: Long, x: Double, y: Double) {
+def this(x: Double, y: Double) = this(-1L, x, y)
+
+override def toString() = s"($label, (${pdoub(x)},${pdoub(y)}))"
+  }
+
+  object Point {
+def apply(x: Double, y: Double) = new Point(-1L, x, y)
+  }
+
+  case class CircleSpec(center: Point, radius: Double, noiseToRadiusRatio: 
Double,
+nPoints: Int, uniformDistOnCircle: Boolean = true)
+
+  def createConcentricCirclesData(circleSpecs: Seq[CircleSpec]) = {
+import org.apache.spark.mllib.random.StandardNormalGenerator
+val normalGen = new StandardNormalGenerator
+var idStart = 0
+val circles = for (csp <- circleSpecs) yield {
+  idStart += 1000
+  val circlePoints = for (thetax <- 0 until csp.nPoints) yield {
+val theta = thetax * 2 * Math.PI / csp.nPoints
+val (x, y) = (csp.radius * Math.cos(theta)
+  * (1 + normalGen.nextValue * csp.noiseToRadiusRatio),
+  csp.radius * Math.sin(theta) * (1 + normalGen.nextValue * 
csp.noiseToRadiusRatio))
+(Point(idStart + thetax, x, y))
+  }
+  circlePoints
+}
+val points = circles.flatten.sortBy(_.label)
+logger.info(printPoints(points))
+points
+  }
+
+  def printPoints(points: Seq[Point]) = {
+points.mkString("[", " , ", "]")
+  }
+
+  def main(args: Array[String]) {
+val pictest = new PIClusteringSuite
+pictest.concentricCirclesTest()
+  }
+}
+
+/**
+ * Provides a method to run tests against a {@link SparkContext} variable 
that is correctly stopped
+ * after each test.
+ * TODO: import this from the graphx test cases package i.e. may need 
update to pom.xml

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/4254#issuecomment-71932032
  
Hi,
  We have a suggestion here: separate the creation/definition of the input graph from the PIC itself:

    val G = PIC.createGaussianAffinityMatrix(sc, vertices)  // Create the affinity matrix in a Graph structure
    val (ccenters, estCollected) = PIC.run(sc, G, nClusters, nIterations)  // Run PIC

The new signatures are:

Take a Graph input instead of the input vertices:

  def run(sc: SparkContext,
          G: Graph[Double, Double],
          nClusters: Int,
          nIterations: Int = defaultIterations,
          sigma: Double = defaultSigma,
          minAffinity: Double = defaultMinAffinity,
          nRuns: Int = defaultKMeansRuns)

And here is the new method for creating the input Graph:

  def createGaussianAffinityMatrix(sc: SparkContext,
                                   points: Points,
                                   sigma: Double = defaultSigma,
                                   minAffinity: Double = defaultMinAffinity)
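
Putting the two calls together, a hypothetical end-to-end usage under the signatures above might look like the sketch below (PIClustering is the object from this PR, aliased as PIC elsewhere in the thread; this is a sketch, not tested code):

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.{SparkConf, SparkContext}

    object TwoStepPicSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pic-two-step").setMaster("local[2]"))

        // Four toy points in two well-separated groups.
        val points = Seq(
          (1L, new BDV(Array(0.0, 0.0))),
          (2L, new BDV(Array(0.1, 0.0))),
          (3L, new BDV(Array(5.0, 5.0))),
          (4L, new BDV(Array(5.1, 5.0))))

        // Step 1: build the affinity matrix as a Graph, outside of PIC itself.
        val g = PIClustering.createGaussianAffinityMatrix(sc, points, sigma = 1.0)
        // Step 2: run PIC on the user-supplied graph.
        val (centers, assignments) = PIClustering.run(sc, g, nClusters = 2, nIterations = 10)

        println(centers.mkString(", "))
        sc.stop()
      }
    }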



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23728804
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala 
---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.graphx._
+import org.apache.spark.{SparkConf, SparkContext}
+import org.scalatest.FunSuite
+
+import scala.util.Random
+
+class PIClusteringSuite extends FunSuite with LocalSparkContext {
+
+  val logger = Logger.getLogger(getClass.getName)
+
+  import org.apache.spark.mllib.clustering.PIClusteringSuite._
+
+  val PIC = PIClustering
+  val A = Array
+
+  test("concentricCirclesTest") {
+concentricCirclesTest()
+  }
+
+
+  def concentricCirclesTest() = {
+val sigma = 1.0
+val nIterations = 10
+
+val circleSpecs = Seq(
+  // Best results for 30 points
+  CircleSpec(Point(0.0, 0.0), 0.03, 0.1, 3),
+  CircleSpec(Point(0.0, 0.0), 0.3, 0.03, 12),
+  CircleSpec(Point(0.0, 0.0), 1.0, 0.01, 15)
+  // Add following to get 100 points
+  , CircleSpec(Point(0.0, 0.0), 1.5, 0.005, 30),
+  CircleSpec(Point(0.0, 0.0), 2.0, 0.002, 40)
+)
+
+val nClusters = circleSpecs.size
+val cdata = createConcentricCirclesData(circleSpecs)
+withSpark { sc =>
+  val vertices = new Random().shuffle(cdata.map { p =>
+(p.label, new BDV(Array(p.x, p.y)))
+  })
+
+  val nVertices = vertices.length
+  val (ccenters, estCollected) = PIC.run(sc, vertices, nClusters, nIterations)
+  logger.info(s"Cluster centers: ${ccenters.mkString(",")} " +
+s"\nEstimates: ${estCollected.mkString("[", ",", "]")}")
+  assert(ccenters.size == circleSpecs.length, "Did not get correct number of centers")
+
+}
+  }
+
+}
+
+object PIClusteringSuite {
+  val logger = Logger.getLogger(getClass.getName)
+  val A = Array
+
+  def pdoub(d: Double) = f"$d%1.6f"
+
+  case class Point(label: Long, x: Double, y: Double) {
+def this(x: Double, y: Double) = this(-1L, x, y)
+
+override def toString() = s"($label, (${pdoub(x)},${pdoub(y)}))"
+  }
+
+  object Point {
+def apply(x: Double, y: Double) = new Point(-1L, x, y)
+  }
+
+  case class CircleSpec(center: Point, radius: Double, noiseToRadiusRatio: 
Double,
+nPoints: Int, uniformDistOnCircle: Boolean = true)
+
+  def createConcentricCirclesData(circleSpecs: Seq[CircleSpec]) = {
+import org.apache.spark.mllib.random.StandardNormalGenerator
+val normalGen = new StandardNormalGenerator
+var idStart = 0
+val circles = for (csp <- circleSpecs) yield {
+  idStart += 1000
+  val circlePoints = for (thetax <- 0 until csp.nPoints) yield {
+val theta = thetax * 2 * Math.PI / csp.nPoints
+val (x, y) = (csp.radius * Math.cos(theta)
+  * (1 + normalGen.nextValue * csp.noiseToRadiusRatio),
+  csp.radius * Math.sin(theta) * (1 + normalGen.nextValue * 
csp.noiseToRadiusRatio))
+(Point(idStart + thetax, x, y))
+  }
+  circlePoints
+}
+val points = circles.flatten.sortBy(_.label)
+logger.info(printPoints(points))
+points
+  }
+
+  def printPoints(points: Seq[Point]) = {
+points.mkString("[", " , ", "]")
+  }
+
+  def main(args: Array[String]) {
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA

[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23728761
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration 
Clustering (see
+ * www.icml2010.org/papers/387.pdf).  From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via 
Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a 
dimensionality-reduced
+ * representation.  The resulting pseudo-eigenvector provides effective 
clustering - as
+ * performed by Parallel KMeans.
+ */
+object PIClustering {
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...

2015-01-28 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/4254#discussion_r23728918
  
--- Diff: mllib/src/test/resources/log4j.mllib.properties ---
@@ -0,0 +1,41 @@
+#
--- End diff --

OK


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2015-01-15 Thread javadba
Github user javadba closed the pull request at:

https://github.com/apache/spark/pull/1586


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2014-12-17 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-67383303
  
OK Michael thanks for the update.

2014-12-17 11:21 GMT-08:00 Michael Armbrust notificati...@github.com:

 Hey @javadba https://github.com/javadba, thanks for all the work on
 this, especially the time figuring out the surprisingly complicated
 semantics for this function. Also, sorry for the delay with 
review/merging!
 I'd love to add this, but right now I'm concerned that the way we are
 adding UDFs is unsustainable. I've written up some thoughts on the right
 way to proceed in SPARK-4867
 https://issues.apache.org/jira/browse/SPARK-4867.

 Since I'm trying really hard to keep the PR queue small (mostly to help
 avoid PRs that languish like this one has been), I propose we close this
 issue for now and reopen once the UDF framework has been updated. I've
 linked your issue to that one so you or others can use this code as a
 starting point.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/spark/pull/1586#issuecomment-67377013.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Spark-4060] [MLlib] exposing special rdd func...

2014-10-28 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/2907#issuecomment-60875650
  
RE: use case.  We are considering using the treeAggregate function within the implementation of SpectralClustering. In addition, it was noted that EigenvalueDecomposition.symmetricEigs is private: it is likely we would like to use that too.
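
As an aside, a generic treeAggregate call looks like the following; this is only a sketch of the call pattern on a plain RDD (names like TreeAggregateSketch are illustrative), not SpectralClustering code:

    import org.apache.spark.{SparkConf, SparkContext}

    object TreeAggregateSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("treeAggregate-sketch").setMaster("local[2]"))

        val values = sc.parallelize(1 to 1000000).map(_.toDouble)

        // treeAggregate merges partial results in a multi-level tree (depth 2 here),
        // so the driver does not have to combine every partition's result at once.
        val sum = values.treeAggregate(0.0)(
          seqOp = (acc, v) => acc + v,
          combOp = (a, b) => a + b,
          depth = 2)

        println(s"sum = $sum")
        sc.stop()
      }
    }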


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Merge pull request #1 from apache/master

2014-08-14 Thread javadba
GitHub user javadba opened a pull request:

https://github.com/apache/spark/pull/1954

Merge pull request #1 from apache/master

Syncing to upstream

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/javadba/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1954.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1954


commit 1b5213183b8a9d5160bbcea1c420a04c0b6f5dfa
Author: StephenBoesch java...@gmail.com
Date:   2014-08-13T01:00:25Z

Merge pull request #1 from apache/master

Syncing to upstream




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Merge pull request #1 from apache/master

2014-08-14 Thread javadba
Github user javadba closed the pull request at:

https://github.com/apache/spark/pull/1954


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Merge pull request #1 from apache/master

2014-08-14 Thread javadba
GitHub user javadba opened a pull request:

https://github.com/apache/spark/pull/1955

Merge pull request #1 from apache/master

Syncing to upstream

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/javadba/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1955.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1955


commit 1b5213183b8a9d5160bbcea1c420a04c0b6f5dfa
Author: StephenBoesch java...@gmail.com
Date:   2014-08-13T01:00:25Z

Merge pull request #1 from apache/master

Syncing to upstream




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Merge pull request #1 from apache/master

2014-08-14 Thread javadba
Github user javadba closed the pull request at:

https://github.com/apache/spark/pull/1955


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Merge pull request #1 from apache/master

2014-08-14 Thread javadba
GitHub user javadba reopened a pull request:

https://github.com/apache/spark/pull/1955

Merge pull request #1 from apache/master

Syncing to upstream

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/javadba/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1955.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1955


commit 1b5213183b8a9d5160bbcea1c420a04c0b6f5dfa
Author: StephenBoesch java...@gmail.com
Date:   2014-08-13T01:00:25Z

Merge pull request #1 from apache/master

Syncing to upstream




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Merge pull request #1 from apache/master

2014-08-14 Thread javadba
Github user javadba closed the pull request at:

https://github.com/apache/spark/pull/1955


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2014-08-08 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51614247
  
I have been waiting here for @ueshin and @marmbrus to decide on next steps.  Please be clear about what the next steps are at this point.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2014-08-08 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51640101
  
@marmbrus  I am fine with delays on this - I was just unclear as to whether there was some expectation of action on my part.  Overall this is a minor enhancement, but it has generated a non-trivial amount of interaction and effort. I can understand if this were put on hold.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: Fix tiny bug (likely copy and paste error) in ...

2014-08-05 Thread javadba
GitHub user javadba opened a pull request:

https://github.com/apache/spark/pull/1792

Fix tiny bug (likely copy and paste error) in closing jdbc connection

I inquired on the dev mailing list about the motivation for checking the jdbc statement instead of the connection in the close() logic of JdbcRDD. Ted Yu believes there essentially is none - it is a simple cut and paste issue. So here is the tiny fix to patch it.
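
For illustration only, the corrected cleanup pattern looks roughly like this; it is a standalone sketch, not the JdbcRDD source (logWarning is replaced by println so it compiles on its own):

    import java.sql.{Connection, Statement}

    object CloseQuietlySketch {
      // Each resource is guarded by its own null/closed check, so closing the
      // connection is no longer gated on the state of the statement.
      def closeQuietly(stmt: Statement, conn: Connection): Unit = {
        try {
          if (null != stmt && !stmt.isClosed()) stmt.close()
        } catch {
          case e: Exception => println(s"Exception closing statement: $e")
        }
        try {
          if (null != conn && !conn.isClosed()) conn.close()
        } catch {
          case e: Exception => println(s"Exception closing connection: $e")
        }
      }
    }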

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/javadba/spark closejdbc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1792.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1792


commit 095b2c9127b161437013476d48d855714716449f
Author: Stephen Boesch javadba
Date:   2014-08-05T19:46:55Z

Fix tiny bug (likely copy and paste error) in closing jdbc connection




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2869 - Fix tiny bug in JdbcRdd for closi...

2014-08-05 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/1792#discussion_r15849947
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
@@ -106,7 +106,7 @@ class JdbcRDD[T: ClassTag](
     case e: Exception => logWarning("Exception closing statement", e)
   }
   try {
-    if (null != conn && ! stmt.isClosed()) conn.close()
+    if (null != conn && ! conn.isClosed()) conn.close()
--- End diff --

sure - done.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2638 MapOutputTracker concurrency improv...

2014-08-04 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1542#issuecomment-51022369
  
Thanks for commenting Josh. I will see about putting together something on 
this including solid testcases.  ETA later in the coming week. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2014-08-04 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51065193
  
Hi,
  For some reason the CORE module testing has ballooned in overall testing 
time: it took over 7.5 hours to run. There was one timeout error out of 736 
tests - and it is quite unlikely to have anything to do with the code added in 
this PR.  

Here is the test that failed and then the overall results:

DriverSuite:
Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
- driver should exit after finishing *** FAILED ***
  TestFailedDueToTimeoutException was thrown during property 
evaluation. (DriverSuite.scala:40)
Message: The code passed to failAfter did not complete within 60 
seconds.
Location: (DriverSuite.scala:41)
Occurred at table row 0 (zero based, not counting headings), which 
had values (
  master = local
)

  Tests: succeeded 723, failed 1, canceled 0, ignored 7, pending 0
*** 1 TEST FAILED ***
[INFO] 

[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  
1.180 s]
[INFO] Spark Project Core . FAILURE [  
07:35 h]

So I am not presently in a position to run regression tests - given the 
overall runtime will be double-digit hours.  Would someone please run Jenkins 
on this code?  




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2014-08-04 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51146419
  
@ueshin  I repeatably verified that simply changing OCTET_LEN to OCTET_LENGTH ended up causing the SOF.  By repeatably I mean:

  Set the 'constant'  val OCTET_LENGTH = "OCTET_LENGTH"
  observe the error
  change to something like val OCTET_LENGTH = "OCTET_LEN" or val OCTET_LENGTH = "OCTET_LENG"
  observe the error has gone away
  Rinse, cleanse, repeat

I have been able to demonstrate this multiple times.  Now the regression tests have been run against the modified and reliable code.

Please re-run your tests in a fresh area.  I will do the same, but I am hesitant to consider reverting because we have positive test results now with the latest commit (as well as my results of the problem before the commit). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2014-08-04 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51149151
  
@ueshin  

I have git clone'd to a completely new area, and I reverted my last commit. 
 

git clone https://github.com/javadba/spark.git strlen2 
cd strlen2
git checkout strlen
git revert 22eddbce6a201c8f5b5c31859ceb972e60657377
 mvn -DskipTests  -Pyarn -Phive -Phadoop-2.3 clean compile package
 mvn  -Pyarn -Phive -Phadoop-2.3 test 
-DwildcardSuites=org.apache.spark.sql.hive.execution.HiveQuerySuite,org.apache.spark.sql.SQLQuerySuite,org.apache.spark.sql.catalyst.expressions.ExpressionEvaluationSuite

I get precisely the same error:

HiveQuerySuite:
21:03:31.120 WARN org.apache.spark.util.Utils: Your hostname, mithril 
resolves to a loopback address: 127.0.1.1; using 10.0.0.33 instead (on 
interface eth0)
21:03:31.121 WARN org.apache.spark.util.Utils: Set SPARK_LOCAL_IP if 
you need to bind to another address
21:03:37.294 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
21:03:40.045 WARN com.jolbox.bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
21:03:49.464 WARN com.jolbox.bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
21:03:49.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.12.0
21:03:57.157 WARN com.jolbox.bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
21:03:57.593 WARN com.jolbox.bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
- single case
- double case
- case else null
- having no references
- boolean = number
- CREATE TABLE AS runs once
- between
- div
- division
*** RUN ABORTED ***
  java.lang.StackOverflowError:
  at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)

Now, let's revert the revert :
git log 
commit db09cd132c2d7e995287eea54f3415726934138c 
Author: Stephen Boesch javadba 
Date:   Mon Aug 4 20:54:24 2014 -0700

  Revert "Use Octet/Char_Len instead of Octet/Char_length due to apparent preexisting spark ParserCombinator bug."

This reverts commit 22eddbce6a201c8f5b5c31859ceb972e60657377.
git revert db09cd132c2d7e995287eea54f3415726934138c
mvn  -Pyarn -Phive -Phadoop-2.3 test 
-DwildcardSuites=org.apache.spark.sql.hive.execution.HiveQuerySuite,org.apache.spark.sql.SQLQuerySuite,org.apache.spark.sql.catalyst.expressions.ExpressionEvaluationSuite


Now those three test sutes pass again (specifically HiveQuerySuite did not 
fail)

And .. just to be *extra* sure here- that we can toggle between pass/fail 
arbitrary # of times: 

commit 602adedc9ca58d99957eb12bd91098ffe904604c
Author: Stephen Boesch javadba
Date:   Mon Aug 4 21:18:53 2014 -0700

    Revert "Revert "Use Octet/Char_Len instead of Octet/Char_length due to apparent preexisting spark ParserCombinator bug.""

git revert 602adedc9ca58d99957eb12bd91098ffe904604c

And once again HiveQuerySuite fails with the same error.

So I have established clearly the following: the strlen branch on my fork fails with the SOF if we roll back the commit that changes OCTET/CHAR_LENGTH -> OCTET/CHAR_LEN. 






---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr

[GitHub] spark pull request: SPARK-2712 - Add a small note to maven doc tha...

2014-08-03 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1615#issuecomment-50994967
  
Apologies for the long delay - it was induced by a moderately arduous 
process of learning git workflows/rebase-ing.   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-03 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51016510
  
The following division test in HiveQueryTest is failing with 
StackOverflowError. I have no idea why given there is no obvious connection to 
the added code in this PR:

  createQueryTest("division",
    "SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1")


 - division
 *** RUN ABORTED ***
   java.lang.StackOverflowError:
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
   ...






[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-03 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51016969
  
I have narrowed the problem down to the SQLParser. I will update when the 
precise cause is determined, likely within the hour.





[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-03 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51017366
  
This is strange. The change that causes the StackOverflowError is just renaming 
OCTET/CHAR_LEN to OCTET/CHAR_LENGTH.

Working:

    protected val CHAR_LEN = Keyword("CHAR_LEN")
    protected val OCTET_LEN = Keyword("OCTET_LEN")

StackOverflowError:

    protected val CHAR_LENGTH = Keyword("CHAR_LENGTH")
    protected val OCTET_LENGTH = Keyword("OCTET_LENGTH")






[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-03 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51018296
  
Surprising result here:  the following change makes this work:

StackOverflowError:

  protected val OCTET_LENGTH = Keyword("OCTET_LENGTH")

Works fine:

  protected val OCTET_LENGTH = Keyword("OCTET_LEN")

Also works fine:

  protected val OCTET_LENGTH = Keyword("OCTET_LENG")

Let's double-check to make sure this really and repeatably fails on 
OCTET_LENGTH:

And yes! It does fail again with OCTET_LENGTH. We have a clear, reproducible 
test-failure scenario.

So here we are: we need to use OCTET/CHAR_LEN and NOT OCTET/CHAR_LENGTH until 
the root cause of this unrelated parser bug is found.

Should I open a separate JIRA for the parser bug?

BTW, my theory is that something happens when one KEYWORD contains another 
KEYWORD. But on the other hand, the LEN keyword is not causing an issue, so 
this is a subtle case to understand.
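
For reference, here is a minimal, self-contained parser-combinator sketch of the 
keyword-prefix shape under discussion (hypothetical names; this standalone 
grammar resolves the overlap at the lexer and does not reproduce the stack 
overflow seen in the Spark SqlParser):

    import scala.util.parsing.combinator.syntactical.StandardTokenParsers

    // Two reserved words where one is a prefix of the other, as with CHAR_LEN / CHAR_LENGTH.
    object KeywordPrefixDemo extends StandardTokenParsers {
      lexical.reserved ++= Seq("CHAR_LEN", "CHAR_LENGTH")
      lexical.delimiters ++= Seq("(", ")")

      // Accept either keyword followed by a parenthesized identifier.
      def call: Parser[String] =
        ("CHAR_LEN" | "CHAR_LENGTH") ~ ("(" ~> ident <~ ")") ^^ { case kw ~ col => s"$kw($col)" }

      def main(args: Array[String]): Unit = {
        // The lexer tokenizes the whole word, so the longer keyword still matches here.
        println(phrase(call)(new lexical.Scanner("CHAR_LENGTH(name)")))
      }
    }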
 







[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...

2014-08-03 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-51020150
  
The rename change was committed and pushed, and the most germane tests pass. I 
am re-running the full regression. One thing I have noticed already: the 
flume-sink external project is failing - it looks unrelated to any of my work, 
but I am looking into it.





[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50856746
  
@ueshin I mostly agree, except:

Let us keep length, which can be used for non-strings, e.g. 
length(12345678) = 8.

Then, since length handles strings as well by using code points as you suggest, 
CharLength is a synonym for Length (not the other way around).

As for OctetLength: I am quite fine with that and will implement it according 
to your suggestion.
 




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50858453
  
@marmbrus OK, fine with that.

Then, given the inputs from ueshin, we are presently at:

len(gth)/char_length: takes a single string argument and uses codePointCount

octet_length: takes two arguments, (string, charset), and returns the number of 
bytes
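
As a quick illustration of that two-argument shape, a hedged sketch in plain 
Scala (hypothetical helper name, not the PR's actual expression classes):

    import java.nio.charset.Charset

    // Byte count of a string under an explicitly supplied charset.
    def octetLength(s: String, charset: String): Int =
      s.getBytes(Charset.forName(charset)).length

    octetLength("Facebook", "UTF-8")      // 8
    octetLength("\uF93D\uF936", "UTF-8")  // 6: two BMP CJK compatibility characters, 3 bytes each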





[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50858624
  
@marmbrus RE: CharLength for the expression - also fine, will do. (BTW, how did 
you highlight in the comment?)




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50923836
  
I am delayed in providing the next implementation, due to continuing 
investigation here.

1) The default encoding seems to have changed from JDK 6 (ISO-8859-1) to JDK 7 
(UTF-8?).

Here the getBytes method returns substantially different results depending on 
which JDK is used:

    JDK 6:  "\u2345".getBytes.length   // 3
    JDK 7:  "\u2345".getBytes.length   // 1
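
A hedged sketch of the difference: the explicit-charset calls are deterministic, 
while the no-argument call depends on the platform/JDK default charset:

    import java.nio.charset.Charset

    val s = "\u2345"
    println(Charset.defaultCharset())                          // whatever the platform/JDK defaults to
    println(s.getBytes.length)                                 // varies with the default charset
    println(s.getBytes(Charset.forName("UTF-8")).length)       // 3: U+2345 takes 3 bytes in UTF-8
    println(s.getBytes(Charset.forName("ISO-8859-1")).length)  // 1: unmappable, encoded as a single '?'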


2) I have been absorbing the Hive binary/string implementations. The logic is 
short and easy to follow, but I am working through how to properly test this. 
Based on Takuya's (correct) comment about Hive length support for both binary 
and string, let us take a look at it first:

From Hive's o.a.h.h.ql.udf.UDFLength:

@Description(name = "length",
    value = "_FUNC_(str | binary) - Returns the length of str or number "
        + "of bytes in binary data",
    extended = "Example:\n"
        + "  > SELECT _FUNC_('Facebook') FROM src LIMIT 1;\n" + "  8")
@VectorizedExpressions({StringLength.class})
public class UDFLength extends UDF {
  private final IntWritable result = new IntWritable();

  public IntWritable evaluate(Text s) {
    if (s == null) {
      return null;
    }

    byte[] data = s.getBytes();
    int len = 0;
    for (int i = 0; i < s.getLength(); i++) {
      if (GenericUDFUtils.isUtfStartByte(data[i])) {
        len++;
      }
    }

    result.set(len);
    return result;
  }

  public IntWritable evaluate(BytesWritable bw) {
    if (bw == null) {
      return null;
    }
    result.set(bw.getLength());
    return result;
  }
}

So in Hive the invocations of length would be:

String: select length(my_string) from some_table;
binary: select length(cast(my_string as binary)) from some_table;
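
The counting idea in that UDF (increment only on UTF-8 lead bytes) can be 
sketched in Scala for illustration (hypothetical helper name, not code from 
this PR):

    // A byte is a UTF-8 continuation byte iff its top two bits are 10xxxxxx;
    // counting the non-continuation bytes yields the number of encoded code points.
    def utf8CharCount(bytes: Array[Byte]): Int =
      bytes.count(b => (b & 0xC0) != 0x80)

    utf8CharCount("Facebook".getBytes("UTF-8"))      // 8
    utf8CharCount("\uF93D\uF936".getBytes("UTF-8"))  // 2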

As noted above, the result will likely not be consistent across all instances 
of Hive: in particular, Hive on JDK 6 should give a different answer (I only 
have JDK 7 in my testing environment). I am still pondering how to handle these 
differences.






[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50936508
  
Hi,
  I have updated the codebase to match our present level of discussion and to 
include the merge from upstream.

Takuya's test cases have been incorporated as well, but I found some 
differences in my results. @ueshin please review the following, some of which 
differ from your results presented above, especially for UTF-16:

checkEvaluation(OctetLength(
  Literal("\uF93D\uF936\uF949\uF942", StringType), "ISO-8859-1"), 4)
  // Chinese characters get truncated by ISO-8859-1 encoding
checkEvaluation(OctetLength(
  Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-8"), 12) // Chinese characters
checkEvaluation(OctetLength(
  Literal("\uD840\uDC0B\uD842\uDFB7", StringType), "UTF-8"), 8) // 2 surrogate pairs
checkEvaluation(OctetLength(
  Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-16"), 10) // Chinese characters
checkEvaluation(OctetLength(
  Literal("\uD840\uDC0B\uD842\uDFB7", StringType), "UTF-16"), 10) // 2 surrogate pairs
checkEvaluation(OctetLength(
  Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-32"), 16) // Chinese characters
checkEvaluation(OctetLength(
  Literal("\uD840\uDC0B\uD842\uDFB7", StringType), "UTF-32"), 8) // 2 surrogate pairs
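
For context on the UTF-16 numbers above, a hedged sketch (Java's "UTF-16" 
encoder prepends a 2-byte byte-order mark, while the endian-specific charsets 
do not):

    val pairs = "\uD840\uDC0B\uD842\uDFB7"  // two surrogate pairs (two supplementary code points)
    pairs.getBytes("UTF-8").length          // 8: each supplementary code point takes 4 bytes
    pairs.getBytes("UTF-16").length         // 10: 8 bytes of data plus the 2-byte BOM
    pairs.getBytes("UTF-16BE").length       // 8: explicit byte order, no BOM written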





[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50936636
  
Grrr... my push picked up some other extraneous changes of mine. I will fix 
that now.




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50946925
  
Cleaned up now. Re-ran the affected tests - and now re-running all tests.




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50947836
  
@ueshin Actually, after re-checking, it appears that my results match the ones 
you placed in your comment.




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/1586#discussion_r15726383
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -208,6 +211,96 @@ case class EndsWith(left: Expression, right: 
Expression)
   def compare(l: String, r: String) = l.endsWith(r)
 }
 
+
+/**
+ * A function that returns the number of bytes in an expression
+ */
+case class Length(child: Expression) extends UnaryExpression {
+
+  type EvaluatedType = Any
+
+  override def dataType = IntegerType
+
+  override def foldable = child.foldable
+
+  override def nullable = child.nullable
+
+  override def toString = s"Length($child)"
+
+  override def eval(input: Row): EvaluatedType = {
+val inputVal = child.eval(input)
+if (inputVal == null) {
+  null
+} else if (!inputVal.isInstanceOf[String]) {
+  inputVal.toString.length
+} else {
+  OctetLenUtils.len(inputVal.asInstanceOf[String])
--- End diff --

How can we use just String.codePointCount? You need to provide the start/end 
indices, and we do not know them without doing this kind of count.




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/1586#discussion_r15726388
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -208,6 +211,96 @@ case class EndsWith(left: Expression, right: 
Expression)
   def compare(l: String, r: String) = l.endsWith(r)
 }
 
+
+/**
+ * A function that returns the number of bytes in an expression
+ */
+case class Length(child: Expression) extends UnaryExpression {
+
+  type EvaluatedType = Any
+
+  override def dataType = IntegerType
+
+  override def foldable = child.foldable
+
+  override def nullable = child.nullable
+
+  override def toString = s"Length($child)"
+
+  override def eval(input: Row): EvaluatedType = {
+val inputVal = child.eval(input)
+if (inputVal == null) {
+  null
+} else if (!inputVal.isInstanceOf[String]) {
+  inputVal.toString.length
+} else {
+  OctetLenUtils.len(inputVal.asInstanceOf[String])
--- End diff --

I do agree that String#getBytes will return different results based on, for 
example, JDK 6 vs JDK 7. Do you have a recommendation on this?




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/1586#discussion_r15726391
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -208,6 +211,96 @@ case class EndsWith(left: Expression, right: 
Expression)
   def compare(l: String, r: String) = l.endsWith(r)
 }
 
+
+/**
+ * A function that returns the number of bytes in an expression
+ */
+case class Length(child: Expression) extends UnaryExpression {
+
+  type EvaluatedType = Any
+
+  override def dataType = IntegerType
+
+  override def foldable = child.foldable
+
+  override def nullable = child.nullable
+
+  override def toString = s"Length($child)"
+
+  override def eval(input: Row): EvaluatedType = {
+val inputVal = child.eval(input)
+if (inputVal == null) {
+  null
+} else if (!inputVal.isInstanceOf[String]) {
+  inputVal.toString.length
+} else {
+  OctetLenUtils.len(inputVal.asInstanceOf[String])
+}
+  }
+
+}
+
+object OctetLengthConstants {
+  val DefaultEncoding = "ISO-8859-1"
--- End diff --

OK, maybe that makes sense. I will change it, and let's continue to consider 
it.




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50951491
  
BTW, I am seeing test errors in core for MapOutputTracker involving an OOME. I 
had to restart testing a few times already. This has nothing to do with the 
code of this PR; I am just running overall regression testing but failing in 
unrelated tests. Oh, it just failed again:



ClosureCleanerSuite:
- closures inside an object *** FAILED ***
  org.apache.spark.SparkException: Error communicating with 
MapOutputTracker
  at 
org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:111)
  at 
org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:117)
  at 
org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:324)
  at org.apache.spark.SparkEnv.stop(SparkEnv.scala:83)
  at org.apache.spark.SparkContext.stop(SparkContext.scala:1003)
  at 
org.apache.spark.LocalSparkContext$.stop(LocalSparkContext.scala:50)
  at 
org.apache.spark.LocalSparkContext$.withSpark(LocalSparkContext.scala:61)
  at org.apache.spark.util.TestObject$.run(ClosureCleanerSuite.scala:76)
  at 
org.apache.spark.util.ClosureCleanerSuite$$anonfun$1.apply$mcV$sp(ClosureCleanerSuite.scala:27)
  at 
org.apache.spark.util.ClosureCleanerSuite$$anonfun$1.apply(ClosureCleanerSuite.scala:27)
  ...
  Cause: akka.pattern.AskTimeoutException: 
Recipient[Actor[akka://spark/user/MapOutputTracker#221210107]] had already been 
terminated.
  at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
  at 
org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:106)
  at 
org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:117)
  at 
org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:324)
  at org.apache.spark.SparkEnv.stop(SparkEnv.scala:83)
  at org.apache.spark.SparkContext.stop(SparkContext.scala:1003)
  at 
org.apache.spark.LocalSparkContext$.stop(LocalSparkContext.scala:50)
  at 
org.apache.spark.LocalSparkContext$.withSpark(LocalSparkContext.scala:61)
  at org.apache.spark.util.TestObject$.run(ClosureCleanerSuite.scala:76)
  at 
org.apache.spark.util.ClosureCleanerSuite$$anonfun$1.apply$mcV$sp(ClosureCleanerSuite.scala:27)
  ...
*** RUN ABORTED ***
  java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:679)
  at 
org.eclipse.jetty.util.thread.QueuedThreadPool.startThread(QueuedThreadPool.java:441)
  at 
org.eclipse.jetty.util.thread.QueuedThreadPool.doStart(QueuedThreadPool.java:108)
  at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
  at 
org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:81)
  at 
org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:58)
  at 
org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:96)
  at org.eclipse.jetty.server.Server.doStart(Server.java:282)
  at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
  ...





[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50951617
  
Ran the tests again and got the same "Error communicating with 
MapOutputTracker", but in a different test this time:

ImplicitOrderingSuite:
- basic inference of Orderings
*** RUN ABORTED ***
 org.apache.spark.SparkException: Error communicating with 
MapOutputTracker
 ..




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on the pull request:

https://github.com/apache/spark/pull/1586#issuecomment-50951779
  
I ran it another time and got an OOME this time.


ImplicitOrderingSuite:
*** RUN ABORTED ***
  java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:679)
  at 
org.eclipse.jetty.util.thread.QueuedThreadPool.startThread(QueuedThreadPool.java:441)
  at 
org.eclipse.jetty.util.thread.QueuedThreadPool.doStart(QueuedThreadPool.java:108)
  at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
  at 
org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:81)
  at 
org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:58)
  at 
org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:96)
  at org.eclipse.jetty.server.Server.doStart(Server.java:282)
  at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)

I do not understand. You can see the file diffs for this PR: the modified 
logic has nothing to do with these core operations - only the following modules 
should be affected:
sql/catalyst
sql/core
sql/hive
sql/hive/compatibility

But these errors are happening in:
core





[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/1586#discussion_r15726702
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -208,6 +211,96 @@ case class EndsWith(left: Expression, right: 
Expression)
   def compare(l: String, r: String) = l.endsWith(r)
 }
 
+
+/**
+ * A function that returns the number of bytes in an expression
+ */
+case class Length(child: Expression) extends UnaryExpression {
+
+  type EvaluatedType = Any
+
+  override def dataType = IntegerType
+
+  override def foldable = child.foldable
+
+  override def nullable = child.nullable
+
+  override def toString = s"Length($child)"
+
+  override def eval(input: Row): EvaluatedType = {
+val inputVal = child.eval(input)
+if (inputVal == null) {
+  null
+} else if (!inputVal.isInstanceOf[String]) {
+  inputVal.toString.length
+} else {
+  OctetLenUtils.len(inputVal.asInstanceOf[String])
--- End diff --

OK I will try this.




[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...

2014-08-01 Thread javadba
Github user javadba commented on a diff in the pull request:

https://github.com/apache/spark/pull/1586#discussion_r15726780
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
 ---
@@ -208,6 +211,96 @@ case class EndsWith(left: Expression, right: 
Expression)
   def compare(l: String, r: String) = l.endsWith(r)
 }
 
+
+/**
+ * A function that returns the number of bytes in an expression
+ */
+case class Length(child: Expression) extends UnaryExpression {
+
+  type EvaluatedType = Any
+
+  override def dataType = IntegerType
+
+  override def foldable = child.foldable
+
+  override def nullable = child.nullable
+
+  override def toString = s"Length($child)"
+
+  override def eval(input: Row): EvaluatedType = {
+val inputVal = child.eval(input)
+if (inputVal == null) {
+  null
+} else if (!inputVal.isInstanceOf[String]) {
+  inputVal.toString.length
+} else {
+  OctetLenUtils.len(inputVal.asInstanceOf[String])
--- End diff --

Looks like it's working. I am pushing the change. In the meanwhile I will rerun 
all the regression tests.
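
For the record, a minimal sketch of the codePointCount approach (hypothetical 
helper, not the PR's final code), passing the full string range as the 
start/end indices:

    // Character (code point) length: surrogate pairs count once, unlike String#length.
    def charLength(s: String): Int = s.codePointCount(0, s.length)

    charLength("abc")                       // 3
    charLength("\uD840\uDC0B\uD842\uDFB7")  // 2: two surrogate pairs -> two code points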




  1   2   >