[GitHub] spark pull request: [SPARK-6937][MLLIB] Fixed bug in PICExample in...

GitHub user javadba opened a pull request:

    https://github.com/apache/spark/pull/5531

    [SPARK-6937][MLLIB] Fixed bug in PICExample in which the radius was not being accepted on c...

    Tiny bug in PowerIterationClusteringExample: the radius was not accepted from the command line.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Huawei-Spark/spark picsub

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5531.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #5531

commit 2aab8cf0dbd6db71078080bbc0491f166aecb741
Author: sboeschhuawei <stephen.boe...@huawei.com>
Date:   2015-04-15T16:56:48Z

    Fixed bug in PICExample in which the radius was not being accepted on command line

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24521887

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    @@ -0,0 +1,149 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.examples.mllib
    +
    +import org.apache.log4j.{Level, Logger}
    +import scopt.OptionParser
    +
    +import org.apache.spark.mllib.clustering.PowerIterationClustering
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.{SparkConf, SparkContext}
    +
    +/**
    + * An example Power Iteration Clustering app. Takes an input of K concentric circles
    + * with a total of n sampled points (total here means across ALL of the circles).
    + * The output should be K clusters - each cluster containing precisely the points associated
    + * with each of the input circles.
    + *
    + * Run with
    + * {{{
    + * ./bin/run-example mllib.PowerIterationClusteringExample [options]
    + *
    + * Where options include:
    + *   k:  Number of circles/clusters
    + *   n:  Number of sampled points on innermost circle. There are proportionally more points
    + *       within the outer/larger circles
    + *   numIterations:  Number of Power Iterations
    + *   outerRadius:  radius of the outermost of the concentric circles
    + * }}}
    + *
    + * Here is a sample run and output:
    + *
    + * ./bin/run-example mllib.PowerIterationClusteringExample -k 3 --n 30 --numIterations 15
    + *
    + * Cluster assignments: 1 -> [0,1,2,3,4], 2 -> [5,6,7,8,9,10,11,12,13,14],
    + *   0 -> [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29]
    + *
    + * If you use it as a template to create your own app, please use `spark-submit` to submit your app.
    + */
    +object PowerIterationClusteringExample {
    +
    +  def main(args: Array[String]) {
    +    val defaultParams = Params()
    +
    +    val parser = new OptionParser[Params]("PIC Circles") {
    +      head("PowerIterationClusteringExample: an example PIC app using concentric circles.")
    +      opt[Int]('k', "k")
    +        .text(s"number of circles (/clusters), default: ${defaultParams.k}")
    +        .action((x, c) => c.copy(k = x))
    +      opt[Int]('n', "n")
    +        .text(s"number of points, default: ${defaultParams.numPoints}")
    +        .action((x, c) => c.copy(numPoints = x))
    +      opt[Int]("numIterations")
    +        .text(s"number of iterations, default: ${defaultParams.numIterations}")
    +        .action((x, c) => c.copy(numIterations = x))
    +    }
    +
    +    parser.parse(args, defaultParams).map { params =>
    +      run(params)
    +    }.getOrElse {
    +      sys.exit(1)
    +    }
    +  }
    +
    +  def run(params: Params) {
    +    val conf = new SparkConf()
    +      .setMaster("local")
    +      .setAppName(s"PowerIterationClustering with $params")
    +    val sc = new SparkContext(conf)
    +
    +    Logger.getRootLogger.setLevel(Level.WARN)
    +
    +    val circlesRdd = generateCirclesRdd(sc, params.k, params.numPoints, params.outerRadius)
    +    val model = new PowerIterationClustering()
    +      .setK(params.k)
    +      .setMaxIterations(params.numIterations)
    +      .run(circlesRdd)
    +
    +    val clusters = model.assignments.collect.groupBy(_._2).mapValues(_.map(_._1))
    +    val assignments = clusters.toList.sortBy { case (k, v) => v.length }
    +    val assignmentsStr = assignments
    +      .map { case (k, v) => s"$k -> ${v.sorted.mkString("[", ",", "]")}" }.mkString(",")

    --- End diff --

    ok
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24521948

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    +      opt[Int]('k', "k")
    +        .text(s"number of circles (/clusters), default: ${defaultParams.k}")

    --- End diff --

    fyi I have created a Youtrack issue for Intellij for this formatting limitation:
    https://youtrack.jetbrains.com/issue/IDEA-136404
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24519073

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    + *   n:  Number of sampled points on innermost circle. There are proportionally more points
    + *       within the outer/larger circles
    + *   numIterations:  Number of Power Iterations

    --- End diff --

    OK
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24519062

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    + *   outerRadius:  radius of the outermost of the concentric circles

    --- End diff --

    oh yea.. added configuration param

        opt[Int]('r', "r")
          .text(s"radius of outermost circle, default: ${defaultParams.outerRadius}")
          .action((x, c) => c.copy(numPoints = x))
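Note that the option pasted in the reply above writes the parsed radius into `numPoints`, not `outerRadius`; that slip is the bug SPARK-6937 (PR #5531, the first message in this digest) later fixed. A minimal sketch of what the corrected option would look like, assuming a `Params` case class with the fields referenced elsewhere in this thread (the default values are illustrative, and the radius is read as a `Double` since `outerRadius` is used as one in `generateCirclesRdd`):

```scala
import scopt.OptionParser

// Hypothetical Params: field names come from the quoted example code,
// but the default values here are assumptions.
case class Params(
    k: Int = 3,
    numPoints: Int = 30,
    numIterations: Int = 15,
    outerRadius: Double = 3.0)

object RadiusOptionSketch {
  val defaultParams = Params()

  val parser = new OptionParser[Params]("PIC Circles") {
    opt[Double]('r', "r")
      .text(s"radius of outermost circle, default: ${defaultParams.outerRadius}")
      // The fix: copy the parsed value into outerRadius, not numPoints.
      .action((x, c) => c.copy(outerRadius = x))
  }
}
```

With this action, `parser.parse(Seq("-r", "4.0"), defaultParams)` would yield a `Params` whose `outerRadius` is 4.0 while `numPoints` keeps its default.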
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24518027

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    +  def generateCirclesRdd(sc: SparkContext,
    +      nCircles: Int = 3,
    +      nPoints: Int = 30,
    +      outerRadius: Double): RDD[(Long, Long, Double)] = {
    +
    +    val radii = for (cx <- 0 until nCircles) yield outerRadius / (nCircles - cx)
    +    val groupSizes = for (cx <- 0 until nCircles) yield (cx + 1) * nPoints
    +    var ix = 0
    +    val points = for (cx <- 0 until nCircles
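The `radii` and `groupSizes` comprehensions quoted above give circle `cx` a radius of `outerRadius / (nCircles - cx)` and `(cx + 1) * nPoints` sample points. A Spark-free sketch of that geometry; only those two formulas come from the diff, while the function name and the even angular spacing of points are illustrative assumptions:

```scala
// Sketch (assumptions noted above): sample points on nCircles concentric
// circles, where circle cx has radius outerRadius / (nCircles - cx) and
// (cx + 1) * nPoints evenly spaced points.
object CircleGeometrySketch {
  def circlePoints(nCircles: Int, nPoints: Int, outerRadius: Double): Seq[(Double, Double)] =
    for {
      cx <- 0 until nCircles
      radius = outerRadius / (nCircles - cx)  // outermost circle gets outerRadius
      groupSize = (cx + 1) * nPoints          // innermost circle gets nPoints
      px <- 0 until groupSize
      theta = 2.0 * math.Pi * px / groupSize  // even spacing is an assumption
    } yield (radius * math.cos(theta), radius * math.sin(theta))
}
```

Under these formulas `circlePoints(3, 30, 3.0)` yields 30 + 60 + 90 = 180 points, which is what the scaladoc means by "proportionally more points within the outer/larger circles".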
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24517979

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    +  def generateCirclesRdd(sc: SparkContext,
    +      nCircles: Int = 3,
    +      nPoints: Int = 30,
    +      outerRadius: Double): RDD[(Long, Long, Double)] = {
    +
    +    val radii = for (cx <- 0 until nCircles) yield outerRadius / (nCircles - cx)
    +    val groupSizes = for (cx <- 0 until nCircles) yield (cx + 1) * nPoints
    +    var ix = 0
    +    val points = for (cx <- 0 until nCircles;
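The tail of `run()` quoted earlier in this thread inverts the `(point, cluster)` assignment pairs into per-cluster member lists and formats them for printing. The same logic can be sketched on plain Scala collections, without Spark; the sample pairs below are made up:

```scala
// Sketch of the assignment-formatting logic from run(), using a plain Seq of
// (pointId, clusterId) pairs in place of model.assignments (data is made up).
object AssignmentFormattingSketch {
  def main(args: Array[String]): Unit = {
    val pairs: Seq[(Long, Int)] = Seq((0L, 1), (1L, 1), (2L, 0), (3L, 0), (4L, 0))

    // Invert (point, cluster) pairs into cluster -> member list.
    val clusters = pairs.groupBy(_._2).mapValues(_.map(_._1))
    // Smallest cluster first, matching the sortBy in the example.
    val assignments = clusters.toList.sortBy { case (_, members) => members.length }
    val assignmentsStr = assignments
      .map { case (c, members) => s"$c -> ${members.sorted.mkString("[", ",", "]")}" }
      .mkString(",")

    println(s"Cluster assignments: $assignmentsStr")
    // Prints: Cluster assignments: 1 -> [0,1],0 -> [2,3,4]
  }
}
```

This reproduces the "k -> [members]" layout of the sample output in the scaladoc.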
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24518800

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    + * An example Power Iteration Clustering app. Takes an input of K concentric circles
    + * with a total of n sampled points (total here means across ALL of the circles).

    --- End diff --

    OK
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24518861

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    + * Where options include:
    + *   k:  Number of circles/clusters

    --- End diff --

    OK
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24518525

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    +  def generateCirclesRdd(sc: SparkContext,
    +      nCircles: Int = 3,
    +      nPoints: Int = 30,
    +      outerRadius: Double): RDD[(Long, Long, Double)] = {
    +
    +    val radii = for (cx <- 0 until nCircles) yield outerRadius / (nCircles - cx)
    +    val groupSizes = for (cx <- 0 until nCircles) yield (cx + 1) * nPoints
    +    var ix = 0
    +    val points = for (cx <- 0 until nCircles
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4495#discussion_r24471635

    --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala ---
    [...]
    +      opt[Int]('n', "n")
    +        .text(s"number of points, default: ${defaultParams.numPoints}")

    --- End diff --

    ok
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24471616 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,149 @@ + opt[Int]('k', "k") + .text(s"number of circles (/clusters), default: ${defaultParams.k}") --- End diff -- err yes.. back to the drawing board for the IJ auto formatting.
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24471723 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,149 @@ + def run(params: Params) { +val conf = new SparkConf() +.setMaster("local") +.setAppName(s"PowerIterationClustering with $params") +val sc = new SparkContext(conf) + +Logger.getRootLogger.setLevel(Level.WARN) + +val circlesRdd = generateCirclesRdd(sc, params.k, params.numPoints, params.outerRadius) +val model = new PowerIterationClustering() +.setK(params.k) +.setMaxIterations(params.numIterations) +.run(circlesRdd) + +val clusters = model.assignments.collect.groupBy(_._2).mapValues(_.map(_._1)) +val assignments = clusters.toList.sortBy { case (k, v) => v.length } +val assignmentsStr = assignments +.map { case (k, v) => s"$k -> ${v.sorted.mkString("[", ",", "]")}" }.mkString(",") +println(s"Cluster assignments: $assignmentsStr") + +sc.stop() + } + + def generateCirclesRdd(sc: SparkContext, + nCircles: Int = 3, + nPoints: Int = 30, + outerRadius: Double): RDD[(Long, Long, Double)] = { + +val radii = for (cx <- 0 until nCircles) yield outerRadius / (nCircles - cx) --- End diff -- OK
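The run() method above groups the model's (pointId, clusterId) assignments by cluster, orders clusters by size, and renders them as the `1 -> [0,1,...]` strings shown in the scaladoc sample. That grouping/formatting logic, re-expressed in plain Python as an illustration (not code from the PR; `format_assignments` is a hypothetical name):

```python
from collections import defaultdict

def format_assignments(assignments):
    # Group point ids by cluster id, like groupBy(_._2) in the Scala code.
    clusters = defaultdict(list)
    for point_id, cluster_id in assignments:
        clusters[cluster_id].append(point_id)
    # Order clusters by size, smallest first, matching sortBy(v.length).
    ordered = sorted(clusters.items(), key=lambda kv: len(kv[1]))
    return ",".join(
        f"{cid} -> [{','.join(str(p) for p in sorted(pids))}]"
        for cid, pids in ordered
    )
```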
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24394701 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.mllib + +import org.apache.log4j.{Level, Logger} +import org.apache.spark.mllib.clustering.PowerIterationClustering +import org.apache.spark.rdd.RDD +import org.apache.spark.{SparkConf, SparkContext} +import scopt.OptionParser + +/** + * An example Power Iteration Clustering app. Takes an input of K concentric circles + * with a total of n sampled points (total here means across ALL of the circles). + * The output should be K clusters - each cluster containing precisely the points associated + * with each of the input circles. + * + * Run with + * {{{ + * ./bin/run-example org.apache.spark.examples.mllib.PowerIterationClusteringExample [options] + * + * Where options include: + * k: Number of circles/ clusters + * n: Total number of sampled points.
There are proportionally more points within the + * outer/larger circles + * numIterations: Number of Power Iterations + * outerRadius: radius of the outermost of the concentric circles + * }}} + * + * Here is a sample run and output: + * + * ./bin/run-example org.apache.spark.examples.mllib.PowerIterationClusteringExample -k 3 --n 30 --numIterations 15 + * + * Cluster assignments: 1 -> [0,1,2,3,4],2 -> [5,6,7,8,9,10,11,12,13,14], + * 0 -> [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] + * + * + * If you use it as a template to create your own app, please use `spark-submit` to submit your app. + */ +object PowerIterationClusteringExample { + + case class Params( + input: String = null, + k: Int = 3, + numPoints: Int = 30, + numIterations: Int = 10, + outerRadius: Double = 3.0 + ) extends AbstractParams[Params] + + def main(args: Array[String]) { +val defaultParams = Params() + +val parser = new OptionParser[Params]("PIC Circles") { + head("PowerIterationClusteringExample: an example PIC app using concentric circles.")
+ opt[Int]('k', "k") +.text(s"number of circles (/clusters), default: ${defaultParams.k}") +.action((x, c) => c.copy(k = x)) + opt[Int]('n', "n") +.text(s"number of points, default: ${defaultParams.numPoints}") +.action((x, c) => c.copy(numPoints = x)) + opt[Int]("numIterations") +.text(s"number of iterations, default: ${defaultParams.numIterations}") +.action((x, c) => c.copy(numIterations = x)) +} + +parser.parse(args, defaultParams).map { params => + run(params) +}.getOrElse { + sys.exit(1) +} + } + + def generateCircle(n: Int, r: Double): Array[(Double, Double)] = { +val pi2 = 2 * math.Pi +(0.0 until pi2 by pi2 / n).map { x => + (r * math.cos(x), r * math.sin(x)) +}.toArray + } + + def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, nTotalPoints: Int = 30, + outerRadius: Double): + RDD[(Long, Long, Double)] = { +// The circles are generated as follows: +// The Radii are equal to the largestRadius/(C - circleIndex) +// where C=Number of circles +// and the circleIndex is 0 for the innermost and (nCircles-1) for the outermost circle +// The number of points in each circle (and thus in each final cluster) is: +// x, 2x, .., nCircles*x +// Where x is found from x = N / (C(C+1)/2) +// The # points in the LAST circle is adjusted downwards so that the total sum is equal +// to the nTotalPoints + +val smallestRad = math.ceil(nTotalPoints / (nCircles * (nCircles + 1) / 2.0
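The layout scheme described in the quoted comments — radii of largestRadius/(C - circleIndex), and per-circle point counts x, 2x, ..., C*x with x = ceil(N / (C(C+1)/2)) and the last circle adjusted so the counts sum to N — can be sketched as follows. This is an illustration in Python, not code from the PR, and `circle_layout` is a hypothetical helper name.

```python
import math

def circle_layout(n_total_points, n_circles, outer_radius):
    # Radii: outerRadius / (C - circleIndex), innermost circle first.
    radii = [outer_radius / (n_circles - cx) for cx in range(n_circles)]
    # Smallest group size x = ceil(N / (C(C+1)/2)); sizes grow as
    # x, 2x, ..., C*x, and the last is adjusted so the total is exact.
    x = math.ceil(n_total_points / (n_circles * (n_circles + 1) / 2.0))
    sizes = [(cx + 1) * x for cx in range(n_circles)]
    sizes[-1] -= sum(sizes) - n_total_points
    return radii, sizes
```

For the defaults (N=30, C=3, outerRadius=3.0) this yields radii [1.0, 1.5, 3.0] and cluster sizes [5, 10, 15], matching the 5/10/15-point clusters in the scaladoc's sample output.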
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24392886 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,176 @@
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393348 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,176 @@ + def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, nTotalPoints: Int = 30, + outerRadius: Double): + RDD[(Long, Long, Double)] = { --- End diff -- is there a code style xml for apache spark - since IJ sets the indentation to like 40
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24394006 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,176 @@
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393968 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,176 @@
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393995 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- [quotes the same PowerIterationClusteringExample.scala diff as above; the message is truncated before the review comment]
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24394002 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- [quotes the same diff as above; the message is truncated before the review comment]
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393130 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- [quotes the same diff as above; the message is truncated before the review comment]
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393254 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- [same quoted diff as above, down to the import block]

+import org.apache.log4j.{Level, Logger}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+import scopt.OptionParser

--- End diff -- I manually fixed the order. Maybe I ran Optimize Imports again: apparently the default behavior in IntelliJ is completely different from the Spark requirements. Will try that plugin.
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393498 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- [same quoted diff as above, down to the import block]

+import org.apache.log4j.{Level, Logger}
+import org.apache.spark.mllib.clustering.PowerIterationClustering
+import org.apache.spark.rdd.RDD
+import org.apache.spark.{SparkConf, SparkContext}
+import scopt.OptionParser

--- End diff -- That plugin is apparently for IntelliJ 13 and says it should not be used, so that is a bit confusing. Anyway, I did some more manual rearranging.
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393834 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- [same quoted diff as above, down to the generateCirclesRdd signature]

+  def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, nTotalPoints: Int = 30,

--- End diff -- This is a problem: it is not reasonable to adjust these indentations manually. I am doing it this time, but I have posted a Stack Overflow question about it: http://stackoverflow.com/questions/28426355/intellij-code-style-setting-for-wrapping-on-multi-line-function-arguments
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393018 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- [quotes the same diff as above; the message is truncated before the review comment]
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393918 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- [quotes the same diff as above; the message is truncated before the review comment]
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24394635 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.mllib + +import org.apache.log4j.{Level, Logger} +import org.apache.spark.mllib.clustering.PowerIterationClustering +import org.apache.spark.rdd.RDD +import org.apache.spark.{SparkConf, SparkContext} +import scopt.OptionParser + +/** + * An example Power Iteration Clustering app. Takes an input of K concentric circles + * with a total of n sampled points (total here means across ALL of the circles). + * The output should be K clusters - each cluster containing precisely the points associated + * with each of the input circles. + * + * Run with + * {{{ + * ./bin/run-example org.apache.spark.examples.mllib.PowerIterationClusteringExample [options] + * + * Where options include: + * k: Number of circles/ clusters + * n: Total number of sampled points. 
There are proportionally more points within the + * outer/larger circles + * numIterations: Number of Power Iterations + * outerRadius: radius of the outermost of the concentric circles + * }}} + * + * Here is a sample run and output: + * + * ./bin/run-example org.apache.spark.examples.mllib.PowerIterationClusteringExample -k 3 --n 30 --numIterations 15 + * + * Cluster assignments: 1 - [0,1,2,3,4],2 - [5,6,7,8,9,10,11,12,13,14], + *0 - [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] + * + * + * If you use it as a template to create your own app, please use `spark-submit` to submit your app. + */ +object PowerIterationClusteringExample { + + case class Params( + input: String = null, + k: Int = 3, + numPoints: Int = 30, + numIterations: Int = 10, + outerRadius: Double = 3.0 + ) extends AbstractParams[Params] + + def main(args: Array[String]) { +val defaultParams = Params() + +val parser = new OptionParser[Params](PIC Circles) { + head(PowerIterationClusteringExample: an example PIC app using concentric circles.) + opt[Int]('k', k) +.text(snumber of circles (/clusters), default: ${defaultParams.k}) +.action((x, c) = c.copy(k = x)) + opt[Int]('n', n) +.text(snumber of points, default: ${defaultParams.numPoints}) +.action((x, c) = c.copy(numPoints = x)) + opt[Int](numIterations) +.text(snumber of iterations, default: ${defaultParams.numIterations}) +.action((x, c) = c.copy(numIterations = x)) +} + +parser.parse(args, defaultParams).map { params = + run(params) +}.getOrElse { + sys.exit(1) +} + } + + def generateCircle(n: Int, r: Double): Array[(Double, Double)] = { +val pi2 = 2 * math.Pi +(0.0 until pi2 by pi2 / n).map { x = + (r * math.cos(x), r * math.sin(x)) +}.toArray + } + + def generateCirclesRdd(sc: SparkContext, nCircles: Int = 3, nTotalPoints: Int = 30, + outerRadius: Double): + RDD[(Long, Long, Double)] = { --- End diff -- OK --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
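The `generateCircle` helper quoted above samples n points evenly spaced around a circle of radius r. A dependency-free sketch of the same idea (names here are illustrative, not the example's exact code):

```scala
// Evenly-spaced circle sampling, plain Scala (no Spark dependency).
// An integer index avoids the floating-point accumulation of a
// `0.0 until 2*Pi by step` Double range, which can drop or duplicate
// the last point.
object CircleGen {
  def generateCircle(n: Int, r: Double): Array[(Double, Double)] = {
    val step = 2 * math.Pi / n
    (0 until n).map { i =>
      val theta = i * step
      (r * math.cos(theta), r * math.sin(theta))
    }.toArray
  }
}
```

Every generated point sits exactly on the radius-r circle, which is easy to check with `math.hypot`.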
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393289 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.mllib + +import org.apache.log4j.{Level, Logger} +import org.apache.spark.mllib.clustering.PowerIterationClustering +import org.apache.spark.rdd.RDD +import org.apache.spark.{SparkConf, SparkContext} +import scopt.OptionParser + +/** + * An example Power Iteration Clustering app. Takes an input of K concentric circles + * with a total of n sampled points (total here means across ALL of the circles). + * The output should be K clusters - each cluster containing precisely the points associated + * with each of the input circles. + * + * Run with + * {{{ + * ./bin/run-example org.apache.spark.examples.mllib.PowerIterationClusteringExample [options] --- End diff -- OK --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4495#discussion_r24393317 --- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala --- @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.mllib + +import org.apache.log4j.{Level, Logger} +import org.apache.spark.mllib.clustering.PowerIterationClustering +import org.apache.spark.rdd.RDD +import org.apache.spark.{SparkConf, SparkContext} +import scopt.OptionParser + +/** + * An example Power Iteration Clustering app. Takes an input of K concentric circles + * with a total of n sampled points (total here means across ALL of the circles). + * The output should be K clusters - each cluster containing precisely the points associated + * with each of the input circles. + * + * Run with + * {{{ + * ./bin/run-example org.apache.spark.examples.mllib.PowerIterationClusteringExample [options] + * + * Where options include: + * k: Number of circles/ clusters + * n: Total number of sampled points. 
There are proportionally more points within the + * outer/larger circles + * numIterations: Number of Power Iterations + * outerRadius: radius of the outermost of the concentric circles + * }}} + * + * Here is a sample run and output: + * + * ./bin/run-example org.apache.spark.examples.mllib.PowerIterationClusteringExample -k 3 --n 30 --numIterations 15 + * + * Cluster assignments: 1 -> [0,1,2,3,4], 2 -> [5,6,7,8,9,10,11,12,13,14], + * 0 -> [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29] + * + * + * If you use it as a template to create your own app, please use `spark-submit` to submit your app. + */ +object PowerIterationClusteringExample { + + case class Params( + input: String = null, + k: Int = 3, + numPoints: Int = 30, + numIterations: Int = 10, + outerRadius: Double = 3.0 + ) extends AbstractParams[Params] + + def main(args: Array[String]) { + val defaultParams = Params() + + val parser = new OptionParser[Params]("PIC Circles") { + head("PowerIterationClusteringExample: an example PIC app using concentric circles.") + opt[Int]('k', "k") + .text(s"number of circles (/clusters), default: ${defaultParams.k}") + .action((x, c) => c.copy(k = x)) + opt[Int]('n', "n") + .text(s"number of points, default: ${defaultParams.numPoints}") + .action((x, c) => c.copy(numPoints = x)) + opt[Int]("numIterations") + .text(s"number of iterations, default: ${defaultParams.numIterations}") + .action((x, c) => c.copy(numIterations = x)) + } + + parser.parse(args, defaultParams).map { params => + run(params) + }.getOrElse { + sys.exit(1) + } + } + + def generateCircle(n: Int, r: Double): Array[(Double, Double)] = { --- End diff -- not anymore. removed.
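The archive stripped the string quotes from the scopt code quoted above. A minimal sketch of the same parser pattern, assuming scopt 3.x on the classpath; the `Params` case class here is a trimmed stand-in for the example's:

```scala
import scopt.OptionParser

// Trimmed stand-in for the example's Params (no AbstractParams base).
case class Params(k: Int = 3, numPoints: Int = 30, numIterations: Int = 10)

// scopt 3.x pattern: each opt declares a short/long name, help text,
// and an action that copies the new value into the immutable config.
val parser = new OptionParser[Params]("PIC Circles") {
  head("PowerIterationClusteringExample: an example PIC app using concentric circles.")
  opt[Int]('k', "k")
    .text("number of circles (/clusters), default: 3")
    .action((x, c) => c.copy(k = x))
  opt[Int]('n', "n")
    .text("number of points, default: 30")
    .action((x, c) => c.copy(numPoints = x))
  opt[Int]("numIterations")
    .text("number of iterations, default: 10")
    .action((x, c) => c.copy(numIterations = x))
}

// parse returns Option[Params]: Some(config) on success, None on error.
parser.parse(Array("--k", "3", "--n", "30", "--numIterations", "15"), Params()) match {
  case Some(p) => // run(p)
  case None    => sys.exit(1)
}
```

This mirrors the original bug report: an option left out of the parser (here, `outerRadius`) is silently ignored on the command line, which is exactly what SPARK-6937 later fixed.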
[GitHub] spark pull request: [SPARK-5503][MLLIB] Example code for Power Ite...
GitHub user javadba opened a pull request: https://github.com/apache/spark/pull/4495 [SPARK-5503][MLLIB] Example code for Power Iteration Clustering You can merge this pull request into a Git repository by running: $ git pull https://github.com/Huawei-Spark/spark picexamples Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4495.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4495 commit 5864d4ada20f7b087e56741d0f39a36844f3a073 Author: sboeschhuawei stephen.boe...@huawei.com Date: 2015-02-03T21:11:08Z placeholder for pic examples commit c50913012654372da9b0b6a4d1b7e4e410f9afda Author: sboeschhuawei stephen.boe...@huawei.com Date: 2015-02-03T21:14:36Z placeholder for pic examples commit 03e8de460da07a6451bd20493dc4653c20cbe952 Author: sboeschhuawei stephen.boe...@huawei.com Date: 2015-02-10T00:57:10Z Added PICExample commit efeec458cb08b39bee4f026a2e5233f3bfc2d0c2 Author: sboeschhuawei stephen.boe...@huawei.com Date: 2015-02-10T06:24:48Z Update to PICExample from Xiangrui's comments --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23803786 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV} +import org.apache.log4j.Logger +import org.apache.spark.SparkContext +import org.apache.spark.graphx._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +import scala.language.existentials + +/** + * Implements the scalable graph clustering algorithm Power Iteration Clustering (see + * www.icml2010.org/papers/387.pdf). From the abstract: + * + * The input data is first transformed to a normalized Affinity Matrix via Gaussian pairwise + * distance calculations. Power iteration is then used to find a dimensionality-reduced + * representation. The resulting pseudo-eigenvector provides effective clustering - as + * performed by Parallel KMeans. 
+ */ +object PowerIterationClustering { + + private val logger = Logger.getLogger(getClass.getName()) + + type LabeledPoint = (VertexId, BDV[Double]) + type Points = Seq[LabeledPoint] + type DGraph = Graph[Double, Double] + type IndexedVector[Double] = (Long, BDV[Double]) + + // Terminate iteration when norm changes by less than this value + val defaultMinNormChange: Double = 1e-11 + + // Default number of iterations for PIC loop + val defaultIterations: Int = 20 + + // Do not allow divide by zero: change to this value instead + val defaultDivideByZeroVal: Double = 1e-15 + + // Default number of runs by the KMeans.run() method + val defaultKMeansRuns = 10 --- End diff -- OK. This is actually a mistake/bug: the default was intended to be 1. Will set to 1.
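The constants above (`defaultMinNormChange`, `defaultIterations`) parameterize the PIC loop. As an illustration only, here is the power-iteration kernel on a small dense matrix in plain Scala with no GraphX; the real implementation distributes the matrix-vector multiply over a graph:

```scala
// Power iteration on a dense affinity matrix: repeatedly multiply the
// current vector by W and rescale by the infinity norm, stopping when the
// norm change falls below a tolerance (cf. defaultMinNormChange). The
// result approximates the dominant (pseudo-)eigenvector.
object PowerIteration {
  def run(w: Array[Array[Double]], maxIter: Int = 20, tol: Double = 1e-11): Array[Double] = {
    val n = w.length
    var v = Array.fill(n)(1.0 / n)   // uniform starting vector
    var prevNorm = 0.0
    var iter = 0
    var done = false
    while (iter < maxIter && !done) {
      val next = w.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
      val norm = next.map(math.abs).max          // infinity norm
      v = next.map(_ / norm)
      if (math.abs(norm - prevNorm) < tol) done = true
      prevNorm = norm
      iter += 1
    }
    v
  }
}
```

For a row-normalized affinity matrix the entries of the converged vector separate points by cluster, which is why the example can then run KMeans on the one-dimensional embedding.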
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23808251 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringSuite.scala --- @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV} +import org.apache.log4j.Logger +import org.apache.spark.SparkContext +import org.apache.spark.graphx.{EdgeRDD, Edge, Graph} +import org.apache.spark.mllib.clustering.PowerIterationClustering.{LabeledPoint, Points, IndexedVector} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.rdd.RDD +import org.scalatest.FunSuite + +import scala.util.Random + +class PowerIterationClusteringSuite extends FunSuite with MLlibTestSparkContext { + + val logger = Logger.getLogger(getClass.getName) + + import org.apache.spark.mllib.clustering.PowerIterationClusteringSuite._ + + test("concentricCirclesTest") { + concentricCirclesTest() + } + + def concentricCirclesTest() = { + val sigma = 1.0 + val nIterations = 10 + + val circleSpecs = Seq( + // Best results for 30 points + CircleSpec(Point(0.0, 0.0), 0.03, 0.1, 3), + CircleSpec(Point(0.0, 0.0), 0.3, 0.03, 12), + CircleSpec(Point(0.0, 0.0), 1.0, 0.01, 15) + // Add following to get 100 points + , CircleSpec(Point(0.0, 0.0), 1.5, 0.005, 30), + CircleSpec(Point(0.0, 0.0), 2.0, 0.002, 40) + ) + + val nClusters = circleSpecs.size + val cdata = createConcentricCirclesData(circleSpecs) + val vertices = new Random().shuffle(cdata.map { p => + (p.label, new BDV(Array(p.x, p.y))) + }) + + val nVertices = vertices.length + val G = createGaussianAffinityMatrix(sc, vertices) + val (ccenters, estCollected) = PIC.run(sc, G, nClusters, nIterations) + logger.info(s"Cluster centers: ${ccenters.mkString(",")}" + + s"\nEstimates: ${estCollected.mkString("[", ",", "]")}") + assert(ccenters.size == circleSpecs.length, "Did not get correct number of centers") + + } + +} + +object PowerIterationClusteringSuite { + val logger = Logger.getLogger(getClass.getName) + val A = Array + val PIC = PowerIterationClustering + + // Default sigma for Gaussian Distance calculations + val defaultSigma = 1.0 + + // Default minimum affinity between points - lower than this it is considered + // zero and no edge will be created + val defaultMinAffinity = 1e-11 + + def pdoub(d: Double) = f"$d%1.6f" + + case class Point(label: Long, x: Double, y: Double) { + def this(x: Double, y: Double) = this(-1L, x, y) + + override def toString() = s"($label, (${pdoub(x)},${pdoub(y)}))" + } + + object Point { + def apply(x: Double, y: Double) = new Point(-1L, x, y) + } + + case class CircleSpec(center: Point, radius: Double, noiseToRadiusRatio: Double, + nPoints: Int, uniformDistOnCircle: Boolean = true) + + def createConcentricCirclesData(circleSpecs: Seq[CircleSpec]) = { + import org.apache.spark.mllib.random.StandardNormalGenerator + val normalGen = new StandardNormalGenerator + var idStart = 0 + val circles = for (csp <- circleSpecs) yield { + idStart += 1000 + val circlePoints = for (thetax <- 0 until csp.nPoints) yield { + val theta = thetax * 2 * Math.PI / csp.nPoints + val (x, y) = (csp.radius * Math.cos(theta) + * (1 + normalGen.nextValue * csp.noiseToRadiusRatio), + csp.radius * Math.sin(theta) * (1 + normalGen.nextValue * csp.noiseToRadiusRatio)) + (Point(idStart + thetax, x, y)) + } + circlePoints + } + val points = circles.flatten.sortBy(_.label) + logger.info
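The `createConcentricCirclesData` helper above jitters each circle point with Gaussian noise. A self-contained sketch of the same scheme, where plain tuples stand in for `CircleSpec` and the independent per-coordinate jitter mirrors the quoted code:

```scala
import scala.util.Random

// Noisy concentric-circle generation: each (radius, noiseToRadiusRatio,
// nPoints) tuple describes one circle; each coordinate gets independent
// Gaussian jitter scaled by noiseToRadiusRatio, as in the test suite.
object ConcentricCircles {
  def generate(specs: Seq[(Double, Double, Int)], seed: Long = 42L): Seq[(Double, Double)] = {
    val rng = new Random(seed)
    specs.flatMap { case (radius, noiseRatio, nPoints) =>
      (0 until nPoints).map { i =>
        val theta = i * 2 * math.Pi / nPoints
        (radius * math.cos(theta) * (1 + rng.nextGaussian() * noiseRatio),
         radius * math.sin(theta) * (1 + rng.nextGaussian() * noiseRatio))
      }
    }
  }
}
```

With the noise ratio set to zero the generator degenerates to the exact circles, which makes its geometry easy to verify.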
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23807937 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV} +import org.apache.log4j.Logger +import org.apache.spark.SparkContext +import org.apache.spark.graphx._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +import scala.language.existentials + +/** + * Implements the scalable graph clustering algorithm Power Iteration Clustering (see + * www.icml2010.org/papers/387.pdf). From the abstract: + * + * The input data is first transformed to a normalized Affinity Matrix via Gaussian pairwise + * distance calculations. Power iteration is then used to find a dimensionality-reduced + * representation. The resulting pseudo-eigenvector provides effective clustering - as + * performed by Parallel KMeans. 
+ */ +object PowerIterationClustering { + + private val logger = Logger.getLogger(getClass.getName()) + + type LabeledPoint = (VertexId, BDV[Double]) + type Points = Seq[LabeledPoint] + type DGraph = Graph[Double, Double] + type IndexedVector[Double] = (Long, BDV[Double]) + + // Terminate iteration when norm changes by less than this value + val defaultMinNormChange: Double = 1e-11 + + // Default number of iterations for PIC loop + val defaultIterations: Int = 20 + + // Do not allow divide by zero: change to this value instead + val defaultDivideByZeroVal: Double = 1e-15 + + // Default number of runs by the KMeans.run() method + val defaultKMeansRuns = 10 + + /** + * + * Run a Power Iteration Clustering + * + * @param sc Spark Context + * @param G Affinity Matrix in a Sparse Graph structure + * @param nClusters Number of clusters to create + * @param nIterations Number of iterations of the PIC algorithm + * that calculates primary PseudoEigenvector and Eigenvalue + * @param nRuns Number of runs for the KMeans clustering + * @return Tuple of (Seq[(Cluster Id,Cluster Center)], + * Seq[(VertexId, ClusterID Membership)]) + */ + def run(sc: SparkContext, + G: Graph[Double, Double], + nClusters: Int, + nIterations: Int = defaultIterations, + nRuns: Int = defaultKMeansRuns) + : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = { + val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations) + // TODO: avoid local collect and then sc.parallelize. + val localVt = vt.collect.sortBy(_._1) + val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2)))) + vectRdd.cache() + val model = KMeans.train(vectRdd.map { + _._2 + }, nClusters, nRuns) + vectRdd.unpersist() + if (logger.isDebugEnabled) { + logger.debug(s"Eigenvalue = $lambda EigenVector: ${localVt.mkString(",")}") + } + val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2))) + if (logger.isDebugEnabled) { + logger.debug(s"lambda=$lambda eigen=${localVt.mkString(",")}") + } + val ccs = (0 until model.clusterCenters.length).zip(model.clusterCenters) + if (logger.isDebugEnabled) { + logger.debug(s"Kmeans model cluster centers: ${ccs.mkString(",")}") + } + val estCollected = estimates.collect.sortBy(_._1._1) + if (logger.isDebugEnabled) { + val clusters = estCollected.map(_._2) + val counts = estCollected.groupBy(_._2).mapValues { + _.length + } + logger.debug(s"Cluster counts: Counts: ${counts.mkString(",")}" + + s"\nCluster Estimates: ${estCollected.mkString
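`run()` consumes an affinity graph built from Gaussian pairwise distances (`createGaussianAffinityMatrix` in the suite). A dense, Spark-free sketch of that construction; the `sigma` and `minAffinity` defaults here are assumptions mirroring `defaultSigma` and `defaultMinAffinity` quoted earlier:

```scala
// Gaussian (heat-kernel) affinity between 2-D points:
//   A(i, j) = exp(-||p_i - p_j||^2 / (2 * sigma^2))
// with the diagonal zeroed and entries below a small cutoff dropped,
// so no edge is created for near-zero affinities.
object GaussianAffinity {
  def build(points: Array[(Double, Double)], sigma: Double = 1.0,
            minAffinity: Double = 1e-11): Array[Array[Double]] = {
    val n = points.length
    Array.tabulate(n, n) { (i, j) =>
      if (i == j) 0.0
      else {
        val dx = points(i)._1 - points(j)._1
        val dy = points(i)._2 - points(j)._2
        val a = math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
        if (a < minAffinity) 0.0 else a
      }
    }
  }
}
```

The matrix is symmetric with monotonically decreasing affinity in distance, the two properties PIC relies on before row-normalizing it for the power iteration.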
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23807026 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV} +import org.apache.log4j.Logger +import org.apache.spark.SparkContext +import org.apache.spark.graphx._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +import scala.language.existentials + +/** + * Implements the scalable graph clustering algorithm Power Iteration Clustering (see + * www.icml2010.org/papers/387.pdf). From the abstract: + * + * The input data is first transformed to a normalized Affinity Matrix via Gaussian pairwise + * distance calculations. Power iteration is then used to find a dimensionality-reduced + * representation. The resulting pseudo-eigenvector provides effective clustering - as + * performed by Parallel KMeans. 
+ */ +object PowerIterationClustering { + + private val logger = Logger.getLogger(getClass.getName()) + + type LabeledPoint = (VertexId, BDV[Double]) + type Points = Seq[LabeledPoint] + type DGraph = Graph[Double, Double] + type IndexedVector[Double] = (Long, BDV[Double]) + + // Terminate iteration when norm changes by less than this value + val defaultMinNormChange: Double = 1e-11 + + // Default number of iterations for PIC loop + val defaultIterations: Int = 20 + + // Do not allow divide by zero: change to this value instead + val defaultDivideByZeroVal: Double = 1e-15 + + // Default number of runs by the KMeans.run() method + val defaultKMeansRuns = 10 + + /** + * + * Run a Power Iteration Clustering + * + * @param sc Spark Context + * @param G Affinity Matrix in a Sparse Graph structure --- End diff -- Done
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23801499 --- Diff: docs/mllib-clustering-pic.md --- @@ -0,0 +1,30 @@ +--- --- End diff -- OK
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23802807 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV} +import org.apache.log4j.Logger +import org.apache.spark.SparkContext +import org.apache.spark.graphx._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +import scala.language.existentials + +/** + * Implements the scalable graph clustering algorithm Power Iteration Clustering (see + * www.icml2010.org/papers/387.pdf). From the abstract: + * + * The input data is first transformed to a normalized Affinity Matrix via Gaussian pairwise --- End diff -- OK --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23808004 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV} +import org.apache.log4j.Logger +import org.apache.spark.SparkContext +import org.apache.spark.graphx._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +import scala.language.existentials + +/** + * Implements the scalable graph clustering algorithm Power Iteration Clustering (see + * www.icml2010.org/papers/387.pdf). From the abstract: + * + * The input data is first transformed to a normalized Affinity Matrix via Gaussian pairwise + * distance calculations. Power iteration is then used to find a dimensionality-reduced + * representation. The resulting pseudo-eigenvector provides effective clustering - as + * performed by Parallel KMeans. 
+ */ +object PowerIterationClustering { + + private val logger = Logger.getLogger(getClass.getName()) + + type LabeledPoint = (VertexId, BDV[Double]) + type Points = Seq[LabeledPoint] + type DGraph = Graph[Double, Double] + type IndexedVector[Double] = (Long, BDV[Double]) + + // Terminate iteration when norm changes by less than this value + val defaultMinNormChange: Double = 1e-11 + + // Default number of iterations for PIC loop + val defaultIterations: Int = 20 + + // Do not allow divide by zero: change to this value instead + val defaultDivideByZeroVal: Double = 1e-15 + + // Default number of runs by the KMeans.run() method + val defaultKMeansRuns = 10 + + /** + * + * Run a Power Iteration Clustering + * + * @param sc Spark Context + * @param G Affinity Matrix in a Sparse Graph structure + * @param nClusters Number of clusters to create + * @param nIterations Number of iterations of the PIC algorithm + * that calculates primary PseudoEigenvector and Eigenvalue + * @param nRuns Number of runs for the KMeans clustering + * @return Tuple of (Seq[(Cluster Id,Cluster Center)], + * Seq[(VertexId, ClusterID Membership)]) + */ + def run(sc: SparkContext, + G: Graph[Double, Double], + nClusters: Int, + nIterations: Int = defaultIterations, + nRuns: Int = defaultKMeansRuns) + : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = { + val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations) + // TODO: avoid local collect and then sc.parallelize. + val localVt = vt.collect.sortBy(_._1) + val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2)))) + vectRdd.cache() + val model = KMeans.train(vectRdd.map { + _._2 + }, nClusters, nRuns) + vectRdd.unpersist() + if (logger.isDebugEnabled) { + logger.debug(s"Eigenvalue = $lambda EigenVector: ${localVt.mkString(",")}") + } + val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2))) + if (logger.isDebugEnabled) { + logger.debug(s"lambda=$lambda eigen=${localVt.mkString(",")}") + } + val ccs = (0 until model.clusterCenters.length).zip(model.clusterCenters) + if (logger.isDebugEnabled) { + logger.debug(s"Kmeans model cluster centers: ${ccs.mkString(",")}") + } + val estCollected = estimates.collect.sortBy(_._1._1) + if (logger.isDebugEnabled) { + val clusters = estCollected.map(_._2) + val counts = estCollected.groupBy(_._2).mapValues { + _.length + } + logger.debug(s"Cluster counts: Counts: ${counts.mkString(",")}" + + s"\nCluster Estimates: ${estCollected.mkString
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23807925 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV} +import org.apache.log4j.Logger +import org.apache.spark.SparkContext +import org.apache.spark.graphx._ +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + +import scala.language.existentials + +/** + * Implements the scalable graph clustering algorithm Power Iteration Clustering (see + * www.icml2010.org/papers/387.pdf). From the abstract: + * + * The input data is first transformed to a normalized Affinity Matrix via Gaussian pairwise + * distance calculations. Power iteration is then used to find a dimensionality-reduced + * representation. The resulting pseudo-eigenvector provides effective clustering - as + * performed by Parallel KMeans. 
+ */ +object PowerIterationClustering { + + private val logger = Logger.getLogger(getClass.getName()) + + type LabeledPoint = (VertexId, BDV[Double]) + type Points = Seq[LabeledPoint] + type DGraph = Graph[Double, Double] + type IndexedVector[Double] = (Long, BDV[Double]) + + // Terminate iteration when norm changes by less than this value + val defaultMinNormChange: Double = 1e-11 + + // Default number of iterations for PIC loop + val defaultIterations: Int = 20 + + // Do not allow divide by zero: change to this value instead + val defaultDivideByZeroVal: Double = 1e-15 + + // Default number of runs by the KMeans.run() method + val defaultKMeansRuns = 10 + + /** + * + * Run a Power Iteration Clustering + * + * @param sc Spark Context + * @param G Affinity Matrix in a Sparse Graph structure + * @param nClusters Number of clusters to create + * @param nIterations Number of iterations of the PIC algorithm + * that calculates the primary PseudoEigenvector and Eigenvalue + * @param nRuns Number of runs for the KMeans clustering + * @return Tuple of (Seq[(Cluster Id, Cluster Center)], + * Seq[((VertexId, Vector), ClusterID Membership)]) + */ + def run(sc: SparkContext, + G: Graph[Double, Double], + nClusters: Int, + nIterations: Int = defaultIterations, + nRuns: Int = defaultKMeansRuns) + : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = { + val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations) + // TODO: avoid local collect and then sc.parallelize.
+ val localVt = vt.collect.sortBy(_._1) + val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2.toArray)))) + vectRdd.cache() + val model = KMeans.train(vectRdd.map { _._2 }, nClusters, nRuns) + vectRdd.unpersist() + if (logger.isDebugEnabled) { + logger.debug(s"Eigenvalue = $lambda EigenVector: ${localVt.mkString(",")}") + } + val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2))) + if (logger.isDebugEnabled) { + logger.debug(s"lambda=$lambda eigen=${localVt.mkString(",")}") + } + val ccs = (0 until model.clusterCenters.length).zip(model.clusterCenters) + if (logger.isDebugEnabled) { + logger.debug(s"Kmeans model cluster centers: ${ccs.mkString(",")}") + } + val estCollected = estimates.collect.sortBy(_._1._1) + if (logger.isDebugEnabled) { + val clusters = estCollected.map(_._2) + val counts = estCollected.groupBy(_._2).mapValues { _.length } + logger.debug(s"Cluster counts: Counts: ${counts.mkString(",")}" + s"\nCluster Estimates: ${estCollected.mkString
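For readers following the quoted run() code: the getPrincipalEigen call it makes is the power-iteration loop from the Lin & Cohen paper cited in the scaladoc. A minimal plain-Python sketch of that idea follows; the function name, defaults, and in-memory list representation are illustrative assumptions, not PR code, which runs distributed over a GraphX graph.

```python
# Sketch of the PIC power-iteration step on a small, in-memory
# affinity matrix. The real code is distributed and stops early so the
# intermediate pseudo-eigenvector still separates the clusters.
def power_iterate(W, n_iters=20, tol=1e-11):
    """Approximate the principal pseudo-eigenvector of D^-1 W."""
    n = len(W)
    # Row-normalize the affinity matrix (W_norm = D^-1 W).
    W_norm = [[w / sum(row) for w in row] for row in W]
    # PIC-style start vector: row degrees over total volume.
    vol = sum(sum(row) for row in W)
    v = [sum(row) / vol for row in W]
    for _ in range(n_iters):
        # One matrix-vector multiply: v <- W_norm v ...
        nv = [sum(W_norm[i][j] * v[j] for j in range(n)) for i in range(n)]
        # ... then rescale by the max-abs entry to keep values bounded.
        scale = max(abs(x) for x in nv)
        nv = [x / scale for x in nv]
        delta = max(abs(a - b) for a, b in zip(nv, v))
        v = nv
        if delta < tol:  # analogous to defaultMinNormChange above
            break
    return v

# Two affinity blocks; before full convergence the entries of v tend
# toward one plateau per block, which is what k-means then separates.
W = [[1.0, 0.8, 0.1, 0.0],
     [0.8, 1.0, 0.0, 0.1],
     [0.1, 0.0, 1.0, 0.9],
     [0.0, 0.1, 0.9, 1.0]]
v = power_iterate(W, n_iters=5)
```

Note that at full convergence of a row-stochastic matrix the vector flattens out entirely, which is why PIC's early stopping (the small iteration count and tolerance above) matters.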
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23807487 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
+ val localVt = vt.collect.sortBy(_._1) + val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2.toArray)))) + vectRdd.cache() --- End diff -- True, it was sorted for another purpose; will remove. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
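The quoted run() hands the collected pseudo-eigenvector to KMeans.train. Since each vertex contributes one scalar per eigenvector, the final step amounts to 1-D k-means over the eigenvector entries. Here is a self-contained sketch of that step; kmeans_1d is a hypothetical helper, not the MLlib API the PR actually calls.

```python
# Sketch only: cluster 1-D pseudo-eigenvector entries with a tiny
# Lloyd-style k-means. The PR uses MLlib's KMeans.train instead.
def kmeans_1d(values, k=2, iters=20):
    # Deterministic init that assumes k=2 (min and max entries);
    # a general-k version would sample initial centers.
    centers = [min(values), max(values)]
    assign = [0] * len(values)
    for _ in range(iters):
        # Assignment step: nearest center by absolute distance.
        assign = [min(range(k), key=lambda c: abs(x - centers[c]))
                  for x in values]
        # Update step: mean of each cluster's members.
        for c in range(k):
            members = [x for x, a in zip(values, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, assign

# Entries forming two plateaus, as early-stopped power iteration
# tends to produce; k-means separates the plateaus.
centers, assign = kmeans_1d([0.11, 0.12, 0.13, 0.58, 0.60, 0.59])
```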
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23807503 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23807540 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23802566 --- Diff: docs/mllib-clustering-pic.md --- @@ -0,0 +1,30 @@ +--- +layout: global +title: Clustering - MLlib +displayTitle: <a href="mllib-guide.html">MLlib</a> - Power Iteration Clustering +--- + +* Table of contents +{:toc} + + +## Power Iteration Clustering + +Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm: + +* computes the Gaussian distance between all pairs of points and represents these distances in an Affinity Matrix --- End diff -- OK
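The doc paragraph above describes the first PIC step: building a Gaussian affinity matrix from pairwise distances. A small sketch of that construction follows; gaussian_affinity and the sigma bandwidth parameter are illustrative assumptions (the example app reads a radius from the command line, which plays a similar role).

```python
import math

# Sketch only: Gaussian (RBF) affinity matrix over 2-D points.
# Whether the diagonal is zeroed out varies by formulation; this
# version keeps exp(0) = 1 on the diagonal.
def gaussian_affinity(points, sigma=1.0):
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between points i and j.
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A

# Two nearby points and one far point: affinity decays with distance,
# and the matrix is symmetric.
A = gaussian_affinity([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)])
```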
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23802606 --- Diff: mllib/pom.xml --- @@ -103,6 +108,13 @@ <type>test-jar</type> <scope>test</scope> </dependency> +<!--dependency --- End diff -- Done.
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23807449 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
+ def run(sc: SparkContext, --- End diff -- Yes, got it: use rdd.sparkContext. Re: changing the input type from Graph to RDD[(Long, Long, Double)]: OK, that gives more flexibility if we later choose a different backend (e.g., not GraphX) for this implementation.
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23802691 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- +import scala.language.existentials --- End diff -- Done.
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23807660 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23802882 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala --- +object PowerIterationClustering { --- End diff -- OK working on this.
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23802897 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
+ private val logger = Logger.getLogger(getClass.getName()) --- End diff -- OK
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4254#discussion_r23818389

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
@@ -0,0 +1,220 @@
(license header, imports, and class Scaladoc omitted - identical to the quote above)
+ */
+object PowerIterationClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  val defaultMinNormChange: Double = 1e-11
+
+  // Default number of iterations for PIC loop
+  val defaultIterations: Int = 20
+
+  // Do not allow divide by zero: change to this value instead
+  val defaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val defaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc Spark Context
+   * @param G Affinity Matrix in a Sparse Graph structure
+   * @param nClusters Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *                    that calculates primary PseudoEigenvector and Eigenvalue
+   * @param nRuns Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id, Cluster Center)],
+   *         Seq[(VertexId, ClusterID Membership)])
+   */
+  def run(sc: SparkContext,
+          G: Graph[Double, Double],
+          nClusters: Int,
+          nIterations: Int = defaultIterations,
+          nRuns: Int = defaultKMeansRuns)
+    : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+    val (gUpdated, lambda, vt) = getPrincipalEigen(sc, G, nIterations)
+    // TODO: avoid local collect and then sc.parallelize.
+    val localVt = vt.collect.sortBy(_._1)
+    val vectRdd = sc.parallelize(localVt.map(v => (v._1, Vectors.dense(v._2.toArray))))
+    vectRdd.cache()
+    val model = KMeans.train(vectRdd.map {
+      _._2
+    }, nClusters, nRuns)
+    vectRdd.unpersist()
+    if (logger.isDebugEnabled) {
+      logger.debug(s"Eigenvalue = $lambda EigenVector: ${localVt.mkString(",")}")
+    }
+    val estimates = vectRdd.zip(model.predict(vectRdd.map(_._2)))
+    if (logger.isDebugEnabled) {
+      logger.debug(s"lambda=$lambda eigen=${localVt.mkString(",")}")
+    }
+    val ccs = (0 until model.clusterCenters.length).zip(model.clusterCenters)
+    if (logger.isDebugEnabled) {
+      logger.debug(s"Kmeans model cluster centers: ${ccs.mkString(",")}")
+    }
+    val estCollected = estimates.collect.sortBy(_._1._1)
+    if (logger.isDebugEnabled) {
+      val clusters = estCollected.map(_._2)
+      val counts = estCollected.groupBy(_._2).mapValues {
+        _.length
+      }
+      logger.debug(s"Cluster counts: Counts: ${counts.mkString(",")}"
+        + s"\nCluster Estimates: ${estCollected.mkString
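The run() body above hands the collected pseudo-eigenvector entries to KMeans for the final assignment step. As a standalone illustration of why that works - a tiny 1-D Lloyd's-algorithm k-means over made-up pseudo-eigenvector values, not the PR's MLlib KMeans call:

```scala
// Plain 1-D k-means (Lloyd's algorithm) over pseudo-eigenvector entries.
def kmeans1d(xs: Array[Double], initialCenters: Array[Double], nIterations: Int)
  : (Array[Double], Array[Int]) = {
  var centers = initialCenters
  var assignments = Array.fill(xs.length)(0)
  for (_ <- 0 until nIterations) {
    // Assign each value to its nearest center ...
    assignments = xs.map(x => centers.indices.minBy(i => math.abs(x - centers(i))))
    // ... then move each center to the mean of its members.
    centers = centers.indices.map { i =>
      val members = xs.zip(assignments).collect { case (x, a) if a == i => x }
      if (members.nonEmpty) members.sum / members.length else centers(i)
    }.toArray
  }
  (centers, assignments)
}

// Made-up pseudo-eigenvector entries: two plateaus of locally-converged values.
val eigen = Array(0.18, 0.19, 0.95, 1.00, 0.96)
val (centers, labels) = kmeans1d(eigen, Array(0.0, 1.0), 10)
// labels groups {0.18, 0.19} into one cluster and {0.95, 1.00, 0.96} into the other.
```

Because power iteration leaves each cluster's entries on a near-constant plateau, even this trivial clustering step recovers the partition.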
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4254#discussion_r23819495

--- Diff: data/mllib/pic_data.txt ---
@@ -0,0 +1,299 @@
(numeric data rows omitted)
--- End diff --

OK
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4254#discussion_r23811947

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
@@ -0,0 +1,220 @@
(quoted hunk omitted - identical to the PowerIterationClustering.scala hunk quoted above)
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4254#discussion_r23818325

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.scala ---
@@ -0,0 +1,220 @@
(quoted hunk omitted - identical to the PowerIterationClustering.scala hunk quoted above, ending at the @return line of the run() Scaladoc)
--- End diff --

OK, I did not consider THAT large of scale - where even a single row or column does not fit into a single Worker's memory. I am looking into it now to see if there are any obvious hindrances.
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on the pull request:

    https://github.com/apache/spark/pull/4254#issuecomment-71947549

I have moved out the Gaussian / Affinity matrix calculations. It is not clear where their new home is - or whether they have one. Presently the test cases rely upon them, so they now live in the testing suite. The core PowerIterationClustering class is still working, now with a Graph input.
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4254#discussion_r23728755

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration Clustering (see
+ * www.icml2010.org/papers/387.pdf). From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a dimensionality-reduced
+ * representation. The resulting pseudo-eigenvector provides effective clustering - as
+ * performed by Parallel KMeans.
+ */
+object PIClustering {
+
+  private val logger = Logger.getLogger(getClass.getName())
+
+  type LabeledPoint = (VertexId, BDV[Double])
+  type Points = Seq[LabeledPoint]
+  type DGraph = Graph[Double, Double]
+  type IndexedVector[Double] = (Long, BDV[Double])
+
+  // Terminate iteration when norm changes by less than this value
+  private[mllib] val DefaultMinNormChange: Double = 1e-11
+
+  // Default sigma for Gaussian distance calculations
+  private[mllib] val DefaultSigma = 1.0
+
+  // Default number of iterations for PIC loop
+  private[mllib] val DefaultIterations: Int = 20
+
+  // Default minimum affinity between points - lower than this it is considered
+  // zero and no edge will be created
+  private[mllib] val DefaultMinAffinity = 1e-11
+
+  // Do not allow divide by zero: change to this value instead
+  val DefaultDivideByZeroVal: Double = 1e-15
+
+  // Default number of runs by the KMeans.run() method
+  val DefaultKMeansRuns = 10
+
+  /**
+   *
+   * Run a Power Iteration Clustering
+   *
+   * @param sc Spark Context
+   * @param points Input Points in format of [(VertexId, (x, y))]
+   *               where VertexId is a Long
+   * @param nClusters Number of clusters to create
+   * @param nIterations Number of iterations of the PIC algorithm
+   *                    that calculates primary PseudoEigenvector and Eigenvalue
+   * @param sigma Sigma for Gaussian distribution calculation according to
+   *              [1/2 * sqrt(pi * sigma)] exp(-(x - y)**2 / (2 * sigma**2))
+   * @param minAffinity Minimum Affinity between two Points in the input dataset: below
+   *                    this threshold the affinity will be considered close to zero and
+   *                    no Edge will be created between those Points in the sparse matrix
+   * @param nRuns Number of runs for the KMeans clustering
+   * @return Tuple of (Seq[(Cluster Id, Cluster Center)],
+   *         Seq[(VertexId, ClusterID Membership)])
+   */
+  def run(sc: SparkContext,
+          points: Points,
+          nClusters: Int,
+          nIterations: Int = DefaultIterations,
+          sigma: Double = DefaultSigma,
+          minAffinity: Double = DefaultMinAffinity,
+          nRuns: Int = DefaultKMeansRuns)
+    : (Seq[(Int, Vector)], Seq[((VertexId, Vector), Int)]) = {
+    val vidsRdd = sc.parallelize(points.map(_._1).sorted)
+    val nVertices = points.length
+
+    val (wRdd, rowSums) = createNormalizedAffinityMatrix(sc, points, sigma)
+    val initialVt = createInitialVector(sc, points.map(_._1), rowSums)
+    if (logger.isDebugEnabled) {
+      logger.debug(s"Vt(0)=${
+        printVector(new BDV(initialVt.map {
+          _._2
+        }.toArray
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4254#discussion_r23728797

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala ---
@@ -0,0 +1,433 @@
(quoted hunk omitted - identical to the PIClustering.scala hunk quoted above, ending at the DefaultMinNormChange constant)
--- End diff --

Typical scala style is to use CamelCase for constants. Would you please explain briefly?
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on the pull request:

    https://github.com/apache/spark/pull/4254#issuecomment-71929175

RE: "Is it possible to do Gaussian similarity in another PR? It should be part of the feature transformation but not within PIC. It would be easier for code review if the PR is minimal."

This is a big change late in the game. We could do the following more easily: change the signature of PIC to allow the input to be the (user-defined) Affinity Matrix, and retain the code for the Gaussian similarity calculation so that it can still be passed in. Our tests, documentation, and verification all depend on it.
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4254#discussion_r23728909

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala ---
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.graphx._
+import org.apache.spark.{SparkConf, SparkContext}
+import org.scalatest.FunSuite
+
+import scala.util.Random
+
+class PIClusteringSuite extends FunSuite with LocalSparkContext {
+
+  val logger = Logger.getLogger(getClass.getName)
+
+  import org.apache.spark.mllib.clustering.PIClusteringSuite._
+
+  val PIC = PIClustering
+  val A = Array
+
+  test("concentricCirclesTest") {
+    concentricCirclesTest()
+  }
+
+  def concentricCirclesTest() = {
+    val sigma = 1.0
+    val nIterations = 10
+
+    val circleSpecs = Seq(
+      // Best results for 30 points
+      CircleSpec(Point(0.0, 0.0), 0.03, 0.1, 3),
+      CircleSpec(Point(0.0, 0.0), 0.3, 0.03, 12),
+      CircleSpec(Point(0.0, 0.0), 1.0, 0.01, 15)
+      // Add following to get 100 points
+      , CircleSpec(Point(0.0, 0.0), 1.5, 0.005, 30),
+      CircleSpec(Point(0.0, 0.0), 2.0, 0.002, 40)
+    )
+
+    val nClusters = circleSpecs.size
+    val cdata = createConcentricCirclesData(circleSpecs)
+    withSpark { sc =>
+      val vertices = new Random().shuffle(cdata.map { p =>
+        (p.label, new BDV(Array(p.x, p.y)))
+      })
+
+      val nVertices = vertices.length
+      val (ccenters, estCollected) = PIC.run(sc, vertices, nClusters, nIterations)
+      logger.info(s"Cluster centers: ${ccenters.mkString(",")}" +
+        s"\nEstimates: ${estCollected.mkString("[", ",", "]")}")
+      assert(ccenters.size == circleSpecs.length, "Did not get correct number of centers")
+    }
+  }
+
+}
+
+object PIClusteringSuite {
+  val logger = Logger.getLogger(getClass.getName)
+  val A = Array
+
+  def pdoub(d: Double) = f"$d%1.6f"
+
+  case class Point(label: Long, x: Double, y: Double) {
+    def this(x: Double, y: Double) = this(-1L, x, y)
+
+    override def toString() = s"($label, (${pdoub(x)},${pdoub(y)}))"
+  }
+
+  object Point {
+    def apply(x: Double, y: Double) = new Point(-1L, x, y)
+  }
+
+  case class CircleSpec(center: Point, radius: Double, noiseToRadiusRatio: Double,
+    nPoints: Int, uniformDistOnCircle: Boolean = true)
+
+  def createConcentricCirclesData(circleSpecs: Seq[CircleSpec]) = {
+    import org.apache.spark.mllib.random.StandardNormalGenerator
+    val normalGen = new StandardNormalGenerator
+    var idStart = 0
+    val circles = for (csp <- circleSpecs) yield {
+      idStart += 1000
+      val circlePoints = for (thetax <- 0 until csp.nPoints) yield {
+        val theta = thetax * 2 * Math.PI / csp.nPoints
+        val (x, y) = (csp.radius * Math.cos(theta)
+          * (1 + normalGen.nextValue * csp.noiseToRadiusRatio),
+          csp.radius * Math.sin(theta) * (1 + normalGen.nextValue * csp.noiseToRadiusRatio))
+        (Point(idStart + thetax, x, y))
+      }
+      circlePoints
+    }
+    val points = circles.flatten.sortBy(_.label)
+    logger.info(printPoints(points))
+    points
+  }
+
+  def printPoints(points: Seq[Point]) = {
+    points.mkString("[", ", ", "]")
+  }
+
+  def main(args: Array[String]) {
+    val pictest = new PIClusteringSuite
+    pictest.concentricCirclesTest()
+  }
+}
+
+/**
+ * Provides a method to run tests against a {@link SparkContext} variable that is correctly stopped
+ * after each test.
+ * TODO: import this from the graphx test cases package i.e. may need update to pom.xml
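The suite's createConcentricCirclesData places nPoints evenly around each circle and jitters the radius with Gaussian noise. A stripped-down sketch of the same idea - the names and the injected noise source are illustrative, not the suite's API:

```scala
// Place nPoints evenly around a circle of the given radius, perturbing the
// radius by (1 + gaussian * noiseToRadiusRatio). The noise source is passed
// in as a function so it can be stubbed out deterministically.
def circlePoints(radius: Double, nPoints: Int, noiseToRadiusRatio: Double,
                 nextGaussian: () => Double): Seq[(Double, Double)] =
  (0 until nPoints).map { i =>
    val theta = i * 2 * math.Pi / nPoints
    (radius * math.cos(theta) * (1 + nextGaussian() * noiseToRadiusRatio),
      radius * math.sin(theta) * (1 + nextGaussian() * noiseToRadiusRatio))
  }

// With the noise stubbed to zero, every point lies exactly on the circle.
val pts = circlePoints(2.0, 8, 0.0, () => 0.0)
```

Injecting the generator is what makes the geometry testable; the suite instead draws from MLlib's StandardNormalGenerator directly.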
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on the pull request:

    https://github.com/apache/spark/pull/4254#issuecomment-71932032

Hi, we have a suggestion here: separate the creation/definition of the input graph from the PIC itself:

    val G = PIC.createGaussianAffinityMatrix(sc, vertices)  // Create the affinity matrix in a Graph structure
    val (ccenters, estCollected) = PIC.run(sc, G, nClusters, nIterations)  // Run PIC

The new signatures are as follows. run() takes a Graph input instead of the input vertices:

    def run(sc: SparkContext,
            G: Graph[Double, Double],
            nClusters: Int,
            nIterations: Int = defaultIterations,
            sigma: Double = defaultSigma,
            minAffinity: Double = defaultMinAffinity,
            nRuns: Int = defaultKMeansRuns)

And here is the new method for creating the input Graph:

    def createGaussianAffinityMatrix(sc: SparkContext,
                                     points: Points,
                                     sigma: Double = defaultSigma,
                                     minAffinity: Double = defaultMinAffinity)
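The createGaussianAffinityMatrix signature above takes sigma and minAffinity. A minimal sketch of the per-pair computation those parameters imply - the unnormalized Gaussian kernel with sub-threshold affinities dropped; this is an assumed form for illustration, not the PR's implementation (any constant prefactor rescales all affinities uniformly and washes out under row normalization):

```scala
// Gaussian affinity between two points: exp(-||x - y||^2 / (2 * sigma^2)).
def gaussianAffinity(x: Array[Double], y: Array[Double], sigma: Double): Double = {
  val dist2 = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
  math.exp(-dist2 / (2.0 * sigma * sigma))
}

// Affinities below minAffinity are treated as zero, so no edge is created
// for that pair in the sparse affinity graph.
def affinityOrZero(x: Array[Double], y: Array[Double],
                   sigma: Double, minAffinity: Double): Double = {
  val a = gaussianAffinity(x, y, sigma)
  if (a >= minAffinity) a else 0.0
}

val identical = gaussianAffinity(Array(1.0, 2.0), Array(1.0, 2.0), 1.0)  // 1.0
val far = affinityOrZero(Array(0.0), Array(10.0), 1.0, 1e-11)            // 0.0
```

The minAffinity cutoff is what keeps the GraphX representation sparse: distant point pairs simply get no edge.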
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4254#discussion_r23728804

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/PIClusteringSuite.scala ---
@@ -0,0 +1,140 @@
(quoted hunk omitted - identical to the PIClusteringSuite.scala hunk quoted above, ending at the main() method)
--- End diff --

OK
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23728761 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/PIClustering.scala --- @@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{DenseMatrix => BDM, DenseVector => BDV}
+import org.apache.log4j.Logger
+import org.apache.spark.SparkContext
+import org.apache.spark.graphx._
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+import scala.language.existentials
+
+/**
+ * Implements the scalable graph clustering algorithm Power Iteration Clustering (see
+ * www.icml2010.org/papers/387.pdf). From the abstract:
+ *
+ * The input data is first transformed to a normalized Affinity Matrix via Gaussian pairwise
+ * distance calculations. Power iteration is then used to find a dimensionality-reduced
+ * representation. The resulting pseudo-eigenvector provides effective clustering - as
+ * performed by Parallel KMeans.
+ */
+object PIClustering {
--- End diff --

OK
[GitHub] spark pull request: [SPARK-4259][MLlib]: Add Power Iteration Clust...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/4254#discussion_r23728918 --- Diff: mllib/src/test/resources/log4j.mllib.properties --- @@ -0,0 +1,41 @@ +# --- End diff -- OK
[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...
Github user javadba closed the pull request at: https://github.com/apache/spark/pull/1586
[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-67383303 OK Michael, thanks for the update. 2014-12-17 11:21 GMT-08:00 Michael Armbrust notificati...@github.com: Hey @javadba https://github.com/javadba, thanks for all the work on this, especially the time figuring out the surprisingly complicated semantics for this function. Also, sorry for the delay with review/merging! I'd love to add this, but right now I'm concerned that the way we are adding UDFs is unsustainable. I've written up some thoughts on the right way to proceed in SPARK-4867 https://issues.apache.org/jira/browse/SPARK-4867. Since I'm trying really hard to keep the PR queue small (mostly to help avoid PRs that languish like this one has been), I propose we close this issue for now and reopen once the UDF framework has been updated. I've linked your issue to that one so you or others can use this code as a starting point. Reply to this email directly or view it on GitHub https://github.com/apache/spark/pull/1586#issuecomment-67377013.
[GitHub] spark pull request: [Spark-4060] [MLlib] exposing special rdd func...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/2907#issuecomment-60875650 RE: use case. We are considering using the treeAggregate function within the implementation of SpectralClustering. In addition, it was noted that EigenvalueDecomposition.symmetricEigs is private; it is likely we would want to use that too.
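For readers unfamiliar with `treeAggregate`: its point is that per-partition partial results are merged in rounds of pairwise combines rather than all at once on the driver. The following is a minimal, Spark-free sketch of that idea only (it is not Spark's implementation; partition results are modeled as a plain `Seq`, and `treeCombine` is a name invented here):

```scala
// Sketch of tree-style aggregation: merge adjacent pairs level by level,
// so no single step has to combine more than two partial results.
def treeCombine[T](partials: Seq[T])(merge: (T, T) => T): T = {
  require(partials.nonEmpty, "need at least one partial result")
  var level = partials
  while (level.length > 1) {
    // Merge adjacent pairs; a leftover odd element is carried up unchanged.
    level = level.grouped(2).map {
      case Seq(a, b) => merge(a, b)
      case Seq(a)    => a
    }.toSeq
  }
  level.head
}
```

In Spark proper, `RDD.treeAggregate` achieves the same effect by inserting intermediate reduce stages, which keeps the driver from becoming the bottleneck when there are many partitions with large partial results - the situation a spectral clustering implementation would hit.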
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user javadba opened a pull request: https://github.com/apache/spark/pull/1954 Merge pull request #1 from apache/master Syncing to upstream You can merge this pull request into a Git repository by running: $ git pull https://github.com/javadba/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1954.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1954 commit 1b5213183b8a9d5160bbcea1c420a04c0b6f5dfa Author: StephenBoesch java...@gmail.com Date: 2014-08-13T01:00:25Z Merge pull request #1 from apache/master Syncing to upstream
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user javadba closed the pull request at: https://github.com/apache/spark/pull/1954
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user javadba opened a pull request: https://github.com/apache/spark/pull/1955 Merge pull request #1 from apache/master Syncing to upstream You can merge this pull request into a Git repository by running: $ git pull https://github.com/javadba/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1955.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1955 commit 1b5213183b8a9d5160bbcea1c420a04c0b6f5dfa Author: StephenBoesch java...@gmail.com Date: 2014-08-13T01:00:25Z Merge pull request #1 from apache/master Syncing to upstream
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user javadba closed the pull request at: https://github.com/apache/spark/pull/1955
[GitHub] spark pull request: Merge pull request #1 from apache/master
GitHub user javadba reopened a pull request: https://github.com/apache/spark/pull/1955 Merge pull request #1 from apache/master Syncing to upstream You can merge this pull request into a Git repository by running: $ git pull https://github.com/javadba/spark master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1955.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1955 commit 1b5213183b8a9d5160bbcea1c420a04c0b6f5dfa Author: StephenBoesch java...@gmail.com Date: 2014-08-13T01:00:25Z Merge pull request #1 from apache/master Syncing to upstream
[GitHub] spark pull request: Merge pull request #1 from apache/master
Github user javadba closed the pull request at: https://github.com/apache/spark/pull/1955
[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51614247 I have been waiting here for @ueshin and @marmbrus to decide on next steps. Please clarify what the next steps are at this point.
[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51640101 @marmbrus I am fine with delays on this - I was just unclear as to whether there was some expectation of action on my part. Overall this is a minor enhancement, but it has generated a non-trivial amount of interaction and effort. I can understand if this were put on hold.
[GitHub] spark pull request: Fix tiny bug (likely copy and paste error) in ...
GitHub user javadba opened a pull request: https://github.com/apache/spark/pull/1792 Fix tiny bug (likely copy and paste error) in closing jdbc connection I inquired on the dev mailing list about the motivation for checking the jdbc statement instead of the connection in the close() logic of JdbcRDD. Ted Yu believes there essentially is none - it is a simple cut and paste issue. So here is the tiny fix to patch it. You can merge this pull request into a Git repository by running: $ git pull https://github.com/javadba/spark closejdbc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1792.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1792 commit 095b2c9127b161437013476d48d855714716449f Author: Stephen Boesch javadba Date: 2014-08-05T19:46:55Z Fix tiny bug (likely copy and paste error) in closing jdbc connection
[GitHub] spark pull request: SPARK-2869 - Fix tiny bug in JdbcRdd for closi...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/1792#discussion_r15849947 --- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala --- @@ -106,7 +106,7 @@ class JdbcRDD[T: ClassTag](
       case e: Exception => logWarning("Exception closing statement", e)
     }
     try {
-      if (null != conn && ! stmt.isClosed()) conn.close()
+      if (null != conn && ! conn.isClosed()) conn.close()
--- End diff --

sure - done.
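The pattern the diff above corrects generalizes: each JDBC resource gets its own null/closed guard, and a failure closing one resource is swallowed (logged) so that later resources still get closed. A hedged sketch of that pattern, with a tiny `Closable` trait standing in for `java.sql.Statement`/`Connection` (names `Closable` and `closeQuietly` are invented here, not Spark API):

```scala
// Stand-in for the parts of java.sql.Statement/Connection that close() needs.
trait Closable { def isClosed: Boolean; def close(): Unit }

// Close a resource only if it exists and is still open; never let the
// close itself abort cleanup of the remaining resources.
def closeQuietly(resource: Closable, name: String): Unit =
  try {
    if (null != resource && !resource.isClosed) resource.close()
  } catch {
    case e: Exception => println(s"Exception closing $name: ${e.getMessage}")
  }
```

The bug in JdbcRDD was exactly a crossed guard: the connection branch tested `stmt.isClosed()` instead of `conn.isClosed()`, so a closed-but-non-null statement could skip closing a still-open connection.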
[GitHub] spark pull request: SPARK-2638 MapOutputTracker concurrency improv...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1542#issuecomment-51022369 Thanks for commenting Josh. I will see about putting together something on this including solid testcases. ETA later in the coming week.
[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51065193 Hi, for some reason the CORE module testing time has ballooned: it took over 7.5 hours to run. There was one timeout error out of 736 tests - and it is quite unlikely to have anything to do with the code added in this PR. Here is the test that failed, and then the overall results:

DriverSuite: Spark assembly has been built with Hive, including Datanucleus jars on classpath
- driver should exit after finishing *** FAILED ***
  TestFailedDueToTimeoutException was thrown during property evaluation. (DriverSuite.scala:40)
  Message: The code passed to failAfter did not complete within 60 seconds.
  Location: (DriverSuite.scala:41)
  Occurred at table row 0 (zero based, not counting headings), which had values ( master = local )
Tests: succeeded 723, failed 1, canceled 0, ignored 7, pending 0
*** 1 TEST FAILED ***
[INFO] Reactor Summary:
[INFO] Spark Project Parent POM ... SUCCESS [ 1.180 s]
[INFO] Spark Project Core . FAILURE [ 07:35 h]

So I am not presently in a position to run regression tests - given the overall runtime will be double-digit hours. Would someone please run Jenkins on this code?
[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51146419 @ueshin I repeatably verified that simply changing OCTET_LEN to OCTET_LENGTH ended up causing the SOF. By repeatably I mean:

- Set the 'constant' val OCTET_LENGTH = Keyword("OCTET_LENGTH") and observe the error
- Change it to something like val OCTET_LENGTH = Keyword("OCTET_LEN") or val OCTET_LENGTH = Keyword("OCTET_LENG") and observe the error has gone away
- Rinse, cleanse, repeat

I have been able to demonstrate this multiple times. Now the regression tests have been run against the modified and reliable code. Please re-run your tests in a fresh area. I will do the same, but I am hesitant to revert because we have positive test results now with the latest commit (as well as my results of the problem before the commit).
[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51149151 @ueshin I have git clone'd to a completely new area, and I reverted my last commit:

git clone https://github.com/javadba/spark.git strlen2
cd strlen2
git checkout strlen
git revert 22eddbce6a201c8f5b5c31859ceb972e60657377
mvn -DskipTests -Pyarn -Phive -Phadoop-2.3 clean compile package
mvn -Pyarn -Phive -Phadoop-2.3 test -DwildcardSuites=org.apache.spark.sql.hive.execution.HiveQuerySuite,org.apache.spark.sql.SQLQuerySuite,org.apache.spark.sql.catalyst.expressions.ExpressionEvaluationSuite

I get precisely the same error:

HiveQuerySuite:
21:03:31.120 WARN org.apache.spark.util.Utils: Your hostname, mithril resolves to a loopback address: 127.0.1.1; using 10.0.0.33 instead (on interface eth0)
21:03:31.121 WARN org.apache.spark.util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21:03:37.294 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21:03:40.045 WARN com.jolbox.bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
21:03:49.464 WARN com.jolbox.bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
21:03:49.487 WARN org.apache.hadoop.hive.metastore.ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 0.12.0
21:03:57.157 WARN com.jolbox.bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
21:03:57.593 WARN com.jolbox.bonecp.BoneCPConfig: Max Connections < 1. Setting to 20
- single case
- double case
- case else null
- having no references
- boolean = number
- CREATE TABLE AS runs once
- between
- div
- division *** RUN ABORTED ***
java.lang.StackOverflowError:
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)

Now, let's revert the revert:

git log
commit db09cd132c2d7e995287eea54f3415726934138c
Author: Stephen Boesch javadba
Date: Mon Aug 4 20:54:24 2014 -0700
Revert "Use Octet/Char_Len instead of Octet/Char_length due to apparent preexisting spark ParserCombinator bug."
This reverts commit 22eddbce6a201c8f5b5c31859ceb972e60657377.

git revert db09cd132c2d7e995287eea54f3415726934138c
mvn -Pyarn -Phive -Phadoop-2.3 test -DwildcardSuites=org.apache.spark.sql.hive.execution.HiveQuerySuite,org.apache.spark.sql.SQLQuerySuite,org.apache.spark.sql.catalyst.expressions.ExpressionEvaluationSuite

Now those three test suites pass again (specifically, HiveQuerySuite did not fail). And just to be *extra* sure here - that we can toggle between pass/fail an arbitrary number of times:

commit 602adedc9ca58d99957eb12bd91098ffe904604c
Author: Stephen Boesch javadba
Date: Mon Aug 4 21:18:53 2014 -0700
Revert "Revert "Use Octet/Char_Len instead of Octet/Char_length due to apparent preexisting spark ParserCombinator bug.""

git revert 602adedc9ca58d99957eb12bd91098ffe904604c

And once again HiveQuerySuite fails with the same error. So I have established clearly the following: the strlen branch on my fork fails with SOF if we roll back the commit that changes OCTET/CHAR_LENGTH to OCTET/CHAR_LEN.
[GitHub] spark pull request: SPARK-2712 - Add a small note to maven doc tha...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1615#issuecomment-50994967 Apologies for the long delay - it was induced by a moderately arduous process of learning git workflows/rebase-ing.
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51016510 The following division test in HiveQuerySuite is failing with StackOverflowError. I have no idea why, given there is no obvious connection to the added code in this PR:

createQueryTest("division", "SELECT 2 / 1, 1 / 2, 1 / 3, 1 / COUNT(*) FROM src LIMIT 1")

- division *** RUN ABORTED ***
java.lang.StackOverflowError:
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
...
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51016969 I have narrowed the problem down to the SQLParser. I will update when the precise cause is determined, likely within the hour.
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51017366 This is strange. The change that causes the SOF is just renaming OCTET/CHAR_LEN to OCTET/CHAR_LENGTH.

Working:
protected val CHAR_LEN = Keyword("CHAR_LEN")
protected val OCTET_LEN = Keyword("OCTET_LEN")

StackOverflowError:
protected val CHAR_LENGTH = Keyword("CHAR_LENGTH")
protected val OCTET_LENGTH = Keyword("OCTET_LENGTH")
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51018296 Surprising result here: the following change makes this work.

StackOverflowError:
protected val OCTET_LENGTH = Keyword("OCTET_LENGTH")

Works fine:
protected val OCTET_LENGTH = Keyword("OCTET_LEN")

Also works fine:
protected val OCTET_LENGTH = Keyword("OCTET_LENG")

Let's double check - make sure this really repeatably fails on OCTET_LENGTH: and yes! It does fail again with OCTET_LENGTH. We have a clear test failure scenario. So here we are: we need to use OCTET/CHAR_LEN and NOT OCTET/CHAR_LENGTH - until the root cause of this unrelated parser bug is found! Should I open a separate JIRA for the parser bug? BTW my theory is that there is something happening when one KEYWORD contains another KEYWORD. But OTOH the LEN keyword is not causing an issue, so this is a subtle case to understand.
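The thread never pins down the root cause of the stack overflow, but the theory floated here - one keyword containing another - is at least a classic lexer hazard, and it can be illustrated without any parser library. The sketch below is a hypothetical, deliberately naive matcher (plain string prefix tests, not the actual SQLParser or scala-parser-combinators; `firstKeyword` is a name invented for this illustration):

```scala
// A naive lexer step: try keywords in declaration order, return the first
// keyword that matches plus the unconsumed remainder of the input.
def firstKeyword(input: String, keywords: Seq[String]): Option[(String, String)] =
  keywords.find(k => input.startsWith(k)).map(k => (k, input.drop(k.length)))

// Declaration order as in the PR: the short keyword is a prefix of the long one.
val declared = Seq("OCTET_LEN", "OCTET_LENGTH")
// Preferring the longest keyword first removes the ambiguity.
val longestFirst = declared.sortBy(k => -k.length)
```

With `declared` order, input `OCTET_LENGTH(name)` matches the shorter `OCTET_LEN` and leaves a dangling `GTH(name)` that no rule can consume; with `longestFirst`, the full keyword is consumed. Whether this is what drove the `Parsers.append` recursion in the real bug is unconfirmed - it only shows why keyword prefixes deserve suspicion.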
[GitHub] spark pull request: SPARK-2686 Add Length and OctetLen support to ...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-51020150 The rename change was committed/pushed and the most germane tests pass. I am re-running the full regression. One thing I have noticed already: the flume-sink external project is failing - it looks to be unrelated to any of my work, but I am looking into it.
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50856746 @ueshin I mostly agree, except: let us keep length, which can be used for non-strings, e.g. length(12345678) = 8. Then, since length handles strings as well by using codePoints as you suggest, CharLength is a synonym for Length (not the other way around). As for OctetLength: I am quite fine with that and will implement it according to your suggestion.
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50858453 @marmbrus OK, fine with that. Then, given the inputs from ueshin, we are presently at:

len(gth)/char_length: takes a single string argument and uses codePointCount
octet_length: takes two arguments (string, charset) and returns the number of bytes
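The agreed semantics above map directly onto the JDK string API. This is a sketch of the semantics only, not the PR's actual `Expression` classes (the function names `charLength`/`octetLength` are chosen here for illustration). Note why `octet_length` takes an explicit charset: the no-arg `String.getBytes` uses the platform default encoding, so its result varies between environments.

```scala
// char_length counts Unicode code points, so a surrogate pair is one character.
def charLength(s: String): Int = s.codePointCount(0, s.length)

// octet_length counts bytes under an explicit charset, independent of the
// platform default encoding.
def octetLength(s: String, charset: String): Int = s.getBytes(charset).length
```

For example, U+10437 is one code point stored as two UTF-16 chars (`String.length` == 2, `charLength` == 1) and four UTF-8 bytes, while U+2345 is three UTF-8 bytes but a single `?` byte under ISO-8859-1, which cannot represent it.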
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50858624 @marmbrus RE: CharLength for the expression - also fine, will do. (btw how did you highlight in the comment?) ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50923836 I am delayed in providing the next implementation, due to continuing investigation here.

1) The default encoding seems to have changed from JDK 6 (ISO-8859-1) to JDK 7 (UTF-8?). Here the getBytes method returns substantially different results based on which JDK is used:

JDK 6: "\u2345".getBytes.length  =>  3
JDK 7: "\u2345".getBytes.length  =>  1

2) I have been absorbing the Hive binary/string implementations. The logic is short and easy to follow, but I am working through how to properly test this. Based on Takuya's (correct) comment about Hive length support for both binary and string, let us take a look at it first. From Hive's o.a.h.h.ql.udf.UDFLength:

@Description(name = "length",
    value = "_FUNC_(str | binary) - Returns the length of str or number of bytes in binary data",
    extended = "Example:\n  > SELECT _FUNC_('Facebook') FROM src LIMIT 1;\n  8")
@VectorizedExpressions({StringLength.class})
public class UDFLength extends UDF {
  private final IntWritable result = new IntWritable();

  public IntWritable evaluate(Text s) {
    if (s == null) {
      return null;
    }
    byte[] data = s.getBytes();
    int len = 0;
    for (int i = 0; i < s.getLength(); i++) {
      if (GenericUDFUtils.isUtfStartByte(data[i])) {
        len++;
      }
    }
    result.set(len);
    return result;
  }

  public IntWritable evaluate(BytesWritable bw) {
    if (bw == null) {
      return null;
    }
    result.set(bw.getLength());
    return result;
  }
}

So in Hive the invocations of length would be:

String: select length(my_string) from some_table;
Binary: select length(cast(my_string as binary)) from some_table;

As noted above, the result will likely not be consistent across all instances of Hive; in particular, Hive on JDK 6 should give a different answer (I only have JDK 7 in my testing environment). I am still pondering how to handle these differences. ---
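The two issues discussed above (platform-dependent `getBytes()` and Hive's start-byte counting) can be sketched in a few lines of plain Java. The `utf8CharCount` helper below reimplements the check that Hive's `GenericUDFUtils.isUtfStartByte` performs (a byte is a UTF-8 continuation byte iff its top two bits are `10`); it is an illustrative sketch, not Hive's actual class:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetPitfall {
    // Hive-style character count over UTF-8 bytes: count only start bytes,
    // i.e. bytes that are NOT of the continuation form 10xxxxxx.
    static int utf8CharCount(byte[] data) {
        int len = 0;
        for (byte b : data) {
            if ((b & 0xC0) != 0x80) {
                len++;
            }
        }
        return len;
    }

    public static void main(String[] args) {
        String s = "\u2345";
        // Platform-dependent: this is the source of the JDK discrepancy noted above
        System.out.println(s.getBytes().length + " (default: " + Charset.defaultCharset() + ")");
        // Pinning the charset makes the result deterministic across JVMs
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 3
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 1 ('?' substitution)
        System.out.println(utf8CharCount(s.getBytes(StandardCharsets.UTF_8))); // 1 character
    }
}
```

Passing an explicit `Charset` sidesteps the JDK 6 vs JDK 7 question entirely, since only no-argument `getBytes()` consults the platform default.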
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50936508 Hi, I have updated the codebase to match our present level of discussions - and to include the merge from upstream. Takuya's test cases have been incorporated as well - but I found some differences in my results. @ueshin please review the following, some of which do differ from your results presented above - especially for UTF-16:

checkEvaluation(OctetLength(
  Literal("\uF93D\uF936\uF949\uF942", StringType), "ISO-8859-1"), 4)  // Chinese characters get truncated by ISO-8859-1 encoding
checkEvaluation(OctetLength(
  Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-8"), 12)      // Chinese characters
checkEvaluation(OctetLength(
  Literal("\uD840\uDC0B\uD842\uDFB7", StringType), "UTF-8"), 8)       // 2 surrogate pairs
checkEvaluation(OctetLength(
  Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-16"), 10)     // Chinese characters
checkEvaluation(OctetLength(
  Literal("\uD840\uDC0B\uD842\uDFB7", StringType), "UTF-16"), 10)     // 2 surrogate pairs
checkEvaluation(OctetLength(
  Literal("\uF93D\uF936\uF949\uF942", StringType), "UTF-32"), 16)     // Chinese characters
checkEvaluation(OctetLength(
  Literal("\uD840\uDC0B\uD842\uDFB7", StringType), "UTF-32"), 8)      // 2 surrogate pairs ---
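The expected byte counts in the test cases above can be reproduced directly with the JDK's charset encoders (a standalone sketch, not the PR's test suite; note that Java's "UTF-16" encoder prepends a 2-byte BOM, which accounts for the counts of 10):

```java
import java.nio.charset.Charset;

public class OctetLengthCheck {
    static int octets(String s, String charset) {
        return s.getBytes(Charset.forName(charset)).length;
    }

    public static void main(String[] args) {
        String cjk = "\uF93D\uF936\uF949\uF942";   // 4 BMP characters
        String pairs = "\uD840\uDC0B\uD842\uDFB7"; // 2 surrogate pairs (2 code points)

        System.out.println(octets(cjk, "ISO-8859-1")); // 4  (each unmappable char becomes '?')
        System.out.println(octets(cjk, "UTF-8"));      // 12 (3 bytes per char)
        System.out.println(octets(pairs, "UTF-8"));    // 8  (4 bytes per supplementary code point)
        System.out.println(octets(cjk, "UTF-16"));     // 10 (2-byte BOM + 4 chars * 2 bytes)
        System.out.println(octets(pairs, "UTF-16"));   // 10 (2-byte BOM + 4 chars * 2 bytes)
        System.out.println(octets(cjk, "UTF-32"));     // 16 (4 bytes per code point, no BOM)
        System.out.println(octets(pairs, "UTF-32"));   // 8
    }
}
```

The BOM is why UTF-16 looks anomalous: both test strings are 4 UTF-16 code units, so both encode to 8 payload bytes plus the 2-byte byte-order mark.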
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50936636 grrr.. My push got some other extraneous changes of mine. I will fix that now. ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50946925 Cleaned up now. Re-ran the affected tests - and now re-running all tests. ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50947836 @ueshin Actually after re-checking it appears that my results match the ones you had placed in your comment. ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/1586#discussion_r15726383

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
@@ -208,6 +211,96 @@ case class EndsWith(left: Expression, right: Expression)
   def compare(l: String, r: String) = l.endsWith(r)
 }

+/**
+ * A function that returns the number of bytes in an expression
+ */
+case class Length(child: Expression) extends UnaryExpression {
+
+  type EvaluatedType = Any
+
+  override def dataType = IntegerType
+
+  override def foldable = child.foldable
+
+  override def nullable = child.nullable
+
+  override def toString = s"Length($child)"
+
+  override def eval(input: Row): EvaluatedType = {
+    val inputVal = child.eval(input)
+    if (inputVal == null) {
+      null
+    } else if (!inputVal.isInstanceOf[String]) {
+      inputVal.toString.length
+    } else {
+      OctetLenUtils.len(inputVal.asInstanceOf[String])
--- End diff --

How can we use just String.codePointCount? You need to provide the start/end points, and we do not know them without doing this kind of count. ---
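For reference, the answer to the question above: the start/end arguments to codePointCount are char (UTF-16 code unit) indices, not code point indices, so passing 0 and s.length() counts the code points of the whole string without any manual scan. A sketch (the class/method names here are illustrative, not the PR's code):

```java
public class CodePointLen {
    static int len(String s) {
        // start/end are char indices; 0..s.length() spans the entire string
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(len("Facebook"));                 // 8
        System.out.println(len("\uD840\uDC0B\uD842\uDFB7")); // 2: each surrogate pair counts once
    }
}
```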
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/1586#discussion_r15726388

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
+      OctetLenUtils.len(inputVal.asInstanceOf[String])
--- End diff --

I do agree String#getBytes will return different results based on, for example, JDK 6 vs JDK 7. Do you have a recommendation on this? ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/1586#discussion_r15726391

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
+object OctetLengthConstants {
+  val DefaultEncoding = "ISO-8859-1"
--- End diff --

OK, maybe that makes sense. I will change it, and let's continue to consider. ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50951491 BTW I am seeing testing errors in core for MapOutputTracker involving an OOME. I have had to restart testing a few times already. This has nothing to do with the code of this PR; I am just trying to run overall regression testing but failing in unrelated tests. Oh, it just failed again:

ClosureCleanerSuite:
- closures inside an object *** FAILED ***
  org.apache.spark.SparkException: Error communicating with MapOutputTracker
  at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:111)
  at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:117)
  at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:324)
  at org.apache.spark.SparkEnv.stop(SparkEnv.scala:83)
  at org.apache.spark.SparkContext.stop(SparkContext.scala:1003)
  at org.apache.spark.LocalSparkContext$.stop(LocalSparkContext.scala:50)
  at org.apache.spark.LocalSparkContext$.withSpark(LocalSparkContext.scala:61)
  at org.apache.spark.util.TestObject$.run(ClosureCleanerSuite.scala:76)
  at org.apache.spark.util.ClosureCleanerSuite$$anonfun$1.apply$mcV$sp(ClosureCleanerSuite.scala:27)
  at org.apache.spark.util.ClosureCleanerSuite$$anonfun$1.apply(ClosureCleanerSuite.scala:27)
  ...
Cause: akka.pattern.AskTimeoutException: Recipient[Actor[akka://spark/user/MapOutputTracker#221210107]] had already been terminated.
  at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
  at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:106)
  at org.apache.spark.MapOutputTracker.sendTracker(MapOutputTracker.scala:117)
  at org.apache.spark.MapOutputTrackerMaster.stop(MapOutputTracker.scala:324)
  at org.apache.spark.SparkEnv.stop(SparkEnv.scala:83)
  at org.apache.spark.SparkContext.stop(SparkContext.scala:1003)
  at org.apache.spark.LocalSparkContext$.stop(LocalSparkContext.scala:50)
  at org.apache.spark.LocalSparkContext$.withSpark(LocalSparkContext.scala:61)
  at org.apache.spark.util.TestObject$.run(ClosureCleanerSuite.scala:76)
  at org.apache.spark.util.ClosureCleanerSuite$$anonfun$1.apply$mcV$sp(ClosureCleanerSuite.scala:27)
  ...
*** RUN ABORTED ***
java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:679)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.startThread(QueuedThreadPool.java:441)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.doStart(QueuedThreadPool.java:108)
  at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
  at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:81)
  at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:58)
  at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:96)
  at org.eclipse.jetty.server.Server.doStart(Server.java:282)
  at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
  ... ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50951617 Ran the tests again and got the same "Error communicating with MapOutputTracker", but in a different test this time:

ImplicitOrderingSuite:
- basic inference of Orderings *** RUN ABORTED ***
  org.apache.spark.SparkException: Error communicating with MapOutputTracker
  .. ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on the pull request: https://github.com/apache/spark/pull/1586#issuecomment-50951779 I ran another time and got an OOME this time:

ImplicitOrderingSuite:
*** RUN ABORTED ***
java.lang.OutOfMemoryError: unable to create new native thread
  at java.lang.Thread.start0(Native Method)
  at java.lang.Thread.start(Thread.java:679)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.startThread(QueuedThreadPool.java:441)
  at org.eclipse.jetty.util.thread.QueuedThreadPool.doStart(QueuedThreadPool.java:108)
  at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
  at org.eclipse.jetty.util.component.AggregateLifeCycle.doStart(AggregateLifeCycle.java:81)
  at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:58)
  at org.eclipse.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:96)
  at org.eclipse.jetty.server.Server.doStart(Server.java:282)
  at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)

I do not understand. You can see the file diffs for this PR: the modified logic has nothing to do with these core operations - only the following modules should be affected: sql/catalyst, sql/core, sql/hive, sql/hive/compatibility. But these errors are happening in core. ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/1586#discussion_r15726702

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
+      OctetLenUtils.len(inputVal.asInstanceOf[String])
--- End diff --

OK, I will try this. ---
[GitHub] spark pull request: SPARK-2686 Add Length and Strlen support to Sp...
Github user javadba commented on a diff in the pull request: https://github.com/apache/spark/pull/1586#discussion_r15726780

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala ---
+      OctetLenUtils.len(inputVal.asInstanceOf[String])
--- End diff --

Looks like it is working. I am pushing the change; in the meanwhile I will rerun all regression tests. ---