[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4622 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-168112592 I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-121487220 @mengxr any plan to revisit this and merge it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r30020688 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e:
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r30052128 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e:
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29949043 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e:
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29884174 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,496 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity ==
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29883577 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,496 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity ==
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99804275 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99804294 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99804679 [Test build #32103 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32103/consoleFull) for PR 4622 at commit [`0c7a26f`](https://github.com/apache/spark/commit/0c7a26f56e85febfe1edac0d85251de9b0bf91e0). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29896816 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD +import org.apache.spark.storage.StorageLevel + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e:
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99837389 [Test build #32103 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32103/consoleFull) for PR 4622 at commit [`0c7a26f`](https://github.com/apache/spark/commit/0c7a26f56e85febfe1edac0d85251de9b0bf91e0). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class JavaAffinityPropagation ` * `case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long)` * `case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])` * `class AffinityPropagationModel(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99837449 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32103/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99837445 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29738697 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { --- End diff -- I have the same concern @mengxr has. Can that particular cluster contains lots of vertexes that run out of memory? This can return `RDD[Long]`, but with data structure of `(vertex id, cluster id)`, it seems it requires two passes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29738695 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { +val cluster = clusters.filter(_.members.contains(vertexID)).collect() +if (cluster.nonEmpty) { + cluster(0).members +} else { + null +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val clusterIds = clusters.flatMap { cluster = + if (cluster.members.contains(vertexID)) { +Seq(cluster.id) + } else { +Seq() + } +}.collect() +if (clusterIds.nonEmpty) { + clusterIds(0) +} else { + -1 +} + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity == e.similarity availability == e.availability + responsibility == e.responsibility + case _ = +false +} + } +} + +/** + * The data stored in each vertex on the graph + */ +private case class VertexData(availability: Double, responsibility: Double) + +/** + * :: Experimental :: + * + * Affinity propagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://doi.org/10.1126/science.1136800 Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * @param lambda lambda parameter used in the messaging iteration loop + * @param normalization Indication of performing normalization + * @param symmetric Indication of using symmetric similarity input + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29738770 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/clustering/AffinityPropagationSuite.scala --- @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.scalatest.FunSuite + +import org.apache.spark.graphx.{Edge, Graph} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ + +class AffinityPropagationSuite extends FunSuite with MLlibTestSparkContext { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + test(affinity propagation) { +/* + We use the following graph to test AP. + + 15-14 -13 12 + . \ / + 4 . 3 . 2 + | . | + 5 0 . 1 10 + | \ . . + 6 7 . 8 - 9 - 11 + */ + --- End diff -- Mirror: we use the following style ``` /** * . * */ ``` for comment --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99354269 General question. In euclidean space, the negative squared error is used as similarity. If we want to use affinity propagation to clustering lots of samples in euclidean space, it's impossible to create all the pairs of similarity data even it's symmetrical. What's the criteria to filter out those pairs which have very low similarity? Also, it's impossible to compute all the pairs of `RDD[Vector]` since it's O(N^2) operation, and how people address this in practice? I really like this algorithm, but still have concern about how people can use it in practice. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29777663 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { +val cluster = clusters.filter(_.members.contains(vertexID)).collect() +if (cluster.nonEmpty) { + cluster(0).members +} else { + null +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val clusterIds = clusters.flatMap { cluster = + if (cluster.members.contains(vertexID)) { +Seq(cluster.id) + } else { +Seq() + } +}.collect() +if (clusterIds.nonEmpty) { + clusterIds(0) +} else { + -1 +} + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity == e.similarity availability == e.availability + responsibility == e.responsibility + case _ = +false +} + } +} + +/** + * The data stored in each vertex on the graph + */ +private case class VertexData(availability: Double, responsibility: Double) + +/** + * :: Experimental :: + * + * Affinity propagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://doi.org/10.1126/science.1136800 Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * @param lambda lambda parameter used in the messaging iteration loop + * @param normalization Indication of performing normalization + * @param symmetric Indication of using symmetric similarity input + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99576733 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99565738 @viirya Maybe you can comment on this in the documentation and in the comment of the code, and it will be useful for people trying to understand the use-case. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99577439 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32017/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99577433 [Test build #32017 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32017/consoleFull) for PR 4622 at commit [`e062a94`](https://github.com/apache/spark/commit/e062a94e580e51d7cb5ff7b3d0e93be9c3edc704). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class JavaAffinityPropagation ` * `case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long)` * `case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])` * `class AffinityPropagationModel(` * `class JoinedRow6 extends Row ` * `case class WindowSpecDefinition(` * `case class WindowSpecReference(name: String) extends WindowSpec` * `sealed trait FrameBoundary ` * `case class ValuePreceding(value: Int) extends FrameBoundary ` * `case class ValueFollowing(value: Int) extends FrameBoundary ` * `case class SpecifiedWindowFrame(` * `trait WindowFunction extends Expression ` * `case class UnresolvedWindowFunction(` * `case class UnresolvedWindowExpression(` * `case class WindowExpression(` * `case class WithWindowDefinition(` * `case class Window(` * `case class Window(` * ` case class ComputedWindow(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99577436 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99576709 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29791336 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,496 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity ==
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99577109 [Test build #32017 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32017/consoleFull) for PR 4622 at commit [`e062a94`](https://github.com/apache/spark/commit/e062a94e580e51d7cb5ff7b3d0e93be9c3edc704). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29791211 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,496 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity ==
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29780015 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { +val cluster = clusters.filter(_.members.contains(vertexID)).collect() +if (cluster.nonEmpty) { + cluster(0).members +} else { + null +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val clusterIds = clusters.flatMap { cluster = + if (cluster.members.contains(vertexID)) { +Seq(cluster.id) + } else { +Seq() + } +}.collect() +if (clusterIds.nonEmpty) { + clusterIds(0) +} else { + -1 +} + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity == e.similarity availability == e.availability + responsibility == e.responsibility + case _ = +false +} + } +} + +/** + * The data stored in each vertex on the graph + */ +private case class VertexData(availability: Double, responsibility: Double) + +/** + * :: Experimental :: + * + * Affinity propagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://doi.org/10.1126/science.1136800 Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * @param lambda lambda parameter used in the messaging iteration loop + * @param normalization Indication of performing normalization + * @param symmetric Indication of using symmetric similarity input + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29777568 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { +val cluster = clusters.filter(_.members.contains(vertexID)).collect() +if (cluster.nonEmpty) { + cluster(0).members +} else { + null +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val clusterIds = clusters.flatMap { cluster = + if (cluster.members.contains(vertexID)) { +Seq(cluster.id) + } else { +Seq() + } +}.collect() +if (clusterIds.nonEmpty) { + clusterIds(0) +} else { + -1 +} + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity == e.similarity availability == e.availability + responsibility == e.responsibility + case _ = +false +} + } +} + +/** + * The data stored in each vertex on the graph + */ +private case class VertexData(availability: Double, responsibility: Double) + +/** + * :: Experimental :: + * + * Affinity propagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://doi.org/10.1126/science.1136800 Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * @param lambda lambda parameter used in the messaging iteration loop + * @param normalization Indication of performing normalization + * @param symmetric Indication of using symmetric similarity input + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29778597 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { +val cluster = clusters.filter(_.members.contains(vertexID)).collect() +if (cluster.nonEmpty) { + cluster(0).members +} else { + null +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val clusterIds = clusters.flatMap { cluster = + if (cluster.members.contains(vertexID)) { +Seq(cluster.id) + } else { +Seq() + } +}.collect() +if (clusterIds.nonEmpty) { + clusterIds(0) +} else { + -1 +} + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity == e.similarity availability == e.availability + responsibility == e.responsibility + case _ = +false +} + } +} + +/** + * The data stored in each vertex on the graph + */ +private case class VertexData(availability: Double, responsibility: Double) + +/** + * :: Experimental :: + * + * Affinity propagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://doi.org/10.1126/science.1136800 Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * @param lambda lambda parameter used in the messaging iteration loop + * @param normalization Indication of performing normalization + * @param symmetric Indication of using symmetric similarity input + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-99541423 @dbtsai Thanks for comments and the question. I have no very good answer to the question. We have used a threshold to filter out the insignificant similarities of large scale data. As you said, it is impossible to compute all the pairs. However, the data we process is very high dimensional and very sparse, so the all-pair computation can be much reduced by only considering the pairs having corresponding dimensions with values more than zero (or a threshold defined). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user duncanfinney commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29823539 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,496 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param member cluster member. + */ +@Experimental +case class AffinityPropagationAssignment(val id: Long, val exemplar: Long, val member: Long) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster member. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param assignments the cluster assignments of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val assignments: RDD[AffinityPropagationAssignment]) extends Serializable { + + /** + * Get the number of clusters + */ + lazy val k: Long = assignments.map(_.id).distinct.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[RDD]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): RDD[Long] = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assignments.filter(_.id == assign(0).id).map(_.member) +} else { + assignments.sparkContext.emptyRDD[Long] +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val assign = assignments.filter(_.member == vertexID).collect() +if (assign.nonEmpty) { + assign(0).id +} else { + -1 +} + } + + /** + * Turn cluster assignments to cluster representations [[AffinityPropagationCluster]]. + * @return a [[RDD]] that contains all clusters generated by Affinity Propagation. Because the + * cluster members in [[AffinityPropagationCluster]] is an [[Array]], it could consume too much + * memory even run out of memory when you call collect() on the returned [[RDD]]. + */ + def fromAssignToClusters(): RDD[AffinityPropagationCluster] = { +assignments.map { assign = ((assign.id, assign.exemplar), assign.member) } + .aggregateByKey(mutable.Set[Long]())( +seqOp = (s, d) = s ++ mutable.Set(d), +combOp = (s1, s2) = s1 ++ s2 + ).map(kv = new AffinityPropagationCluster(kv._1._1, kv._1._2, kv._2.toArray)) + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity ==
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29677618 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) --- End diff -- Is it reasonable to assume that `k` is small but the number of vertices are large? Then storing members as `Array[Long]` may run out of memory. We can store `id` and `exemplar` and the driver and cluster assignments distributively as `RDD[(Long, Int)]` (vertex id, cluster id). Lookup becomes less expensive in this setup. Btw, is it sufficient to use `Int` for cluster id? It won't provide much information if AP outputs more than `Int.MaxValue` clusters. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29680209 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() --- End diff -- `getK` - `k` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29680338 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { +val cluster = clusters.filter(_.members.contains(vertexID)).collect() +if (cluster.nonEmpty) { + cluster(0).members +} else { + null +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { --- End diff -- This could be simplified if we store cluster assignment in `RDD[(Long, Int)]`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29682634 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { +val cluster = clusters.filter(_.members.contains(vertexID)).collect() +if (cluster.nonEmpty) { + cluster(0).members +} else { + null +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val clusterIds = clusters.flatMap { cluster = + if (cluster.members.contains(vertexID)) { +Seq(cluster.id) + } else { +Seq() + } +}.collect() +if (clusterIds.nonEmpty) { + clusterIds(0) +} else { + -1 +} + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity == e.similarity availability == e.availability + responsibility == e.responsibility + case _ = +false +} + } +} + +/** + * The data stored in each vertex on the graph + */ +private case class VertexData(availability: Double, responsibility: Double) + +/** + * :: Experimental :: + * + * Affinity propagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://doi.org/10.1126/science.1136800 Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * @param lambda lambda parameter used in the messaging iteration loop + * @param normalization Indication of performing normalization + * @param symmetric Indication of using symmetric similarity input + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r29682601 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,475 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param id cluster id. + * @param exemplar cluster exemplar. + * @param members cluster members. + */ +@Experimental +case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long]) + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the clusters of AffinityPropagation clustering results. + */ +@Experimental +class AffinityPropagationModel( +val clusters: RDD[AffinityPropagationCluster]) extends Serializable { + + /** + * Set the number of clusters + */ + lazy val getK: Long = clusters.count() + + /** + * Find the cluster the given vertex belongs + * @param vertexID vertex id. + * @return a [[Array]] that contains vertex ids in the same cluster of given vertexID. If + * the given vertex doesn't belong to any cluster, return null. + */ + def findCluster(vertexID: Long): Array[Long] = { +val cluster = clusters.filter(_.members.contains(vertexID)).collect() +if (cluster.nonEmpty) { + cluster(0).members +} else { + null +} + } + + /** + * Find the cluster id the given vertex belongs to + * @param vertexID vertex id. + * @return the cluster id that the given vertex belongs to. If the given vertex doesn't belong to + * any cluster, return -1. + */ + def findClusterID(vertexID: Long): Long = { +val clusterIds = clusters.flatMap { cluster = + if (cluster.members.contains(vertexID)) { +Seq(cluster.id) + } else { +Seq() + } +}.collect() +if (clusterIds.nonEmpty) { + clusterIds(0) +} else { + -1 +} + } +} + +/** + * The message exchanged on the node graph + */ +private case class EdgeMessage( +similarity: Double, +availability: Double, +responsibility: Double) extends Equals { + override def canEqual(that: Any): Boolean = { +that match { + case e: EdgeMessage = +similarity == e.similarity availability == e.availability + responsibility == e.responsibility + case _ = +false +} + } +} + +/** + * The data stored in each vertex on the graph + */ +private case class VertexData(availability: Double, responsibility: Double) + +/** + * :: Experimental :: + * + * Affinity propagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://doi.org/10.1126/science.1136800 Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * @param lambda lambda parameter used in the messaging iteration loop + * @param normalization Indication of performing normalization + * @param symmetric Indication of using symmetric similarity input + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91784107 [Test build #30067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30067/consoleFull) for PR 4622 at commit [`97cef01`](https://github.com/apache/spark/commit/97cef01f1e9f1bd2fcaa2d7718302c8ae79cf25a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91795886 [Test build #30067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/30067/consoleFull) for PR 4622 at commit [`97cef01`](https://github.com/apache/spark/commit/97cef01f1e9f1bd2fcaa2d7718302c8ae79cf25a). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class JavaAffinityPropagation ` * `case class AffinityPropagationCluster(val id: Long, val exemplar: Long, val members: Array[Long])` * `class AffinityPropagationModel(` * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91795890 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30067/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91319694 @mengxr The model is now distributed. Edge messages and vertex data are defined as private case classes too. I will try to add Java examples later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91304199 Thanks @vanzin, I think it might be a problem. I decided to move to custom method for median computation. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91322337 [Test build #29956 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29956/consoleFull) for PR 4622 at commit [`d70b70a`](https://github.com/apache/spark/commit/d70b70a4b2813ac8219cfe5f5e8c4c3e41f48c4b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91342743 [Test build #29956 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29956/consoleFull) for PR 4622 at commit [`d70b70a`](https://github.com/apache/spark/commit/d70b70a4b2813ac8219cfe5f5e8c4c3e41f48c4b). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AffinityPropagationModel(` * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91337081 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29955/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91337063 [Test build #29955 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29955/consoleFull) for PR 4622 at commit [`6ad4905`](https://github.com/apache/spark/commit/6ad4905b0de5ac6ca9602ea30ce515bd75151221). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AffinityPropagationModel(` * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91299289 it has to add spark-hive as dependency of MLlib spark-hive can be disabled (in fact it has to be explicitly enabled, it's disabled by default), while mllib cannot, so I guess that is a non-starter. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91321346 [Test build #29955 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29955/consoleFull) for PR 4622 at commit [`6ad4905`](https://github.com/apache/spark/commit/6ad4905b0de5ac6ca9602ea30ce515bd75151221). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91342764 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29956/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91229215 @mengxr I have added a new function to automatically calculate the preferences used in AP algorithm. In fact, it just the median of similarities. Because I use the function `percentile_approx` in Hive to calculate the median value, it has to add spark-hive as dependency of MLlib. I don't know if this is a good idea? How do you think? If it is not, in order to provide this function to users, where it is better to put this in? Or it is better to implement a custom method to compute the median value? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90995035 [Test build #29876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29876/consoleFull) for PR 4622 at commit [`250b6a4`](https://github.com/apache/spark/commit/250b6a40206b4a540432bb340ef598bef0dd457d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90987246 [Test build #29872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29872/consoleFull) for PR 4622 at commit [`f031efd`](https://github.com/apache/spark/commit/f031efdb45dd7bd879297d4a9d7b31e9d9a8804f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90992381 [Test build #29872 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29872/consoleFull) for PR 4622 at commit [`f031efd`](https://github.com/apache/spark/commit/f031efdb45dd7bd879297d4a9d7b31e9d9a8804f). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes. * This patch **removes the following dependencies:** * `RoaringBitmap-0.4.5.jar` * `activation-1.1.jar` * `akka-actor_2.10-2.3.4-spark.jar` * `akka-remote_2.10-2.3.4-spark.jar` * `akka-slf4j_2.10-2.3.4-spark.jar` * `aopalliance-1.0.jar` * `arpack_combined_all-0.1.jar` * `avro-1.7.6.jar` * `breeze-macros_2.10-0.11.1.jar` * `breeze_2.10-0.11.1.jar` * `chill-java-0.5.0.jar` * `chill_2.10-0.5.0.jar` * `commons-beanutils-1.7.0.jar` * `commons-beanutils-core-1.8.0.jar` * `commons-cli-1.2.jar` * `commons-codec-1.10.jar` * `commons-collections-3.2.1.jar` * `commons-compress-1.4.1.jar` * `commons-configuration-1.6.jar` * `commons-digester-1.8.jar` * `commons-httpclient-3.1.jar` * `commons-io-2.1.jar` * `commons-lang-2.5.jar` * `commons-lang3-3.3.2.jar` * `commons-math-2.1.jar` * `commons-math3-3.1.1.jar` * `commons-net-2.2.jar` * `compress-lzf-1.0.0.jar` * `config-1.2.1.jar` * `core-1.1.2.jar` * `curator-client-2.4.0.jar` * `curator-framework-2.4.0.jar` * `curator-recipes-2.4.0.jar` * `gmbal-api-only-3.0.0-b023.jar` * `grizzly-framework-2.1.2.jar` * `grizzly-http-2.1.2.jar` * `grizzly-http-server-2.1.2.jar` * `grizzly-http-servlet-2.1.2.jar` * `grizzly-rcm-2.1.2.jar` * `groovy-all-2.3.7.jar` * `guava-14.0.1.jar` * `guice-3.0.jar` * `hadoop-annotations-2.2.0.jar` * `hadoop-auth-2.2.0.jar` * `hadoop-client-2.2.0.jar` * `hadoop-common-2.2.0.jar` * `hadoop-hdfs-2.2.0.jar` * `hadoop-mapreduce-client-app-2.2.0.jar` * `hadoop-mapreduce-client-common-2.2.0.jar` * `hadoop-mapreduce-client-core-2.2.0.jar` * `hadoop-mapreduce-client-jobclient-2.2.0.jar` * `hadoop-mapreduce-client-shuffle-2.2.0.jar` * `hadoop-yarn-api-2.2.0.jar` * `hadoop-yarn-client-2.2.0.jar` * `hadoop-yarn-common-2.2.0.jar` * `hadoop-yarn-server-common-2.2.0.jar` * `ivy-2.4.0.jar` * `jackson-annotations-2.4.0.jar` * `jackson-core-2.4.4.jar` * `jackson-core-asl-1.8.8.jar` * `jackson-databind-2.4.4.jar` * `jackson-jaxrs-1.8.8.jar` * `jackson-mapper-asl-1.8.8.jar` * `jackson-module-scala_2.10-2.4.4.jar` * `jackson-xc-1.8.8.jar` * `jansi-1.4.jar` * `javax.inject-1.jar` * `javax.servlet-3.0.0.v201112011016.jar` * `javax.servlet-3.1.jar` * `javax.servlet-api-3.0.1.jar` * `jaxb-api-2.2.2.jar` * `jaxb-impl-2.2.3-1.jar` * `jcl-over-slf4j-1.7.10.jar` * `jersey-client-1.9.jar` * `jersey-core-1.9.jar` * `jersey-grizzly2-1.9.jar` * `jersey-guice-1.9.jar` * `jersey-json-1.9.jar` * `jersey-server-1.9.jar` * `jersey-test-framework-core-1.9.jar` * `jersey-test-framework-grizzly2-1.9.jar` * `jets3t-0.7.1.jar` * `jettison-1.1.jar` * `jetty-util-6.1.26.jar` * `jline-0.9.94.jar` * `jline-2.10.4.jar` * `jodd-core-3.6.3.jar` * `json4s-ast_2.10-3.2.10.jar` * `json4s-core_2.10-3.2.10.jar` * `json4s-jackson_2.10-3.2.10.jar` * `jsr305-1.3.9.jar` * `jtransforms-2.4.0.jar` * `jul-to-slf4j-1.7.10.jar` * `kryo-2.21.jar` * `log4j-1.2.17.jar` * `lz4-1.2.0.jar` * `management-api-3.0.0-b012.jar` * `mesos-0.21.0-shaded-protobuf.jar` * `metrics-core-3.1.0.jar` * `metrics-graphite-3.1.0.jar` * `metrics-json-3.1.0.jar` * `metrics-jvm-3.1.0.jar` * `minlog-1.2.jar` * `netty-3.8.0.Final.jar` * `netty-all-4.0.23.Final.jar` * `objenesis-1.2.jar` * `opencsv-2.3.jar` * `oro-2.0.8.jar` * `paranamer-2.6.jar` * `parquet-column-1.6.0rc3.jar` * `parquet-common-1.6.0rc3.jar` * `parquet-encoding-1.6.0rc3.jar` * `parquet-format-2.2.0-rc1.jar` * `parquet-generator-1.6.0rc3.jar` * `parquet-hadoop-1.6.0rc3.jar` * `parquet-jackson-1.6.0rc3.jar` * `protobuf-java-2.4.1.jar` * `protobuf-java-2.5.0-spark.jar` * `py4j-0.8.2.1.jar` * `pyrolite-2.0.1.jar` * `quasiquotes_2.10-2.0.1.jar` * `reflectasm-1.07-shaded.jar` * `scala-compiler-2.10.4.jar` * `scala-library-2.10.4.jar`
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90992384 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29872/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91016629 [Test build #29876 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29876/consoleFull) for PR 4622 at commit [`250b6a4`](https://github.com/apache/spark/commit/250b6a40206b4a540432bb340ef598bef0dd457d). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AffinityPropagationModel(` * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-91016654 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29876/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r27893371 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r27896111 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90644122 [Test build #29798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29798/consoleFull) for PR 4622 at commit [`d762697`](https://github.com/apache/spark/commit/d762697e4140781b2dc8882c4d654e19ce5bb80b). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90643243 @mengxr I updated for small comments. For the bigger ones, I will update later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r27893434 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { --- End diff -- ok, would work. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90644488 [Test build #29798 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29798/consoleFull) for PR 4622 at commit [`d762697`](https://github.com/apache/spark/commit/d762697e4140781b2dc8882c4d654e19ce5bb80b). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AffinityPropagationModel(` * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90644493 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29798/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r27893348 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90652236 [Test build #29799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29799/consoleFull) for PR 4622 at commit [`68947bd`](https://github.com/apache/spark/commit/68947bd619650c5b25ef574d155a24bf97f0356a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90694626 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29799/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-90694610 [Test build #29799 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29799/consoleFull) for PR 4622 at commit [`68947bd`](https://github.com/apache/spark/commit/68947bd619650c5b25ef574d155a24bf97f0356a). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AffinityPropagationModel(` * This patch does not change any dependencies. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r27930452 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r27930450 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-89915614 @viirya Any plan to update this PR? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-89915963 @mengxr I will update this soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user viirya commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-75284420 @mengxr have we decided no to include this algorithm? If so, please let me know. Then I will close this pr and maintain it as third-party package. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-75288067 I didn't see `cartesian` in your code. So the complexity is really `O(nnz * k)` but not `O(n^2 * k)`, correct? This is the same complexity as PIC/PageRank. If this is the case, let's check the code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092347 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092363 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092359 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092309 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { --- End diff -- `Option` is not Java friendly. Would `-1` work here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092338 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092313 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing --- End diff -- `AffinityPropagation` - `Affinity propagation`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092304 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], --- End diff -- So we store the entire clustering result to driver. How large could it be? It would be nice if we can store the model distributively. In PIC, the result is stored as `RDD[(Long, Int)]`. Please also consider Java users. Both `Seq` and `Set` are Scala collections. The best way to check this is to provide a Java example. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092334 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092345 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092331 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092352 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092326 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092329 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092321 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, --- End diff -- Put backticks around `0.5`. Otherwise, ScalaDoc will only show ... lambda: 0. as the preview. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092362 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092300 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map --- End diff -- Please only import `mutable` and use `mutable.Map` and `mutable.Set` in the code. Sometimes, mixing the default `Map` with mutable `Map` causes problems that are hard to debug. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092308 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs --- End diff -- Document the behavior if the given vertex doesn't belong to any cluster. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092310 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { --- End diff -- use this style for consistency: ~~~ cluster.foreach { cluster = ... } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092320 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] --- End diff -- `(Wikipedia)` - `Affinity propagation (Wikipedia)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092316 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. --- End diff -- Use DOI link: http://doi.org/10.1126/science.1136800 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092333 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092335 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4622#discussion_r25092341 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/AffinityPropagation.scala --- @@ -0,0 +1,347 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.mutable.Map +import scala.collection.mutable.Set + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Experimental +import org.apache.spark.graphx._ +import org.apache.spark.graphx.impl.GraphImpl +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * + * Model produced by [[AffinityPropagation]]. + * + * @param clusters the vertexIDs of each cluster. + * @param exemplars the vertexIDs of all exemplars. + */ +@Experimental +class AffinityPropagationModel( +val clusters: Seq[Set[Long]], +val exemplars: Seq[Long]) extends Serializable { + + /** + * Set the number of clusters + */ + def getK(): Int = clusters.size + + /** + * Find the cluster the given vertex belongs + */ + def findCluster(vertexID: Long): Set[Long] = { +clusters.filter(_.contains(vertexID))(0) + } + + /** + * Find the cluster id the given vertex belongs + */ + def findClusterID(vertexID: Long): Option[Int] = { +var i = 0 +clusters.foreach(cluster = { + if (cluster.contains(vertexID)) { +return Some(i) + } + i += i +}) +None + } +} + +/** + * :: Experimental :: + * + * AffinityPropagation (AP), a graph clustering algorithm based on the concept of message passing + * between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not + * require the number of clusters to be determined or estimated before running it. AP is developed + * by [[http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf Frey and Dueck]]. + * + * @param maxIterations Maximum number of iterations of the AP algorithm. + * + * @see [[http://en.wikipedia.org/wiki/Affinity_propagation (Wikipedia)]] + */ +@Experimental +class AffinityPropagation private[clustering] ( +private var maxIterations: Int, +private var lambda: Double, +private var normalization: Boolean) extends Serializable { + + import org.apache.spark.mllib.clustering.AffinityPropagation._ + + /** Constructs a AP instance with default parameters: {maxIterations: 100, lambda: 0.5, + *normalization: false}. + */ + def this() = this(maxIterations = 100, lambda = 0.5, normalization = false) + + /** + * Set maximum number of iterations of the messaging iteration loop + */ + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + /** + * Get maximum number of iterations of the messaging iteration loop + */ + def getMaxIterations(): Int = { +this.maxIterations + } + + /** + * Set lambda of the messaging iteration loop + */ + def setLambda(lambda: Double): this.type = { +this.lambda = lambda +this + } + + /** + * Get lambda of the messaging iteration loop + */ + def getLambda(): Double = { +this.lambda + } + + /** + * Set whether to do normalization or not + */ + def setNormalization(normalization: Boolean): this.type = { +this.normalization = normalization +this + } + + /** + * Get whether to do normalization or not + */ + def getNormalization(): Boolean = { +this.normalization + } + + /** + * Run the AP algorithm. + * + * @param similarities an RDD of (i, j, s,,ij,,) tuples representing the similarity matrix, which
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-75297368 @viirya I made a pass on the code. Some high-level feedback: 1. What is the expected size of the model? Should we store it distributively? 2. Defining a private case class to replace `Seq(Double, Double, Double)` would make the code much easier to read. 3. Please try to add a Java example, which helps test Java API compatibility. I will go through the algorithm part again after 2). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5832][Mllib] Add Affinity Propagation c...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4622#issuecomment-74653282 [Test build #27628 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27628/consoleFull) for PR 4622 at commit [`6cddeb2`](https://github.com/apache/spark/commit/6cddeb2f655fb477d23c1fbe5bf0230e2b97bdce). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class AffinityPropagationModel(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org