[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm
[ https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14650600#comment-14650600 ] Joseph K. Bradley commented on SPARK-5832: -- It sounds like we need some more discussion here, so I'm going to push the target version to 1.6 Add Affinity Propagation clustering algorithm - Key: SPARK-5832 URL: https://issues.apache.org/jira/browse/SPARK-5832 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Labels: clustering -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm
[ https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329270#comment-14329270 ] Xiangrui Meng commented on SPARK-5832: -- Maybe I understood the complexity wrongly. So it is O(nnz * k), where nnz is the number of nonzeros in the similarity matrix and k is the number of iterations. Is it correct? If that's the case, this is the same complexity as the power iteration clustering we just added, and I'm okay to include it in MLlib. (If an algorithm only works well on a single machine, we tend to not include it in MLlib just for feature complete. Users can pick a single-machine implementation and use it on a single worker.) I will make a pass on your PR and it would be great if you can provide some performance results. Add Affinity Propagation clustering algorithm - Key: SPARK-5832 URL: https://issues.apache.org/jira/browse/SPARK-5832 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm
[ https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329284#comment-14329284 ] Liang-Chi Hsieh commented on SPARK-5832: The time complexity O(nnz * K) is just for sparse directed graph and in the graph every pair of distinct vertices is only connected by a unique edge. So it has constraint. If the directed graph is sparse, but every pair of distinct vertices is still connected by a pair of unique edges, the time complexity is O(nnz^2 * K). Add Affinity Propagation clustering algorithm - Key: SPARK-5832 URL: https://issues.apache.org/jira/browse/SPARK-5832 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm
[ https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326323#comment-14326323 ] Xiangrui Meng commented on SPARK-5832: -- [~viirya] Thanks for sharing the details! I'm a little worried about the `N^2` complexity. Most, if not all, algorithms in MLlib have linear dependency on the number of instances. Having N^2 complexity means that if a single machine can solve a problem of 100,000 instances, we need a 100-node cluster to solve a problem of 1,000,000 instances, not counting the communication overhead. We had some discussion about k-medoids (SPARK-5056) and DBSCAN (SPARK-5226), both of which have N^2 complexity. I'm not sure whether we should include those algorithms in MLlib. For now, it might be good if we can maintain the implementation outside Spark as a third-party package (http://spark-packages.org/). Add Affinity Propagation clustering algorithm - Key: SPARK-5832 URL: https://issues.apache.org/jira/browse/SPARK-5832 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm
[ https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326951#comment-14326951 ] Liang-Chi Hsieh commented on SPARK-5832: For clustering algorithm, O(N^2) complexity is not rare. So I suggest maybe we can consider clustering algorithm more with its usability, scalability, feature in addition to complexity? One of the feature of AP is that it does not require users to specify the number of clusters. As I know, we have no clustering algorithms in MLlib with similar feature? O(N^2) time/memory complexity is the worst case for AP. The worst case means that we are going to cluster on a complete digraph (directed graph, every pair of distinct vertices is connected by a pair of unique edges) or a complete graph (undirected graph, every pair of distinct vertices is connected by a unique edge). As we know, that is very rare for practical use case. At most cases, the graph we process would be sparse graph. Unlike K-means and DBSCAN, AP is not a metric-space clustering techniques. For K-means and DBSCAN, they do not distinguish between dense and sparse graph. Because they need to calculate the distances between the possible cluster centers and all other nodes. But for AP, since the similarity graph is given, a sparse graph can significantly reduce the time/memory complexity required to perform clustering. I have no strong opinion towards whether we should include AP in MLlib. As Xiangrui Meng suggested, it is also a good idea to just maintain the implementation outside Spark as a third-party package. Add Affinity Propagation clustering algorithm - Key: SPARK-5832 URL: https://issues.apache.org/jira/browse/SPARK-5832 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm
[ https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323983#comment-14323983 ] Liang-Chi Hsieh commented on SPARK-5832: I would like to add the Affinity Propagation clustering algorithm to Spark MLlib. As I know, Spark MLlib currently has no similar algorithm. For clustering on large datasets, it is hard to determine the number of clusters in advance. One of the feature of AP is that it doesn't require to set the number of clusters. Such graph clustering algorithms generally suffer from high time complexity so standalone implementation is usually only suitable for small or medium sized datasets. But considering its message updating mechanism, AP is very suitable to be implemented in a distributed approach. Distributed AP can be applied to larger datasets. Introduction Affinity Propagation (AP), a graph clustering algorithm based on the concept of message passing between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running it. AP is developed by Frey and Dueck. Please refer to the paper (http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf). Complexity and scalability A Python version of AP is part of the scikit-learn library. You may also refer to the description in scikit-learn (http://scikit-learn.org/stable/modules/clustering.html). As stated in the scikit-learn page, AP's time complexity is O(N^2T) where N is the number of nodes in the graph and T is the number of iterations until convergence. The memory complexity is O(N^2) if a dense similarity matrix is used, but reducible if a sparse similarity matrix is used. The scikit-learn page claims that the main drawback of Affinity Propagation is its complexity. Thus, this makes Affinity Propagation most appropriate for small to medium sized datasets. However, since AP can be easily implemented in a distributed approach [1][2]. A distributed version of Affinity Propagation can significantly improve its efficiency on larger datasets. Usage It is showed that AP is better for certain computer vision and computational biology tasks. Alternatives MCL algorithm (http://micans.org/mcl/) is clustering algorithm for graphs. It is mostly applied in bioinformatics. Similar with AP, MCL doesn't require the number of clusters to be determined in advance. Both they can be adaptable to different contexts. For a straigtforward implementation of MCL, its time and space complexity are respectively O(N^3) and O(N^2). The optimization implementation can reduce time complexity to O(NK^2) where K is the number of resources allocated per node. Besides, the number of iterations required by MCL is not included in the time complexity. So generally speaking, the time complexity of MCL is higher than AP. [1] Rose, D.M., et al., Parallel Hierarchical Affinity Propagation with MapReduce, IC2E '14 [2] W. Wang, et al., Large scale of e-learning resources clustering with parallel affinity propagation, in: International Conference on Hybrid Learning, 2008. [3] Minchao Wang, et al., Parallel Clustering Algorithm for Large-Scale Biological Data Sets, PLoS One, April 4, 2014. Add Affinity Propagation clustering algorithm - Key: SPARK-5832 URL: https://issues.apache.org/jira/browse/SPARK-5832 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm
[ https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322528#comment-14322528 ] Apache Spark commented on SPARK-5832: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/4622 Add Affinity Propagation clustering algorithm - Key: SPARK-5832 URL: https://issues.apache.org/jira/browse/SPARK-5832 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liang-Chi Hsieh -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org