[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-08-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14650600#comment-14650600
 ] 

Joseph K. Bradley commented on SPARK-5832:
--

It sounds like we need some more discussion here, so I'm going to push the 
target version to 1.6

 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
  Labels: clustering





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329270#comment-14329270
 ] 

Xiangrui Meng commented on SPARK-5832:
--

Maybe I understood the complexity wrongly. So it is O(nnz * k), where nnz is 
the number of nonzeros in the similarity matrix and k is the number of 
iterations. Is it correct? If that's the case, this is the same complexity as 
the power iteration clustering we just added, and I'm okay to include it in 
MLlib. (If an algorithm only works well on a single machine, we tend to not 
include it in MLlib just for feature complete. Users can pick a single-machine 
implementation and use it on a single worker.)

I will make a pass on your PR and it would be great if you can provide some 
performance results.

 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-20 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329284#comment-14329284
 ] 

Liang-Chi Hsieh commented on SPARK-5832:


The time complexity O(nnz * K) is just for sparse directed graph and in the 
graph every pair of distinct vertices is only connected by a unique edge. So it 
has constraint. If the directed graph is sparse, but every pair of distinct 
vertices is still connected by a pair of unique edges, the time complexity is 
O(nnz^2 * K).


 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326323#comment-14326323
 ] 

Xiangrui Meng commented on SPARK-5832:
--

[~viirya] Thanks for sharing the details! I'm a little worried about the `N^2` 
complexity. Most, if not all, algorithms in MLlib have linear dependency on the 
number of instances. Having N^2 complexity means that if a single machine can 
solve a problem of 100,000 instances, we need a 100-node cluster to solve a 
problem of 1,000,000 instances, not counting the communication overhead. We had 
some discussion about k-medoids (SPARK-5056) and DBSCAN (SPARK-5226), both of 
which have N^2 complexity. I'm not sure whether we should include those 
algorithms in MLlib. For now, it might be good if we can maintain the 
implementation outside Spark as a third-party package 
(http://spark-packages.org/). 

 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-18 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326951#comment-14326951
 ] 

Liang-Chi Hsieh commented on SPARK-5832:


For clustering algorithm, O(N^2) complexity is not rare. So I suggest maybe we 
can consider clustering algorithm more with its usability, scalability, feature 
in addition to complexity?

One of the feature of AP is that it does not require users to specify the 
number of clusters. As I know, we have no clustering algorithms in MLlib with 
similar feature?

O(N^2) time/memory complexity is the worst case for AP. The worst case means 
that we are going to cluster on a complete digraph (directed graph, every pair 
of distinct vertices is connected by a pair of unique edges) or a complete 
graph (undirected graph, every pair of distinct vertices is connected by a 
unique edge). As we know, that is very rare for practical use case. At most 
cases, the graph we process would be sparse graph.

Unlike K-means and DBSCAN, AP is not a metric-space clustering techniques. For 
K-means and DBSCAN, they do not distinguish between dense and sparse graph. 
Because they need to calculate the distances between the possible cluster 
centers and all other nodes. But for AP, since the similarity graph is given, a 
sparse graph can significantly reduce the time/memory complexity required to 
perform clustering.

I have no strong opinion towards whether we should include AP in MLlib. As 
Xiangrui Meng suggested, it is also a good idea to just maintain the 
implementation outside Spark as a third-party package.



 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323983#comment-14323983
 ] 

Liang-Chi Hsieh commented on SPARK-5832:



I would like to add the Affinity Propagation clustering algorithm to Spark 
MLlib. As I know, Spark MLlib currently has no similar algorithm.

For clustering on large datasets, it is hard to determine the number of 
clusters in advance. One of the feature of AP is that it doesn't require to set 
the number of clusters. Such graph clustering algorithms generally suffer from 
high time complexity so standalone implementation is usually only suitable for 
small or medium sized datasets. But considering its message updating mechanism, 
AP is very suitable to be implemented in a distributed approach. Distributed AP 
can be applied to larger datasets.

Introduction

Affinity Propagation (AP), a graph clustering algorithm based on the concept of 
message passing between data points. Unlike clustering algorithms such as 
k-means or k-medoids, AP does not require the number of clusters to be 
determined or estimated before running it. AP is developed by Frey and Dueck. 
Please refer to the paper 
(http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf).

Complexity and scalability

A Python version of AP is part of the scikit-learn library. You may also refer 
to the description in scikit-learn 
(http://scikit-learn.org/stable/modules/clustering.html). As stated in the 
scikit-learn page, AP's time complexity is O(N^2T) where N is the number of 
nodes in the graph and T is the number of iterations until convergence. The 
memory complexity is O(N^2) if a dense similarity matrix is used, but reducible 
if a sparse similarity matrix is used.

The scikit-learn page claims that the main drawback of Affinity Propagation is 
its complexity. Thus, this makes Affinity Propagation most appropriate for 
small to medium sized datasets. However, since AP can be easily implemented in 
a distributed approach [1][2]. A distributed version of Affinity Propagation 
can significantly improve its efficiency on larger datasets.

Usage

It is showed that AP is better for certain computer vision and computational 
biology tasks.

Alternatives

MCL algorithm (http://micans.org/mcl/) is clustering algorithm for graphs. It 
is mostly applied in bioinformatics. Similar with AP, MCL doesn't require the 
number of clusters to be determined in advance. Both they can be adaptable to 
different contexts.

For a straigtforward implementation of MCL, its time and space complexity are 
respectively O(N^3) and O(N^2). The optimization implementation can reduce time 
complexity to O(NK^2) where K is the number of resources allocated per node. 
Besides, the number of iterations required by MCL is not included in the time 
complexity. So generally speaking, the time complexity of MCL is higher than AP.


[1] Rose, D.M., et al., Parallel Hierarchical Affinity Propagation with 
MapReduce, IC2E '14
[2] W. Wang, et al., Large scale of e-learning resources clustering with 
parallel affinity propagation, in: International Conference on Hybrid Learning, 
2008.
[3] Minchao Wang, et al., Parallel Clustering Algorithm for Large-Scale 
Biological Data Sets, PLoS One, April 4, 2014.








 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322528#comment-14322528
 ] 

Apache Spark commented on SPARK-5832:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4622

 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org