subject:"\[jira\] \[Commented\] \(SPARK\-5832\) Add Affinity Propagation clustering algorithm"

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-08-01 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14650600#comment-14650600
 ] 

Joseph K. Bradley commented on SPARK-5832:
--

It sounds like we need some more discussion here, so I'm going to push the 
target version to 1.6

 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh
  Labels: clustering





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-20 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329270#comment-14329270
 ] 

Xiangrui Meng commented on SPARK-5832:
--

Maybe I understood the complexity wrongly. So it is O(nnz * k), where nnz is 
the number of nonzeros in the similarity matrix and k is the number of 
iterations. Is it correct? If that's the case, this is the same complexity as 
the power iteration clustering we just added, and I'm okay to include it in 
MLlib. (If an algorithm only works well on a single machine, we tend to not 
include it in MLlib just for feature complete. Users can pick a single-machine 
implementation and use it on a single worker.)

I will make a pass on your PR and it would be great if you can provide some 
performance results.

 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-20 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14329284#comment-14329284
 ] 

Liang-Chi Hsieh commented on SPARK-5832:


The time complexity O(nnz * K) is just for sparse directed graph and in the 
graph every pair of distinct vertices is only connected by a unique edge. So it 
has constraint. If the directed graph is sparse, but every pair of distinct 
vertices is still connected by a pair of unique edges, the time complexity is 
O(nnz^2 * K).


 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-18 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326323#comment-14326323
 ] 

Xiangrui Meng commented on SPARK-5832:
--

[~viirya] Thanks for sharing the details! I'm a little worried about the `N^2` 
complexity. Most, if not all, algorithms in MLlib have linear dependency on the 
number of instances. Having N^2 complexity means that if a single machine can 
solve a problem of 100,000 instances, we need a 100-node cluster to solve a 
problem of 1,000,000 instances, not counting the communication overhead. We had 
some discussion about k-medoids (SPARK-5056) and DBSCAN (SPARK-5226), both of 
which have N^2 complexity. I'm not sure whether we should include those 
algorithms in MLlib. For now, it might be good if we can maintain the 
implementation outside Spark as a third-party package 
(http://spark-packages.org/). 

 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-18 Thread Liang-Chi Hsieh (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14326951#comment-14326951
]

Liang-Chi Hsieh commented on SPARK-5832:

For clustering algorithm, O(N^2) complexity is not rare. So I suggest maybe we
can consider clustering algorithm more with its usability, scalability, feature
in addition to complexity?

One of the feature of AP is that it does not require users to specify the
number of clusters. As I know, we have no clustering algorithms in MLlib with
similar feature?

O(N^2) time/memory complexity is the worst case for AP. The worst case means
that we are going to cluster on a complete digraph (directed graph, every pair
of distinct vertices is connected by a pair of unique edges) or a complete
graph (undirected graph, every pair of distinct vertices is connected by a
unique edge). As we know, that is very rare for practical use case. At most
cases, the graph we process would be sparse graph.

Unlike K-means and DBSCAN, AP is not a metric-space clustering techniques. For
K-means and DBSCAN, they do not distinguish between dense and sparse graph.
Because they need to calculate the distances between the possible cluster
centers and all other nodes. But for AP, since the similarity graph is given, a
sparse graph can significantly reduce the time/memory complexity required to
perform clustering.

I have no strong opinion towards whether we should include AP in MLlib. As
Xiangrui Meng suggested, it is also a good idea to just maintain the
implementation outside Spark as a third-party package.

Add Affinity Propagation clustering algorithm
-

Key: SPARK-5832
URL: https://issues.apache.org/jira/browse/SPARK-5832
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-17 Thread Liang-Chi Hsieh (JIRA)

[
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14323983#comment-14323983
]

Liang-Chi Hsieh commented on SPARK-5832:

I would like to add the Affinity Propagation clustering algorithm to Spark
MLlib. As I know, Spark MLlib currently has no similar algorithm.

For clustering on large datasets, it is hard to determine the number of
clusters in advance. One of the feature of AP is that it doesn't require to set
the number of clusters. Such graph clustering algorithms generally suffer from
high time complexity so standalone implementation is usually only suitable for
small or medium sized datasets. But considering its message updating mechanism,
AP is very suitable to be implemented in a distributed approach. Distributed AP
can be applied to larger datasets.

Introduction

Affinity Propagation (AP), a graph clustering algorithm based on the concept of
message passing between data points. Unlike clustering algorithms such as
k-means or k-medoids, AP does not require the number of clusters to be
determined or estimated before running it. AP is developed by Frey and Dueck.
Please refer to the paper
(http://www.psi.toronto.edu/affinitypropagation/FreyDueckScience07.pdf).

Complexity and scalability

A Python version of AP is part of the scikit-learn library. You may also refer
to the description in scikit-learn
(http://scikit-learn.org/stable/modules/clustering.html). As stated in the
scikit-learn page, AP's time complexity is O(N^2T) where N is the number of
nodes in the graph and T is the number of iterations until convergence. The
memory complexity is O(N^2) if a dense similarity matrix is used, but reducible
if a sparse similarity matrix is used.

The scikit-learn page claims that the main drawback of Affinity Propagation is
its complexity. Thus, this makes Affinity Propagation most appropriate for
small to medium sized datasets. However, since AP can be easily implemented in
a distributed approach [1][2]. A distributed version of Affinity Propagation
can significantly improve its efficiency on larger datasets.

Usage

It is showed that AP is better for certain computer vision and computational
biology tasks.

Alternatives

MCL algorithm (http://micans.org/mcl/) is clustering algorithm for graphs. It
is mostly applied in bioinformatics. Similar with AP, MCL doesn't require the
number of clusters to be determined in advance. Both they can be adaptable to
different contexts.

For a straigtforward implementation of MCL, its time and space complexity are
respectively O(N^3) and O(N^2). The optimization implementation can reduce time
complexity to O(NK^2) where K is the number of resources allocated per node.
Besides, the number of iterations required by MCL is not included in the time
complexity. So generally speaking, the time complexity of MCL is higher than AP.

[1] Rose, D.M., et al., Parallel Hierarchical Affinity Propagation with
MapReduce, IC2E '14
[2] W. Wang, et al., Large scale of e-learning resources clustering with
parallel aﬃnity propagation, in: International Conference on Hybrid Learning,
2008.
[3] Minchao Wang, et al., Parallel Clustering Algorithm for Large-Scale
Biological Data Sets, PLoS One, April 4, 2014.

Add Affinity Propagation clustering algorithm
-

Key: SPARK-5832
URL: https://issues.apache.org/jira/browse/SPARK-5832
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: Liang-Chi Hsieh
Assignee: Liang-Chi Hsieh

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

2015-02-16 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-5832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14322528#comment-14322528
 ] 

Apache Spark commented on SPARK-5832:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/4622

 Add Affinity Propagation clustering algorithm
 -

 Key: SPARK-5832
 URL: https://issues.apache.org/jira/browse/SPARK-5832
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Liang-Chi Hsieh





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

[jira] [Commented] (SPARK-5832) Add Affinity Propagation clustering algorithm

7 matches

Site Navigation

Mail list logo

Footer information