[jira] [Created] (SPARK-8497) Graph Clique (Complete Connected Sub-graph) Discovery Algorithm

2015-06-19 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-8497:


 Summary: Graph Clique (Complete Connected Sub-graph) Discovery 
Algorithm
 Key: SPARK-8497
 URL: https://issues.apache.org/jira/browse/SPARK-8497
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, ML, MLlib, Spark Core
Reporter: Fan Jiang


In recent years, the social network industry has had high demand for Complete 
Connected Sub-Graph discovery, and so has telecom. Much like the connection 
graph from Twitter, the calls and other activities of the telecom world form a 
huge social graph, and because of the nature of the communication medium it 
captures the strongest inter-person relationships; graph-based analysis will 
reveal tremendous value from telecom connections. 

We need an algorithm in Spark that finds ALL the strongest completely 
connected sub-graphs (so-called cliques here) for EVERY person in the network, 
which will be one of the starting points for understanding users' social 
behaviour. 

At Huawei, we have many real-world use cases that involve telecom social 
graphs of tens of billions of edges and hundreds of millions of vertices, and 
the number of cliques will also be in the tens of millions. The graph changes 
fast, which means we need to analyse the graph pattern very often (one result 
per day/week over a moving time window that spans multiple months). 
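
As an illustration of the kind of computation being requested, below is a 
minimal sketch of the classic Bron-Kerbosch recursion, which enumerates all 
maximal cliques of an in-memory adjacency map (the names are hypothetical and 
this is not a proposed API; a GraphX implementation would distribute the 
neighbour sets, but the core recursion is the same):

object CliqueSketch {
  type V = Long

  // R: current clique, P: candidates that can extend R, X: already-explored
  // vertices (used to suppress non-maximal cliques).
  def bronKerbosch(r: Set[V], p: Set[V], x: Set[V], adj: Map[V, Set[V]],
                   out: collection.mutable.Buffer[Set[V]]): Unit = {
    if (p.isEmpty && x.isEmpty) { out += r; return }  // R is a maximal clique
    var pv = p
    var xv = x
    for (v <- p) {
      val nv = adj.getOrElse(v, Set.empty[V])
      bronKerbosch(r + v, pv intersect nv, xv intersect nv, adj, out)
      pv -= v  // v is fully explored ...
      xv += v  // ... so exclude it from deeper branches
    }
  }

  def main(args: Array[String]): Unit = {
    // Toy graph: vertices 1, 2, 3 form a triangle; vertex 4 hangs off vertex 3.
    val edges = Seq((1L, 2L), (2L, 3L), (1L, 3L), (3L, 4L))
    val adj = edges.flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1).map { case (v, nbrs) => v -> nbrs.map(_._2).toSet }
    val cliques = collection.mutable.Buffer.empty[Set[V]]
    bronKerbosch(Set.empty[V], adj.keySet, Set.empty[V], adj, cliques)
    cliques.foreach(println)  // Set(1, 2, 3) and Set(3, 4)
  }
}

Note that exact maximal-clique enumeration is worst-case exponential, so at 
the scale described above some pruning (e.g. per-vertex neighbourhood 
extraction plus a pivoting variant of Bron-Kerbosch) would be needed.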






[jira] [Issue Comment Deleted] (SPARK-3530) Pipeline and Parameters

2015-04-20 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3530:
-
Comment: was deleted

(was: Hi Xiangrui,

Which part of this pipeline project would you like us to work on? 

Thanks!


)

 Pipeline and Parameters
 ---

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.2.0


 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!






[jira] [Commented] (SPARK-3530) Pipeline and Parameters

2015-04-19 Thread Fan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502364#comment-14502364
 ] 

Fan Jiang commented on SPARK-3530:
--

Hi Xiangrui,

Which part of this pipeline project would you like us to work on? 

Thanks!




 Pipeline and Parameters
 ---

 Key: SPARK-3530
 URL: https://issues.apache.org/jira/browse/SPARK-3530
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.2.0


 This part of the design doc is for pipelines and parameters. I put the design 
 doc at
 https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing
 I will copy the proposed interfaces to this JIRA later. Some sample code can 
 be viewed at: https://github.com/mengxr/spark-ml/
 Please help review the design and post your comments here. Thanks!






[jira] [Created] (SPARK-6417) Add Linear Programming algorithm

2015-03-19 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-6417:


 Summary: Add Linear Programming algorithm 
 Key: SPARK-6417
 URL: https://issues.apache.org/jira/browse/SPARK-6417
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang


Linear programming is the problem of finding a vector x that minimizes a linear 
function f^T x subject to linear constraints:

    min_x f^T x

such that one or more of the following hold: A·x ≤ b, Aeq·x = beq, l ≤ x ≤ u.
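
As a toy illustration of this formulation (hypothetical names, not a proposed 
API): a bounded two-variable LP attains its optimum at a vertex of the 
feasible polygon, so a sketch can simply enumerate intersections of constraint 
boundaries and keep the feasible vertex with the lowest objective f^T x. A 
real implementation would use a simplex or interior-point method instead.

object TinyLP {
  def main(args: Array[String]): Unit = {
    // min -x - 2y  subject to  x + y <= 4,  x <= 3,  y <= 3,  x >= 0,  y >= 0
    val A = Array(Array(1.0, 1.0), Array(1.0, 0.0), Array(0.0, 1.0),
                  Array(-1.0, 0.0), Array(0.0, -1.0))
    val b = Array(4.0, 3.0, 3.0, 0.0, 0.0)
    val f = Array(-1.0, -2.0)

    def feasible(x: Double, y: Double): Boolean =
      A.indices.forall(i => A(i)(0) * x + A(i)(1) * y <= b(i) + 1e-9)

    // Intersect every pair of constraint boundaries via Cramer's rule.
    val vertices = for {
      i <- A.indices
      j <- A.indices
      if i < j
      det = A(i)(0) * A(j)(1) - A(i)(1) * A(j)(0)
      if math.abs(det) > 1e-12
      x = (b(i) * A(j)(1) - A(i)(1) * b(j)) / det
      y = (A(i)(0) * b(j) - b(i) * A(j)(0)) / det
      if feasible(x, y)
    } yield (x, y)

    val (bx, by) = vertices.minBy { case (x, y) => f(0) * x + f(1) * y }
    println(s"optimum at ($bx, $by), objective ${f(0) * bx + f(1) * by}")  // (1.0, 3.0), -7.0
  }
}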






[jira] [Updated] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function

2015-01-28 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

Power iteration clustering is a scalable and efficient algorithm for clustering 
points given pointwise mutual affinity values. Internally the algorithm:

• computes the Gaussian distance between all pairs of points and represents 
these distances in an affinity matrix
• calculates a normalized affinity matrix
• calculates the principal eigenvalue and eigenvector
• clusters each of the input points according to its principal eigenvector 
component value

Details of this algorithm can be found in [Power Iteration Clustering, Lin and 
Cohen] (www.icml2010.org/papers/387.pdf)


  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



Summary: Add Power Iteration Clustering Algorithm with Gaussian 
Similarity Function  (was: Add Spectral Clustering Algorithm with Gaussian 
Similarity Function)

 Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
 --

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 Power iteration clustering is a scalable and efficient algorithm for 
 clustering points given pointwise mutual affinity values. Internally the 
 algorithm:
 • computes the Gaussian distance between all pairs of points and represents 
 these distances in an affinity matrix
 • calculates a normalized affinity matrix
 • calculates the principal eigenvalue and eigenvector
 • clusters each of the input points according to its principal eigenvector 
 component value
 Details of this algorithm can be found in [Power Iteration Clustering, Lin 
 and Cohen] (www.icml2010.org/papers/387.pdf)
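
A minimal single-machine sketch of the steps above (illustrative only; the 
names are hypothetical, not the proposed MLlib API). It builds a Gaussian 
affinity matrix A, row-normalizes it to W = D^-1 A, and runs a few power 
iteration steps v <- W v. Because W is row-stochastic its principal 
eigenvector is constant, so power iteration clustering deliberately stops 
early: the intermediate iterate becomes near-constant within each cluster 
while clusters keep distinct values, and those components are then clustered 
(e.g. with k-means):

object PICSketch {
  def main(args: Array[String]): Unit = {
    val points = Array(
      Array(0.0, 0.0), Array(0.3, 0.0), Array(0.0, 0.3),  // tight cluster
      Array(5.0, 5.0), Array(6.0, 5.0), Array(5.0, 6.0))  // looser cluster
    val n = points.length
    val sigma = 1.0

    // Gaussian affinity with zero diagonal.
    val a = Array.tabulate(n, n) { (i, j) =>
      if (i == j) 0.0
      else {
        val d2 = points(i).zip(points(j)).map { case (p, q) => (p - q) * (p - q) }.sum
        math.exp(-d2 / (2 * sigma * sigma))
      }
    }
    val degrees = a.map(_.sum)
    val w = a.zip(degrees).map { case (row, deg) => row.map(_ / deg) }  // W = D^-1 A

    var v = degrees.map(_ / degrees.sum)  // degree-based start, as in Lin and Cohen
    for (_ <- 1 to 30) {                  // early stopping: few iterations only
      val next = w.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
      v = next.map(_ / next.sum)
    }
    println(v.mkString(", "))  // two plateaus of values, one per cluster
  }
}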






[jira] [Commented] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-12-12 Thread Fan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243900#comment-14243900
 ] 

Fan Jiang commented on SPARK-4510:
--

[~mengxr]

1)
The PAM K-Medoids complexity is:

O( (1 + Beta) * K^2 * (N - K)^2 )

where K = number of medoids,
N = number of data points,
Beta = mean number of successful swaps of data points with existing medoids.

This is basically a factor of N greater than K-Means, due to recalculation of 
the costs as part of considering every swap. This characteristic of K-Medoids 
has been widely discussed in academia, and there have been some comparisons 
(vs K-Means) and possible improvement suggestions made. It seems to be a 
tradeoff between performance and complexity when choosing between these two 
algorithms.

2)
Regarding the performance: 
I added a dataset as an example (data/mllib/sample_pam_data.txt). This was 
found in a paper comparing K-Means and K-Medoids. You can plot the data points 
to see the clusters. If a user runs 
examples/mllib/DenseKMeans.scala
examples/mllib/KMedoids.scala

on this dataset with K = 3, the correct clustering can be identified by the 
results with the smaller cost. K-Medoids is more robust in terms of 
correctness.


The papers we referred to are:

Article on K-Medoids complexity:
http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900

Article on efficient K-Medoids (Springer):
http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1

Clustering for Large Applications (CLARA, used in one of the medoid papers):
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA


 Add k-medoids Partitioning Around Medoids (PAM) algorithm
 -

 Key: SPARK-4510
 URL: https://issues.apache.org/jira/browse/SPARK-4510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features
   Original Estimate: 0h
  Remaining Estimate: 0h

 PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
 because it minimizes a sum of pairwise dissimilarities instead of a sum of 
 squared Euclidean distances. A medoid can be defined as the object of a 
 cluster whose average dissimilarity to all the objects in the cluster is 
 minimal, i.e. it is the most centrally located point in the cluster.
 The most common realisation of k-medoid clustering is the Partitioning Around 
 Medoids (PAM) algorithm, which is as follows:
 1. Initialize: randomly select (without replacement) k of the n data points 
 as the medoids.
 2. Associate each data point to the closest medoid. ("Closest" here is 
 defined using any valid distance metric, most commonly Euclidean distance, 
 Manhattan distance or Minkowski distance.)
 3. For each medoid m and each non-medoid data point o, swap m and o and 
 compute the total cost of the configuration.
 4. Select the configuration with the lowest cost.
 5. Repeat steps 2 to 4 until there is no change in the medoids.
 The new feature for MLlib will contain 5 new files:
 /main/scala/org/apache/spark/mllib/clustering/PAM.scala
 /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
 /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
 /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
 /main/scala/org/apache/spark/examples/mllib/KMedoids.scala
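
A minimal single-machine sketch of the PAM loop above (illustrative only; the 
names are hypothetical and this is not the proposed MLlib code). Step 1 picks 
k random medoids; each sweep then evaluates every medoid/non-medoid swap 
(steps 2-3), keeps the cheapest configuration (step 4), and repeats until no 
swap improves the cost (step 5):

import scala.util.Random

object PAMSketch {
  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =  // Euclidean; any valid metric works
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Total cost: sum of distances from every point to its closest medoid.
  def cost(points: Seq[Point], medoids: Seq[Point]): Double =
    points.map(p => medoids.map(dist(p, _)).min).sum

  def pam(points: Seq[Point], k: Int, seed: Long = 42): Seq[Point] = {
    var medoids = new Random(seed).shuffle(points).take(k)  // step 1
    var improved = true
    while (improved) {
      improved = false
      val candidates = for {
        m <- medoids
        o <- points if !medoids.contains(o)
      } yield medoids.filterNot(_ eq m) :+ o                // swap m and o
      val best = candidates.minBy(cost(points, _))
      if (cost(points, best) < cost(points, medoids)) {
        medoids = best                                      // keep lowest cost
        improved = true
      }
    }
    medoids
  }

  def main(args: Array[String]): Unit = {
    val pts = Seq(Array(0.0, 0.0), Array(0.2, 0.1), Array(5.0, 5.0), Array(5.1, 4.9))
    pam(pts, k = 2).foreach(m => println(m.mkString("(", ", ", ")")))
  }
}

The sketch recomputes the full cost per candidate swap, which is exactly the 
O(K^2 (N - K)^2)-style behaviour discussed in the comment above.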






[jira] [Comment Edited] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-12-12 Thread Fan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243900#comment-14243900
 ] 

Fan Jiang edited comment on SPARK-4510 at 12/12/14 9:19 AM:


[~mengxr]

1)
The PAM K-Medoids complexity is:

O( (1 + Beta) * K^2 * (N - K)^2 )

where K = number of medoids,
N = number of data points,
Beta = mean number of successful swaps of data points with existing medoids.

This is basically a factor of N greater than K-Means, due to recalculation of 
the costs as part of considering every swap. This characteristic of K-Medoids 
has been widely discussed in academia, and there have been some comparisons 
(vs K-Means) and possible improvement suggestions made. It seems to be a 
tradeoff between performance and complexity when choosing between these two 
algorithms.

2)
Regarding the performance: 
I added a dataset as an example (data/mllib/sample_pam_data.txt). This was 
found in a paper comparing K-Means and K-Medoids. You can plot the data points 
to see the clusters. If a user runs 
examples/mllib/DenseKMeans.scala
examples/mllib/KMedoids.scala

on this dataset with K = 3, the correct clustering can be identified by the 
results with the smaller cost. K-Medoids is more robust in terms of 
correctness.


The papers we referred to are:

Article on K-Medoids complexity:
http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900

Article on efficient K-Medoids (Springer):
http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1

Clustering for Large Applications (CLARA, used in one of the medoid papers):
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA



was (Author: fjiang6):
[~mengxr]

1)
The PAM K-Medoids complexity is:

O( (1 + Beta) * K^2 * (N - K)^2 )

where K = number of medoids,
N = number of data points,
Beta = mean number of successful swaps of data points with existing medoids.

This is basically a factor of N greater than K-Means, due to recalculation of 
the costs as part of considering every swap. This characteristic of K-Medoids 
has been widely discussed in academia, and there have been some comparisons 
(vs K-Means) and possible improvement suggestions made. It seems to be a 
tradeoff between performance and complexity when choosing between these two 
algorithms.

2)
Regarding the performance: 
I added a dataset as an example (data/mllib/sample_pam_data.txt). This was 
found in a paper comparing K-Means and K-Medoids. You can plot the data points 
to see the clusters. If a user runs 
examples/mllib/DenseKMeans.scala
examples/mllib/KMedoids.scala

on this dataset with K = 3, the correct clustering can be identified by the 
results with the smaller cost. K-Medoids is more robust in terms of 
correctness.


The papers we referred to are:

Article on K-Medoids complexity:
http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900

Article on efficient K-Medoids (Springer):
http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1

Clustering for Large Applications (CLARA, used in one of the medoid papers):
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA


 Add k-medoids Partitioning Around Medoids (PAM) algorithm
 -

 Key: SPARK-4510
 URL: https://issues.apache.org/jira/browse/SPARK-4510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features
   Original Estimate: 0h
  Remaining Estimate: 0h

 PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
 because it minimizes a sum of pairwise dissimilarities instead of a sum of 
 squared Euclidean distances. A medoid can be defined as the object of a 
 cluster whose average dissimilarity to all the objects in the cluster is 
 minimal, i.e. it is the most centrally located point in the cluster.
 The most common realisation of k-medoid clustering is the Partitioning Around 
 Medoids (PAM) algorithm, which is as follows:
 1. Initialize: randomly select (without replacement) k of the n data points 
 as the medoids.
 2. Associate each data point to the closest medoid. ("Closest" here is 
 defined using any valid distance metric, most commonly Euclidean distance, 
 Manhattan distance or Minkowski distance.)
 3. For each medoid m and each non-medoid data point o, swap m and o and 
 compute the total cost of the configuration.
 4. Select the configuration with the lowest cost.
 5. Repeat steps 2 to 4 until there is no change in the medoids.

[jira] [Comment Edited] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-12-12 Thread Fan Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243900#comment-14243900
 ] 

Fan Jiang edited comment on SPARK-4510 at 12/12/14 9:20 AM:


[~mengxr]

1)
The PAM K-Medoids complexity is:

O( (1 + Beta) * K^2 * (N - K)^2 )

where K = number of medoids,
N = number of data points,
Beta = mean number of successful swaps of data points with existing medoids.

This is basically a factor of N greater than K-Means, due to recalculation of 
the costs as part of considering every swap. This characteristic of K-Medoids 
has been widely discussed in academia, and there have been some comparisons 
(vs K-Means) and possible improvement suggestions made. It seems to be a 
tradeoff between performance and complexity when choosing between these two 
algorithms.

2)
Regarding the performance: 
I added a dataset as an example (data/mllib/sample_pam_data.txt). This was 
found in a paper comparing K-Means and K-Medoids. You can plot the data points 
to see the clusters. If a user runs 
examples/mllib/DenseKMeans.scala
examples/mllib/KMedoids.scala

on this dataset with K = 3, the correct clustering can be identified by the 
results with the smaller cost. K-Medoids is more robust in terms of 
correctness.


The papers we referred to are:

Article on K-Medoids complexity:
http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900

Article on efficient K-Medoids (Springer):
http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1

Clustering for Large Applications (CLARA, used in one of the medoid papers):
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA



was (Author: fjiang6):
[~mengxr]

1)
The PAM K-Medoids complexity is:

O( (1 + Beta) * K^2 * (N - K)^2 )

where K = number of medoids,
N = number of data points,
Beta = mean number of successful swaps of data points with existing medoids.

This is basically a factor of N greater than K-Means, due to recalculation of 
the costs as part of considering every swap. This characteristic of K-Medoids 
has been widely discussed in academia, and there have been some comparisons 
(vs K-Means) and possible improvement suggestions made. It seems to be a 
tradeoff between performance and complexity when choosing between these two 
algorithms.

2)
Regarding the performance: 
I added a dataset as an example (data/mllib/sample_pam_data.txt). This was 
found in a paper comparing K-Means and K-Medoids. You can plot the data points 
to see the clusters. If a user runs 
examples/mllib/DenseKMeans.scala
examples/mllib/KMedoids.scala

on this dataset with K = 3, the correct clustering can be identified by the 
results with the smaller cost. K-Medoids is more robust in terms of 
correctness.


The papers we referred to are:

Article on K-Medoids complexity:
http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900

Article on efficient K-Medoids (Springer):
http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1

Clustering for Large Applications (CLARA, used in one of the medoid papers):
http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA


 Add k-medoids Partitioning Around Medoids (PAM) algorithm
 -

 Key: SPARK-4510
 URL: https://issues.apache.org/jira/browse/SPARK-4510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features
   Original Estimate: 0h
  Remaining Estimate: 0h

 PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
 because it minimizes a sum of pairwise dissimilarities instead of a sum of 
 squared Euclidean distances. A medoid can be defined as the object of a 
 cluster whose average dissimilarity to all the objects in the cluster is 
 minimal, i.e. it is the most centrally located point in the cluster.
 The most common realisation of k-medoid clustering is the Partitioning Around 
 Medoids (PAM) algorithm, which is as follows:
 1. Initialize: randomly select (without replacement) k of the n data points 
 as the medoids.
 2. Associate each data point to the closest medoid. ("Closest" here is 
 defined using any valid distance metric, most commonly Euclidean distance, 
 Manhattan distance or Minkowski distance.)
 3. For each medoid m and each non-medoid data point o, swap m and o and 
 compute the total cost of the configuration.
 4. Select the configuration with the lowest cost.
 5. Repeat steps 2 to 4 until there is no change in the medoids.

[jira] [Created] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-11-20 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-4510:


 Summary: Add k-medoids Partitioning Around Medoids (PAM) algorithm
 Key: SPARK-4510
 URL: https://issues.apache.org/jira/browse/SPARK-4510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang









[jira] [Updated] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-11-20 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4510:
-
Description: 
PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
because it minimizes a sum of pairwise dissimilarities instead of a sum of 
squared Euclidean distances. A medoid can be defined as the object of a 
cluster whose average dissimilarity to all the objects in the cluster is 
minimal, i.e. it is the most centrally located point in the cluster.

The most common realisation of k-medoid clustering is the Partitioning Around 
Medoids (PAM) algorithm, which is as follows:

1. Initialize: randomly select (without replacement) k of the n data points as 
the medoids.
2. Associate each data point to the closest medoid. ("Closest" here is defined 
using any valid distance metric, most commonly Euclidean distance, Manhattan 
distance or Minkowski distance.)
3. For each medoid m and each non-medoid data point o, swap m and o and 
compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.


 Add k-medoids Partitioning Around Medoids (PAM) algorithm
 -

 Key: SPARK-4510
 URL: https://issues.apache.org/jira/browse/SPARK-4510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features
   Original Estimate: 0h
  Remaining Estimate: 0h

 PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
 because it minimizes a sum of pairwise dissimilarities instead of a sum of 
 squared Euclidean distances. A medoid can be defined as the object of a 
 cluster whose average dissimilarity to all the objects in the cluster is 
 minimal, i.e. it is the most centrally located point in the cluster.
 The most common realisation of k-medoid clustering is the Partitioning Around 
 Medoids (PAM) algorithm, which is as follows:
 1. Initialize: randomly select (without replacement) k of the n data points 
 as the medoids.
 2. Associate each data point to the closest medoid. ("Closest" here is 
 defined using any valid distance metric, most commonly Euclidean distance, 
 Manhattan distance or Minkowski distance.)
 3. For each medoid m and each non-medoid data point o, swap m and o and 
 compute the total cost of the configuration.
 4. Select the configuration with the lowest cost.
 5. Repeat steps 2 to 4 until there is no change in the medoids.






[jira] [Updated] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm

2014-11-20 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4510:
-
Description: 
PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
because it minimizes a sum of pairwise dissimilarities instead of a sum of 
squared Euclidean distances. A medoid can be defined as the object of a 
cluster whose average dissimilarity to all the objects in the cluster is 
minimal, i.e. it is the most centrally located point in the cluster.

The most common realisation of k-medoid clustering is the Partitioning Around 
Medoids (PAM) algorithm, which is as follows:

1. Initialize: randomly select (without replacement) k of the n data points as 
the medoids.
2. Associate each data point to the closest medoid. ("Closest" here is defined 
using any valid distance metric, most commonly Euclidean distance, Manhattan 
distance or Minkowski distance.)
3. For each medoid m and each non-medoid data point o, swap m and o and 
compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.

The new feature for MLlib will contain 5 new files:
/main/scala/org/apache/spark/mllib/clustering/PAM.scala
/main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
/main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
/test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
/main/scala/org/apache/spark/examples/mllib/KMedoids.scala


  was:
PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
because it minimizes a sum of pairwise dissimilarities instead of a sum of 
squared Euclidean distances. A medoid can be defined as the object of a 
cluster whose average dissimilarity to all the objects in the cluster is 
minimal, i.e. it is the most centrally located point in the cluster.

The most common realisation of k-medoid clustering is the Partitioning Around 
Medoids (PAM) algorithm, which is as follows:

1. Initialize: randomly select (without replacement) k of the n data points as 
the medoids.
2. Associate each data point to the closest medoid. ("Closest" here is defined 
using any valid distance metric, most commonly Euclidean distance, Manhattan 
distance or Minkowski distance.)
3. For each medoid m and each non-medoid data point o, swap m and o and 
compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.



 Add k-medoids Partitioning Around Medoids (PAM) algorithm
 -

 Key: SPARK-4510
 URL: https://issues.apache.org/jira/browse/SPARK-4510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features
   Original Estimate: 0h
  Remaining Estimate: 0h

 PAM (k-medoids) is more robust to noise and outliers as compared to k-means 
 because it minimizes a sum of pairwise dissimilarities instead of a sum of 
 squared Euclidean distances. A medoid can be defined as the object of a 
 cluster whose average dissimilarity to all the objects in the cluster is 
 minimal, i.e. it is the most centrally located point in the cluster.
 The most common realisation of k-medoid clustering is the Partitioning Around 
 Medoids (PAM) algorithm, which is as follows:
 1. Initialize: randomly select (without replacement) k of the n data points 
 as the medoids.
 2. Associate each data point to the closest medoid. ("Closest" here is 
 defined using any valid distance metric, most commonly Euclidean distance, 
 Manhattan distance or Minkowski distance.)
 3. For each medoid m and each non-medoid data point o, swap m and o and 
 compute the total cost of the configuration.
 4. Select the configuration with the lowest cost.
 5. Repeat steps 2 to 4 until there is no change in the medoids.
 The new feature for MLlib will contain 5 new files:
 /main/scala/org/apache/spark/mllib/clustering/PAM.scala
 /main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
 /main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
 /test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
 /main/scala/org/apache/spark/examples/mllib/KMedoids.scala






[jira] [Created] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-4259:


 Summary: Add Spectral Clustering Algorithm with Gaussian 
Similarity Function
 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang









[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct:
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
 into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.
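
A minimal sketch of the design above (illustrative only; it uses Breeze, which 
MLlib already depends on, and all names here are hypothetical). It builds S 
with the Gaussian similarity function, uses S as the weighted adjacency matrix 
W, forms L = D - W, and extracts the first k eigenvectors; the rows of U would 
then be clustered with k-means:

import breeze.linalg.{Axis, DenseMatrix, diag, eigSym, sum}

object SpectralSketch {
  def main(args: Array[String]): Unit = {
    val points = Array(Array(0.0, 0.0), Array(0.1, 0.0),
                       Array(5.0, 5.0), Array(5.1, 5.0))
    val n = points.length
    val k = 2
    val sigma = 1.0

    // Similarity matrix S ∈ R^(n×n) via the Gaussian similarity function.
    val s = DenseMatrix.tabulate(n, n) { (i, j) =>
      val d2 = points(i).zip(points(j)).map { case (a, b) => (a - b) * (a - b) }.sum
      math.exp(-d2 / (2 * sigma * sigma))
    }
    val w = s                        // weighted adjacency matrix W
    val d = diag(sum(w, Axis._1))    // degree diagonal matrix D
    val l = d - w                    // unnormalized Laplacian L = D - W

    // First k eigenvectors u_1, ..., u_k (eigSym sorts eigenvalues ascending).
    val u = eigSym(l).eigenvectors(::, 0 until k)  // U ∈ R^(n×k)
    println(u)  // each row y_i is the spectral embedding of point i
  }
}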






[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct:
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
 into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.






[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct:
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
 into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.






[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct:
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
 into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.






[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct:
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
 into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.






[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct:
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
 into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.






[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design looks like below:

Unnormalized spectral clustering

Input: raw data points, number k of clusters to construct:

• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal 
matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th row 
of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design looks like below:
 Unnormalized spectral clustering
 Input: raw data points, number k of clusters to construct:
 • Compute the similarity matrix S ∈ R^(n×n).
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the degree 
 diagonal matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
 into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design is sketched below:

Unnormalized spectral clustering

Input: raw data points, the number k of clusters to construct.

• Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
function.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
degree matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design is sketched below:

Unnormalized spectral clustering

Input: raw data points, the number k of clusters to construct.

• Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
function.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
degree matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design is sketched below:
 Unnormalized spectral clustering
 Input: raw data points, the number k of clusters to construct.
 • Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
 function.
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
 degree matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means 
 algorithm into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design is sketched below:

Unnormalized spectral clustering

Input: raw data points, the number k of clusters to construct.

• Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
function.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
degree matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design is sketched below:

Unnormalized spectral clustering

Input: raw data points, the number k of clusters to construct.

• Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
function.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
degree matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design is sketched below:
 Unnormalized spectral clustering
 Input: raw data points, the number k of clusters to construct.
 • Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
 function.
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
 degree matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means 
 algorithm into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design is sketched below:

Unnormalized spectral clustering

Input: raw data points, the number k of clusters to construct.

• Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
function.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
degree matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design is sketched below:

Unnormalized spectral clustering

Input: raw data points, the number k of clusters to construct.

• Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
function.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
degree matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design is sketched below:
 Unnormalized spectral clustering
 Input: raw data points, the number k of clusters to construct.
 • Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
 function.
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
 degree matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means 
 algorithm into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function

2014-11-05 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-4259:
-
Description: 
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design is sketched below:

Unnormalized spectral clustering

Input: raw data points, the number k of clusters to construct.

• Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
function.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
degree matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.
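
For the final step, the rows of U can be handed to MLlib's existing k-means 
as-is. A minimal usage sketch, assuming an existing SparkContext `sc` and an 
n×k eigenvector matrix `u` produced by an embedding step like the one sketched 
earlier (both names are assumptions for illustration):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Cluster the rows y_i of the n×k eigenvector matrix `u` with MLlib k-means
// and return the cluster index assigned to each point i.
def clusterEmbedding(sc: SparkContext,
                     u: breeze.linalg.DenseMatrix[Double],
                     k: Int): Array[Int] = {
  val rows = (0 until u.rows).map(i => Vectors.dense(u(i, ::).t.toArray))
  val model = KMeans.train(sc.parallelize(rows), k, 50) // k clusters, at most 50 iterations
  rows.map(model.predict).toArray
}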



  was:
In recent years, spectral clustering has become one of the most popular modern 
clustering algorithms. It is simple to implement, can be solved efficiently by 
standard linear algebra software, and very often outperforms traditional 
clustering algorithms such as the k-means algorithm.

We implemented the unnormalized graph Laplacian matrix with a Gaussian 
similarity function. A brief design is sketched below:

Unnormalized spectral clustering

Input: raw data points, the number k of clusters to construct.

• Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
function.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
degree matrix.
• Compute the first k eigenvectors u_1, ..., u_k of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
columns.
• For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
row of U.
• Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means algorithm 
into clusters C_1, ..., C_k.

Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.




 Add Spectral Clustering Algorithm with Gaussian Similarity Function
 ---

 Key: SPARK-4259
 URL: https://issues.apache.org/jira/browse/SPARK-4259
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
  Labels: features

 In recent years, spectral clustering has become one of the most popular 
 modern clustering algorithms. It is simple to implement, can be solved 
 efficiently by standard linear algebra software, and very often outperforms 
 traditional clustering algorithms such as the k-means algorithm.
 We implemented the unnormalized graph Laplacian matrix with a Gaussian 
 similarity function. A brief design is sketched below:
 Unnormalized spectral clustering
 Input: raw data points, the number k of clusters to construct.
 • Compute the similarity matrix S ∈ R^(n×n) with the Gaussian similarity 
 function.
 • Construct a similarity graph. Let W be its weighted adjacency matrix.
 • Compute the unnormalized Laplacian L = D - W, where D is the diagonal 
 degree matrix.
 • Compute the first k eigenvectors u_1, ..., u_k of L.
 • Let U ∈ R^(n×k) be the matrix containing the vectors u_1, ..., u_k as 
 columns.
 • For i = 1, ..., n, let y_i ∈ R^k be the vector corresponding to the i-th 
 row of U.
 • Cluster the points (y_i), i = 1, ..., n, in R^k with the k-means 
 algorithm into clusters C_1, ..., C_k.
 Output: Clusters A_1, ..., A_k with A_i = { j | y_j ∈ C_i }.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3189) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3189:
-

Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-3188

 Add Robust Regression Algorithm with Turkey bisquare weight  function 
 (Biweight Estimates) 
 ---

 Key: SPARK-3189
 URL: https://issues.apache.org/jira/browse/SPARK-3189
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume normally distributed errors and can 
 behave badly when the errors are heavy-tailed. In practice we encounter 
 various types of data, so we need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 The Turkey bisquare weight function produces an M-estimator that is more 
 resistant to regression outliers than the Huber M-estimator (Andersen 
 2008: 19).
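
For reference, a minimal sketch of the bisquare (biweight) weight function. 
The tuning constant c = 4.685, the usual choice for 95% efficiency under 
Gaussian errors, is an assumption for illustration, not a value fixed by 
this ticket:

object BisquareWeight {
  // Tukey/biweight: full weight near zero, smoothly decaying, and exactly
  // zero beyond c, so gross outliers are dropped from the fit entirely
  // (the Huber weights, by contrast, only downweight them).
  def weight(residual: Double, c: Double = 4.685): Double = {
    val u = residual / c
    if (math.abs(u) < 1.0) { val t = 1.0 - u * u; t * t } else 0.0
  }
}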



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3188:
-

Summary: Add Robust Regression Algorithm with Tukey bisquare weight  
function (Biweight Estimates)   (was: Add Robust Regression Algorithm with 
Turkey bisquare weight  function (Biweight Estimates) )

 Add Robust Regression Algorithm with Tukey bisquare weight  function 
 (Biweight Estimates) 
 --

 Key: SPARK-3188
 URL: https://issues.apache.org/jira/browse/SPARK-3188
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume normally distributed errors and can 
 behave badly when the errors are heavy-tailed. In practice we encounter 
 various types of data, so we need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 The Turkey bisquare weight function produces an M-estimator that is more 
 resistant to regression outliers than the Huber M-estimator (Andersen 
 2008: 19).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)

2014-08-25 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3188:
-

Description: 
Linear least squares estimates assume normally distributed errors and can 
behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include Robust Regression to employ a 
fitting criterion that is not as vulnerable as least squares.

The Tukey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).
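
How such a weight function enters the fit is perhaps clearer from an IRLS 
(iteratively reweighted least squares) loop. The sketch below is a plain 
Breeze illustration under assumed names (the weight function parameter, the 
MAD scale estimate, the iteration count), not the proposed MLlib design:

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.abs
import breeze.stats.median

object IrlsSketch {
  // Fit y ≈ X * beta, reweighting each observation by `weight` applied to
  // its standardized residual (e.g. the bisquare weight sketched earlier).
  // Assumes the residuals' MAD is nonzero.
  def fit(x: DenseMatrix[Double], y: DenseVector[Double],
          weight: Double => Double, iters: Int = 20): DenseVector[Double] = {
    var beta = (x.t * x) \ (x.t * y)  // ordinary least squares start
    for (_ <- 1 to iters) {
      val r = y - x * beta            // residuals under the current fit
      val s = median(abs(r)) / 0.6745 // robust scale estimate (MAD)
      val w = DenseVector.tabulate(x.rows)(i => weight(r(i) / s))
      val wx = DenseMatrix.tabulate(x.rows, x.cols)((i, j) => w(i) * x(i, j))
      beta = (wx.t * x) \ (wx.t * y)  // weighted normal equations X'WX beta = X'Wy
    }
    beta
  }
}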



  was:
Linear least squares estimates assume normally distributed errors and can 
behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include Robust Regression to employ a 
fitting criterion that is not as vulnerable as least squares.

The Turkey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).




 Add Robust Regression Algorithm with Tukey bisquare weight  function 
 (Biweight Estimates) 
 --

 Key: SPARK-3188
 URL: https://issues.apache.org/jira/browse/SPARK-3188
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume normally distributed errors and can 
 behave badly when the errors are heavy-tailed. In practice we encounter 
 various types of data, so we need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 The Tukey bisquare weight function produces an M-estimator that is more 
 resistant to regression outliers than the Huber M-estimator (Andersen 
 2008: 19).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)

2014-08-23 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang updated SPARK-3188:
-

Description: 
Linear least squares estimates assume normally distributed errors and can 
behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include Robust Regression to employ a 
fitting criterion that is not as vulnerable as least squares.

The Turkey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).



  was:
Linear least squares estimates assume normally distributed errors and can 
behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include Robust Regression to employ a 
fitting criterion that is not as vulnerable as least squares.

The Turkey bisquare weight function, also referred to as the biweight function, 
produces and M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).




 Add Robust Regression Algorithm with Turkey bisquare weight  function 
 (Biweight Estimates) 
 ---

 Key: SPARK-3188
 URL: https://issues.apache.org/jira/browse/SPARK-3188
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume normally distributed errors and can 
 behave badly when the errors are heavy-tailed. In practice we encounter 
 various types of data, so we need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 The Turkey bisquare weight function produces an M-estimator that is more 
 resistant to regression outliers than the Huber M-estimator (Andersen 
 2008: 19).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3181:


 Summary: Add Robust Regression Algorithm with Huber Estimator
 Key: SPARK-3181
 URL: https://issues.apache.org/jira/browse/SPARK-3181
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume normally distributed errors and can 
behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include Robust Regression to employ a 
fitting criterion that is not as vulnerable as least squares.

In 1973, Huber introduced M-estimation ("maximum likelihood type" estimation) 
for regression. The method is resistant to outliers in the response variable 
and has been widely used.

The new feature for MLlib will contain 3 new files:
/main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
/test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
/main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala

and one new class, HuberRobustGradient, in 
/main/scala/org/apache/spark/mllib/optimization/Gradient.scala
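
As a rough illustration of the quantities HuberRobustGradient would compute, 
here is a self-contained sketch of the Huber loss and its derivative; the 
name and the threshold delta = 1.345 (the textbook constant giving roughly 
95% efficiency under Gaussian errors) are assumptions, not the patch itself:

object HuberLossSketch {
  // Huber loss on a residual r: quadratic inside the threshold `delta`,
  // linear beyond it, so large residuals stop dominating the fit.
  def loss(r: Double, delta: Double = 1.345): Double =
    if (math.abs(r) <= delta) 0.5 * r * r
    else delta * (math.abs(r) - 0.5 * delta)

  // dLoss/dr: the residual itself inside the threshold, clipped to
  // +/- delta outside; scaling the feature vector by this value gives
  // the gradient with respect to the weights.
  def lossDerivative(r: Double, delta: Double = 1.345): Double =
    if (math.abs(r) <= delta) r else delta * math.signum(r)
}

The actual class would plug these per-example quantities into MLlib's 
gradient descent machinery.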




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3188) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3188:


 Summary: Add Robust Regression Algorithm with Turkey bisquare 
weight  function (Biweight Estimates) 
 Key: SPARK-3188
 URL: https://issues.apache.org/jira/browse/SPARK-3188
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume normally distributed errors and can 
behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include Robust Regression to employ a 
fitting criterion that is not as vulnerable as least squares.

The Turkey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3189) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Jiang closed SPARK-3189.


Resolution: Duplicate

 Add Robust Regression Algorithm with Turkey bisquare weight  function 
 (Biweight Estimates) 
 ---

 Key: SPARK-3189
 URL: https://issues.apache.org/jira/browse/SPARK-3189
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
  Labels: features
 Fix For: 1.1.1, 1.2.0

   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume normally distributed errors and can 
 behave badly when the errors are heavy-tailed. In practice we encounter 
 various types of data, so we need to include Robust Regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 The Turkey bisquare weight function produces an M-estimator that is more 
 resistant to regression outliers than the Huber M-estimator (Andersen 
 2008: 19).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3189) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)

2014-08-22 Thread Fan Jiang (JIRA)
Fan Jiang created SPARK-3189:


 Summary: Add Robust Regression Algorithm with Turkey bisquare 
weight  function (Biweight Estimates) 
 Key: SPARK-3189
 URL: https://issues.apache.org/jira/browse/SPARK-3189
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Fan Jiang
Priority: Critical
 Fix For: 1.1.1, 1.2.0


Linear least squares estimates assume normally distributed errors and can 
behave badly when the errors are heavy-tailed. In practice we encounter 
various types of data, so we need to include Robust Regression to employ a 
fitting criterion that is not as vulnerable as least squares.

The Turkey bisquare weight function, also referred to as the biweight function, 
produces an M-estimator that is more resistant to regression outliers than the 
Huber M-estimator (Andersen 2008: 19).





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org