[jira] [Created] (SPARK-8497) Graph Clique(Complete Connected Sub-graph) Discovery Algorithm
Fan Jiang created SPARK-8497: Summary: Graph Clique (Complete Connected Sub-graph) Discovery Algorithm Key: SPARK-8497 URL: https://issues.apache.org/jira/browse/SPARK-8497 Project: Spark Issue Type: New Feature Components: GraphX, ML, MLlib, Spark Core Reporter: Fan Jiang In recent years the social network industry has had high demand for complete connected sub-graph discovery, and so has telecom. Much like the follower graph from Twitter, the calls and other activities in the telecom world form a huge social graph, and because of the nature of the communication medium it reflects the strongest inter-person relationships, so graph-based analysis can reveal tremendous value from telecom connections. We need an algorithm in Spark to find ALL of the strongest completely connected sub-graphs (called cliques here) for EVERY person in the network, which will be one of the starting points for understanding a user's social behaviour. At Huawei we have many real-world use cases that involve telecom social graphs with tens of billions of edges and hundreds of millions of vertices, and the number of cliques will also be in the tens of millions. The graph changes quickly, which means we need to analyse graph patterns frequently (one result per day/week over a moving time window spanning multiple months). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
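The issue names no algorithm, but the maximal-clique discovery it asks for can be sketched on a single machine with the classic Bron-Kerbosch enumeration; the function names below are illustrative only (a Spark/GraphX version would have to distribute this search, since the recursion as written does not scale to billions of edges):

```python
# Sketch: maximal-clique enumeration via Bron-Kerbosch (single-machine reference).
def bron_kerbosch(R, P, X, adj, out):
    """Collect every maximal clique that extends the partial clique R.
    P = candidate vertices, X = already-explored vertices."""
    if not P and not X:
        out.append(frozenset(R))   # R cannot be extended: it is maximal
        return
    for v in list(P):
        bron_kerbosch(R | {v}, P & adj[v], X & adj[v], adj, out)
        P.remove(v)                # v is fully explored ...
        X.add(v)                   # ... so exclude it from later cliques

def maximal_cliques(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    out = []
    bron_kerbosch(set(), set(adj), set(), adj, out)
    return out
```

On the toy graph with edges 0-1, 1-2, 0-2, 2-3 this yields the two maximal cliques {0, 1, 2} and {2, 3}, which is exactly the "every person's strongest complete sub-graph" output the issue describes.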
[jira] [Issue Comment Deleted] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3530: - Comment: was deleted (was: Hi Xiangrui, Which part of this pipeline project would you like us to work on? Thanks! ) Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.2.0 This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks!
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502364#comment-14502364 ] Fan Jiang commented on SPARK-3530: -- Hi Xiangrui, Which part of this pipeline project would you like us to work on? Thanks! Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.2.0 This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks!
[jira] [Created] (SPARK-6417) Add Linear Programming algorithm
Fan Jiang created SPARK-6417: Summary: Add Linear Programming algorithm Key: SPARK-6417 URL: https://issues.apache.org/jira/browse/SPARK-6417 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Linear programming is the problem of finding a vector x that minimizes a linear function f^T x subject to linear constraints: min_x f^T x such that one or more of the following hold: A·x ≤ b, Aeq·x = beq, l ≤ x ≤ u.
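As a concrete illustration of that formulation, the same constraint shapes (A·x ≤ b plus bounds l ≤ x ≤ u) can be solved with SciPy's `linprog` as a single-node reference; the numbers here are a made-up toy instance, not from the issue, and an MLlib version would need a distributed solver:

```python
from scipy.optimize import linprog

# Maximize x0 + x1, i.e. minimize f^T x with f = [-1, -1],
# subject to A·x <= b (here: x0 + x1 <= 4) and bounds 0 <= x_i <= 3.
f = [-1.0, -1.0]
A = [[1.0, 1.0]]
b = [4.0]
res = linprog(f, A_ub=A, b_ub=b, bounds=[(0.0, 3.0), (0.0, 3.0)])
print(res.x, res.fun)   # the optimum lies on the constraint line x0 + x1 = 4
```

Equality constraints would be passed the same way via the `A_eq`/`b_eq` arguments, matching the Aeq·x = beq part of the problem statement.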
[jira] [Updated] (SPARK-4259) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. Power iteration clustering is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
• computes the Gaussian distance between all pairs of points and represents these distances in an affinity matrix
• calculates a normalized affinity matrix
• calculates the principal eigenvalue and eigenvector
• clusters each of the input points according to its principal eigenvector component value
Details of this algorithm are found in [Power Iteration Clustering, Lin and Cohen|http://www.icml2010.org/papers/387.pdf]
was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix with a Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering. Input: raw data points, number k of clusters to construct:
• Compute the similarity matrix S ∈ R^{n×n}.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal matrix.
• Compute the first k eigenvectors u1, . . . , uk of L.
• Let U ∈ R^{n×k} be the matrix containing the vectors u1, . . . , uk as columns.
• For i = 1, . . . , n, let yi ∈ R^k be the vector corresponding to the i-th row of U.
• Cluster the points (yi), i = 1, . . . , n, in R^k with the k-means algorithm into clusters C1, . . . , Ck.
Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }.
Summary: Add Power Iteration Clustering Algorithm with Gaussian Similarity Function (was: Add Spectral Clustering Algorithm with Gaussian Similarity Function) Add Power Iteration Clustering Algorithm with Gaussian Similarity Function -- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Assignee: Fan Jiang Labels: features
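The internal steps just listed (Gaussian affinities, row normalization, truncated power iteration) can be sketched with NumPy; the function name and parameters below are made up for the example, and it returns the one-dimensional pseudo-eigenvector that the algorithm would then hand to k-means:

```python
import numpy as np

def pic_embedding(points, sigma=1.0, iters=30):
    """Truncated power iteration on the row-normalized Gaussian affinity matrix."""
    # Gaussian distances between all pairs of points -> affinity matrix
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # normalized affinity matrix (each row sums to 1)
    W = A / A.sum(axis=1, keepdims=True)
    # truncated power iteration toward the principal eigenvector
    v = A.sum(axis=1) / A.sum()          # degree-based starting vector
    for _ in range(iters):
        v = W @ v
        v /= np.abs(v).sum()
    # final step (not shown): cluster the entries of v with 1-D k-means
    return v
```

With two well-separated groups of points, the entries of v settle toward near-constant per-group values, so clustering the scalar components recovers the groups, which is the intuition behind the last step of the description.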
[jira] [Commented] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm
[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243900#comment-14243900 ] Fan Jiang commented on SPARK-4510: -- [~mengxr] 1) The PAM K-Medoids complexity is: O((1 + Beta) * K^2 * (N - K)^2), where K = number of medoids, N = number of data points, and Beta = mean number of successful swaps of data points with existing medoids. This is basically a factor of N greater than K-Means, due to the recalculation of the costs as part of considering every swap. This characteristic of K-Medoids has been widely discussed in academia, and there have been some comparisons (vs K-Means) and possible improvement suggestions. It seems to be a tradeoff between performance and complexity when choosing between these 2 algorithms. 2) Regarding the performance: I added a dataset as an example (data/mllib/sample_pam_data.txt). This was found in a paper comparing K-Means and K-Medoids. You can plot the data points to see the clusters. If a user runs examples/mllib/DenseKMeans.scala and examples/mllib/Kmedoids.scala on this dataset for K = 3, the correct clustering can be visualized by the results with the smaller cost. K-Medoids is more robust in terms of correctness.
The papers we referred to are:
• Article on K-Medoids complexity: http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900
• Article on efficient K-Medoids (Springer): http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1
• Clustering for Large Applications, CLARA (used in one of the medoid papers): http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA
Add k-medoids Partitioning Around Medoids (PAM) algorithm - Key: SPARK-4510 URL: https://issues.apache.org/jira/browse/SPARK-4510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Assignee: Fan Jiang Labels: features Original Estimate: 0h Remaining Estimate: 0h PAM (k-medoids) is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster. The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, which is as follows:
1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point with the closest medoid. ("Closest" is defined using any valid distance metric, most commonly Euclidean, Manhattan or Minkowski distance.)
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
The new feature for MLlib will contain 5 new files:
/main/scala/org/apache/spark/mllib/clustering/PAM.scala
/main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
/main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
/test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
/main/scala/org/apache/spark/examples/mllib/KMedoids.scala
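For reference, the swap loop the issue describes can be sketched in a few lines of single-machine Python (illustrative only; the Scala files listed above would carry the actual distributed MLlib implementation):

```python
import random

def pam(points, k, dist, seed=0):
    """Partitioning Around Medoids: greedy swap search over medoid sets."""
    rng = random.Random(seed)
    n = len(points)
    medoids = rng.sample(range(n), k)        # step 1: random initial medoids

    def cost(meds):
        # total dissimilarity of every point to its closest medoid
        return sum(min(dist(p, points[m]) for m in meds) for p in points)

    best = cost(medoids)
    improved = True
    while improved:                          # repeat until no swap helps
        improved = False
        for i in range(k):                   # for each medoid m ...
            for o in range(n):               # ... and each non-medoid point o
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o                 # swap m and o
                c = cost(trial)
                if c < best:                 # keep the cheaper configuration
                    medoids, best, improved = trial, c, True
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels, best
```

Because the cost of every candidate swap is recomputed from scratch, the run time matches the O((1 + Beta) * K^2 * (N - K)^2) figure discussed in the comment, which is the factor-of-N gap versus k-means.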
[jira] [Comment Edited] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm
[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243900#comment-14243900 ] Fan Jiang edited comment on SPARK-4510 at 12/12/14 9:19 AM: [~mengxr] 1) The PAM K-Medoids complexity is: O((1 + Beta) * K^2 * (N - K)^2), where K = number of medoids, N = number of data points, and Beta = mean number of successful swaps of data points with existing medoids. This is basically a factor of N greater than K-Means, due to the recalculation of the costs as part of considering every swap. This characteristic of K-Medoids has been widely discussed in academia, and there have been some comparisons (vs K-Means) and possible improvement suggestions. It seems to be a tradeoff between performance and complexity when choosing between these 2 algorithms. 2) Regarding the performance: I added a dataset as an example (data/mllib/sample_pam_data.txt). This was found in a paper comparing K-Means and K-Medoids. You can plot the data points to see the clusters. If a user runs examples/mllib/DenseKMeans.scala and examples/mllib/Kmedoids.scala on this dataset for K = 3, the correct clustering can be visualized by the results with the smaller cost. K-Medoids is more robust in terms of correctness.
The papers we referred to are: Article KMedoids complexity http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900 Article on efficient K-Medoids (Springer) http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1 Clustering for Large Applications CLARA (used in one of Medoid papers) http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA was (Author: fjiang6): [~mengxr] 1) The PAM K-Medoids complexity is: O( (1 + Beta) * K^2 * (N - K)^2) where K = Number of Medoids N = Number of datapoints Beta = Mean # of successful swaps of DataPoints with existing Medoids This is basically a factor of N greater than K-Means - due to recalculation of the Costs as part of considering every swap. This characteristic of K-Medoids has been widely discussed in the academia. And there have been some comparisons (vs K-Means) and possible improvement suggestions made. Seems like it's a tradeoff between performance and complexity when choosing from these 2 algorithms. 2) Regarding the performance: I added a dataset as an example (data/mllib/sample_pam_data.txt). This was found on a paper comparing K-Means and K-Medoids. You can plot the datapoints to see the clusters. If user runs examples/mllib/DenseKMeans.scala examples/mllib/Kmedoids.scala on this dataset for K = 3, the correct clustering can be visualized by the results with a smaller cost. K-Medoids has is more robust in terms of correctness. 
The papers we referred to are: Article KMedoids complexity http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900 Article on efficient K-Medoids (Springer) http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1 Clustering for Large Applications CLARA (used in one of Medoid papers) http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA Add k-medoids Partitioning Around Medoids (PAM) algorithm - Key: SPARK-4510 URL: https://issues.apache.org/jira/browse/SPARK-4510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Assignee: Fan Jiang Labels: features Original Estimate: 0h Remaining Estimate: 0h PAM (k-medoids) is more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal i.e. it is a most centrally located point in the cluster. The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm and is as follows: Initialize: randomly select (without replacement) k of the n data points as the medoids Associate each data point to the closest medoid. (closest here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance) For each medoid m For each non-medoid data point o Swap m and o and compute the total cost of the configuration Select the configuration with the lowest cost. Repeat steps 2 to 4 until there is
no change in the medoid.
[jira] [Comment Edited] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm
[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14243900#comment-14243900 ] Fan Jiang edited comment on SPARK-4510 at 12/12/14 9:20 AM: [~mengxr] 1) The PAM K-Medoids complexity is: O((1 + Beta) * K^2 * (N - K)^2), where K = number of medoids, N = number of data points, and Beta = mean number of successful swaps of data points with existing medoids. This is basically a factor of N greater than K-Means, due to the recalculation of the costs as part of considering every swap. This characteristic of K-Medoids has been widely discussed in academia, and there have been some comparisons (vs K-Means) and possible improvement suggestions. It seems to be a tradeoff between performance and complexity when choosing between these 2 algorithms. 2) Regarding the performance: I added a dataset as an example (data/mllib/sample_pam_data.txt). This was found in a paper comparing K-Means and K-Medoids. You can plot the data points to see the clusters. If a user runs examples/mllib/DenseKMeans.scala and examples/mllib/Kmedoids.scala on this dataset for K = 3, the correct clustering can be visualized by the results with the smaller cost. K-Medoids is more robust in terms of correctness.
The papers we referred to are: Article KMedoids complexity http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900 Article on efficient K-Medoids (Springer) http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1 Clustering for Large Applications CLARA (used in one of Medoid papers) http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA was (Author: fjiang6): [~mengxr] 1) The PAM K-Medoids complexity is: O( (1 + Beta) * K^2 * (N - K)^2) where K = Number of Medoids N = Number of datapoints Beta = Mean # of successful swaps of DataPoints with existing Medoids This is basically a factor of N greater than K-Means - due to recalculation of the Costs as part of considering every swap. This characteristic of K-Medoids has been widely discussed in the academia. And there have been some comparisons (vs K-Means) and possible improvement suggestions made. Seems like it's a tradeoff between performance and complexity when choosing from these 2 algorithms. 2) Regarding the performance: I added a dataset as an example (data/mllib/sample_pam_data.txt). This was found on a paper comparing K-Means and K-Medoids. You can plot the datapoints to see the clusters. If user runs examples/mllib/DenseKMeans.scala examples/mllib/Kmedoids.scala on this dataset for K = 3, the correct clustering can be visualized by the results with a smaller cost. K-Medoids has is more robust in terms of correctness. 
The papers we referred to are: Article KMedoids complexity http://www.researchgate.net/profile/Dr_Velmurugan_T/publication/47554407_Computational_Complexity_between_K-Means_and_K-Medoids_Clustering_Algorithms_for_Normal_and_Uniform_Distributions_of_Data_Points/links/0fcfd50dc5cd08aa2900 Article on efficient K-Medoids (Springer) http://link.springer.com/chapter/10.1007%2F3-540-46145-0_7#page-1 Clustering for Large Applications CLARA (used in one of Medoid papers) http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/CLARA Add k-medoids Partitioning Around Medoids (PAM) algorithm - Key: SPARK-4510 URL: https://issues.apache.org/jira/browse/SPARK-4510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Assignee: Fan Jiang Labels: features Original Estimate: 0h Remaining Estimate: 0h PAM (k-medoids) is more robust to noise and outliers as compared to k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal i.e. it is a most centrally located point in the cluster. The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm and is as follows: Initialize: randomly select (without replacement) k of the n data points as the medoids Associate each data point to the closest medoid. (closest here is defined using any valid distance metric, most commonly Euclidean distance, Manhattan distance or Minkowski distance) For each medoid m For each non-medoid data point o Swap m and o and compute the total cost of the configuration Select the configuration with the lowest cost. Repeat steps 2 to 4 until there is no
change in the medoid.
[jira] [Created] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm
Fan Jiang created SPARK-4510: Summary: Add k-medoids Partitioning Around Medoids (PAM) algorithm Key: SPARK-4510 URL: https://issues.apache.org/jira/browse/SPARK-4510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang
[jira] [Updated] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm
[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4510: - Description: PAM (k-medoids) is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster. The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, which is as follows:
1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point with the closest medoid. ("Closest" is defined using any valid distance metric, most commonly Euclidean, Manhattan or Minkowski distance.)
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
Add k-medoids Partitioning Around Medoids (PAM) algorithm - Key: SPARK-4510 URL: https://issues.apache.org/jira/browse/SPARK-4510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features Original Estimate: 0h Remaining Estimate: 0h
[jira] [Updated] (SPARK-4510) Add k-medoids Partitioning Around Medoids (PAM) algorithm
[ https://issues.apache.org/jira/browse/SPARK-4510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4510: - Description: PAM (k-medoids) is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster. The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, which is as follows:
1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point with the closest medoid. ("Closest" is defined using any valid distance metric, most commonly Euclidean, Manhattan or Minkowski distance.)
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
The new feature for MLlib will contain 5 new files:
/main/scala/org/apache/spark/mllib/clustering/PAM.scala
/main/scala/org/apache/spark/mllib/clustering/PAMModel.scala
/main/scala/org/apache/spark/mllib/clustering/LocalPAM.scala
/test/scala/org/apache/spark/mllib/clustering/PAMSuite.scala
/main/scala/org/apache/spark/examples/mllib/KMedoids.scala
was: PAM (k-medoids) is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal, i.e. it is the most centrally located point in the cluster. The most common realisation of k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm, following steps 1 to 5 above.
Add k-medoids Partitioning Around Medoids (PAM) algorithm - Key: SPARK-4510 URL: https://issues.apache.org/jira/browse/SPARK-4510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features Original Estimate: 0h Remaining Estimate: 0h
[jira] [Created] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
Fan Jiang created SPARK-4259: Summary: Add Spectral Clustering Algorithm with Gaussian Similarity Function Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix with a Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering. Input: raw data points, number k of clusters to construct:
• Compute the similarity matrix S ∈ R^{n×n}.
• Construct a similarity graph. Let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the degree diagonal matrix.
• Compute the first k eigenvectors u1, . . . , uk of L.
• Let U ∈ R^{n×k} be the matrix containing the vectors u1, . . . , uk as columns.
• For i = 1, . . . , n, let yi ∈ R^k be the vector corresponding to the i-th row of U.
• Cluster the points (yi), i = 1, . . . , n, in R^k with the k-means algorithm into clusters C1, . . . , Ck.
Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci }.
Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features
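The recipe in the description maps almost line for line onto NumPy; the sketch below (function name illustrative, Gaussian similarity assumed as stated) builds the unnormalized Laplacian and returns the rows y_i that the final k-means step would cluster:

```python
import numpy as np

def spectral_embedding(points, k, sigma=1.0):
    """Rows of the returned matrix U are the y_i from the description."""
    # Similarity matrix S via the Gaussian kernel; W = weighted adjacency matrix
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))       # degree diagonal matrix
    L = D - W                        # unnormalized Laplacian L = D - W
    _, vecs = np.linalg.eigh(L)      # eigenvalues returned in ascending order
    return vecs[:, :k]               # U: the first k eigenvectors as columns
```

Running k-means on the rows of U then produces the clusters C1, . . . , Ck, and mapping each original point j to the cluster of its row yj yields the output sets A1, . . . , Ak.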
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. 
• Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j| yj ∈ Ci}. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {{j | yj ∈ Ci}. was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. 
• Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak {with Ai = {j | yj ∈ Ci}. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {{j | yj ∈ Ci}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak {with Ai = {j | yj ∈ Ci}. was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. 
• Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak {with Ai = {j | yj ∈ Ci}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. 
• Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci}. was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. 
• Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai ={j | yj ∈ Ci}. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = `{ j | yj ∈ Ci }. was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. 
• Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = { j | yj ∈ Ci}. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = `{ j | yj ∈ Ci }. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai ={j | yj ∈ Ci}. was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. 
• Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai ={j | yj ∈ Ci}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259: - Description: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. was: In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. 
• Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {{j | yj ∈ Ci}. Add Spectral Clustering Algorithm with Gaussian Similarity Function --- Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. We implemented the unnormalized graph Laplacian matrix by Gaussian similarity function. A brief design looks like below: Unnormalized spectral clustering Input: raw data points, number k of clusters to construct: • Comupte Similarity matrix S ∈ Rn×n, . • Construct a similarity graph. Let W be its weighted adjacency matrix. • Compute the unnormalized Laplacian L = D - W. where D is the Degree diagonal matrix • Compute the first k eigenvectors u1, . . . , uk of L. • Let U ∈ Rn×k be the matrix containing the vectors u1, . . . , uk as columns. • For i = 1, . . . , n, let yi ∈ Rk be the vector corresponding to the i-th row of U. • Cluster the points (yi)i=1,...,n in Rk with the k-means algorithm into clusters C1, . . . , Ck. Output: Clusters A1, . . . , Ak with Ai = {j | yj ∈ Ci}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259:
Description:
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as k-means. We implemented the unnormalized graph Laplacian matrix with a Gaussian similarity function. A brief design is outlined below.

Unnormalized spectral clustering
Input: raw data points and the number k of clusters to construct.
• Compute the similarity matrix S ∈ R^(n×n).
• Construct a similarity graph, and let W be its weighted adjacency matrix.
• Compute the unnormalized Laplacian L = D - W, where D is the diagonal degree matrix.
• Compute the first k eigenvectors u1, ..., uk of L.
• Let U ∈ R^(n×k) be the matrix containing the vectors u1, ..., uk as columns.
• For i = 1, ..., n, let yi ∈ R^k be the vector corresponding to the i-th row of U.
• Cluster the points (yi), i = 1, ..., n, in R^k with the k-means algorithm into clusters C1, ..., Ck.
Output: clusters A1, ..., Ak with Ai = { j | yj ∈ Ci }.

was: (the same description with minor formatting differences, including a stray "{{" in the output set notation)

Add Spectral Clustering Algorithm with Gaussian Similarity Function
Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features
(The issue body repeats the description above.)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
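The steps above can be sketched in plain NumPy. This is only an illustration of the unnormalized algorithm, not the proposed MLlib implementation: the Gaussian bandwidth sigma, the Lloyd's k-means loop, and the deterministic farthest-point initialization are assumptions made to keep the example self-contained.

```python
import numpy as np

def spectral_clustering(X, k, sigma=1.0, iters=50):
    """Unnormalized spectral clustering with a Gaussian similarity function."""
    n = len(X)
    # Similarity matrix S in R^{n x n} via the Gaussian (RBF) kernel.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)              # similarity graph: no self-loops
    D = np.diag(W.sum(axis=1))            # diagonal degree matrix
    L = D - W                             # unnormalized Laplacian L = D - W
    # First k eigenvectors (smallest eigenvalues) of L as the columns of U.
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :k]                       # row i of U is y_i in R^k
    # Cluster the rows y_i with a plain Lloyd's k-means loop,
    # seeded by iterative farthest-point initialization (deterministic).
    idx = [0]
    for _ in range(k - 1):
        d = np.min(((U[:, None] - U[idx][None]) ** 2).sum(axis=-1), axis=1)
        idx.append(int(np.argmax(d)))
    centers = U[idx]
    for _ in range(iters):
        labels = np.argmin(((U[:, None] - centers[None]) ** 2).sum(axis=-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = U[labels == j].mean(axis=0)
    return labels
```

On two well-separated groups of points the first k eigenvectors of L are nearly indicator vectors of the connected components, so k-means on the rows of U recovers the groups.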
[jira] [Updated] (SPARK-4259) Add Spectral Clustering Algorithm with Gaussian Similarity Function
[ https://issues.apache.org/jira/browse/SPARK-4259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-4259:
Description: (the same description as in the previous update, with the output set notation corrected to Ai = { j | yj ∈ Ci })

was: (the previous revision, with a stray "{{" in the set notation)

Add Spectral Clustering Algorithm with Gaussian Similarity Function
Key: SPARK-4259 URL: https://issues.apache.org/jira/browse/SPARK-4259 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Fan Jiang Labels: features
(The issue body repeats the corrected description.)
[jira] [Updated] (SPARK-3189) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3189:
Issue Type: Sub-task (was: New Feature) Parent: SPARK-3188

Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
Key: SPARK-3189 URL: https://issues.apache.org/jira/browse/SPARK-3189 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h

Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter many types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Turkey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19).
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3188:
Summary: Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates) (was: Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates))

Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
Key: SPARK-3188 URL: https://issues.apache.org/jira/browse/SPARK-3188 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h

Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter many types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Turkey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19).
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3188:
Description:
Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter many types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Tukey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19).

was: (the same description with "Tukey" misspelled as "Turkey")

Add Robust Regression Algorithm with Tukey bisquare weight function (Biweight Estimates)
Key: SPARK-3188 URL: https://issues.apache.org/jira/browse/SPARK-3188 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h
(The issue body repeats the corrected description.)
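The bisquare estimator described above can be sketched as iteratively reweighted least squares (IRLS) in plain NumPy. This is an illustration of the idea, not the MLlib code: the tuning constant c = 4.685 is the conventional 95%-efficiency value, and the MAD scale estimate with its small clamp are assumptions made for the sketch.

```python
import numpy as np

def tukey_bisquare_weights(u, c=4.685):
    """Tukey bisquare (biweight) weights for standardized residuals u."""
    # Quadratic taper inside |u| < c; residuals beyond c get zero weight,
    # so gross outliers are rejected entirely.
    return np.where(np.abs(u) < c, (1.0 - (u / c) ** 2) ** 2, 0.0)

def robust_fit(X, y, iters=20):
    """Robust line fit y ~ a + b*x via IRLS with bisquare weights."""
    Xb = np.column_stack([np.ones(len(X)), X])        # intercept column
    beta = np.linalg.lstsq(Xb, y, rcond=None)[0]      # ordinary least squares start
    for _ in range(iters):
        r = y - Xb @ beta
        # Robust scale via the median absolute deviation (MAD), clamped
        # away from zero so a near-perfect fit stays numerically stable.
        s = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-6)
        w = tukey_bisquare_weights(r / s)
        sw = np.sqrt(w)
        # Weighted least squares step: solve min || sqrt(w) (y - Xb beta) ||.
        beta = np.linalg.lstsq(sw[:, None] * Xb, sw * y, rcond=None)[0]
    return beta
```

On data with a single gross outlier, the outlier's weight shrinks to zero after a few iterations and the fit converges to the clean least-squares line.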
[jira] [Updated] (SPARK-3188) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang updated SPARK-3188:
Description:
Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter many types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Turkey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19).

was: (the earlier revision, with "produces and M-estimator" and a stray "()" before the Andersen citation)

Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
Key: SPARK-3188 URL: https://issues.apache.org/jira/browse/SPARK-3188 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h
(The issue body repeats the corrected description.)
[jira] [Created] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator
Fan Jiang created SPARK-3181:
Summary: Add Robust Regression Algorithm with Huber Estimator
Key: SPARK-3181 URL: https://issues.apache.org/jira/browse/SPARK-3181 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Fix For: 1.1.1, 1.2.0

Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter many types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. In 1973, Huber introduced M-estimation for regression, where "M" stands for maximum-likelihood type. The method is resistant to outliers in the response variable and has been widely used.

The new feature for MLlib will contain 3 new files:
/main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
/test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
/main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
and one new class HuberRobustGradient in /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
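The Huber estimator described above can be sketched as gradient descent on the Huber loss in plain NumPy. This is only an illustration of the quantity a gradient class for Huber regression would supply, not the proposed HuberRobustGradient API: delta = 1.345 is the conventional tuning constant, and the fixed-step loop is an assumption of the sketch.

```python
import numpy as np

def huber_gradient(beta, Xb, y, delta=1.345):
    """Gradient of the mean Huber loss for linear regression.

    The influence function psi clips the residual at +/-delta: the loss is
    quadratic inside [-delta, delta] and linear outside, which bounds the
    pull of any single outlier on the fit.
    """
    r = Xb @ beta - y
    psi = np.clip(r, -delta, delta)
    return Xb.T @ psi / len(y)

def huber_fit(X, y, lr=0.05, steps=4000):
    """Fit y ~ a + b*x by fixed-step gradient descent on the Huber loss."""
    Xb = np.column_stack([np.ones(len(X)), X])   # intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        beta -= lr * huber_gradient(beta, Xb, y)
    return beta
```

Because the outlier's contribution to the gradient is capped at delta, a single corrupted response shifts the fitted line only slightly, unlike ordinary least squares.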
[jira] [Created] (SPARK-3188) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
Fan Jiang created SPARK-3188:
Summary: Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
Key: SPARK-3188 URL: https://issues.apache.org/jira/browse/SPARK-3188 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Fix For: 1.1.1, 1.2.0

Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter many types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Turkey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19).
[jira] [Closed] (SPARK-3189) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
[ https://issues.apache.org/jira/browse/SPARK-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fan Jiang closed SPARK-3189.
Resolution: Duplicate

Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
Key: SPARK-3189 URL: https://issues.apache.org/jira/browse/SPARK-3189 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Labels: features Fix For: 1.1.1, 1.2.0 Original Estimate: 0h Remaining Estimate: 0h

Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter many types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Turkey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19).
[jira] [Created] (SPARK-3189) Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
Fan Jiang created SPARK-3189:
Summary: Add Robust Regression Algorithm with Turkey bisquare weight function (Biweight Estimates)
Key: SPARK-3189 URL: https://issues.apache.org/jira/browse/SPARK-3189 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.2 Reporter: Fan Jiang Priority: Critical Fix For: 1.1.1, 1.2.0

Linear least squares estimates assume the errors are normally distributed and can behave badly when the errors are heavy-tailed. In practice we encounter many types of data, so we need to include robust regression, which employs a fitting criterion that is not as vulnerable as least squares. The Turkey bisquare weight function, also referred to as the biweight function, produces an M-estimator that is more resistant to regression outliers than the Huber M-estimator (Andersen 2008: 19).