[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-26 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15555





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-25 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/15555#discussion_r84998430
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala ---
@@ -356,42 +356,25 @@ object GaussianMixture extends DefaultParamsReadable[GaussianMixture] {
  * :: Experimental ::
  * Summary of GaussianMixture.
  *
- * @param predictions  [[DataFrame]] produced by [[GaussianMixtureModel.transform()]]
- * @param predictionCol  Name for column of predicted clusters in `predictions`
- * @param probabilityCol  Name for column of predicted probability of each cluster in `predictions`
- * @param featuresCol  Name for column of features in `predictions`
- * @param k  Number of clusters
+ * @param predictions  [[DataFrame]] produced by [[GaussianMixtureModel.transform()]].
+ * @param predictionCol  Name for column of predicted clusters in `predictions`.
+ * @param probabilityCol  Name for column of predicted probability of each cluster
+ *                        in `predictions`.
+ * @param featuresCol  Name for column of features in `predictions`.
+ * @param k  Number of clusters.
  */
 @Since("2.0.0")
 @Experimental
 class GaussianMixtureSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val probabilityCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
-
-  /**
-   * Cluster centers of the transformed data.
-   */
-  @Since("2.0.0")
-  @transient lazy val cluster: DataFrame = predictions.select(predictionCol)
+    predictions: DataFrame,
+    predictionCol: String,
+    val probabilityCol: String,
--- End diff --

could do a Since tag here
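
For illustration, a minimal sketch of that suggestion, assuming `probabilityCol` keeps its original 2.0.0 since-version (only the constructor from the diff above is shown, not the committed code):

```scala
class GaussianMixtureSummary private[clustering] (
    predictions: DataFrame,
    predictionCol: String,
    @Since("2.0.0") val probabilityCol: String,  // Since tag added as suggested (assumed version)
    featuresCol: String,
    k: Int)
  extends ClusteringSummary(predictions, predictionCol, featuresCol, k)
```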





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-25 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15555#discussion_r84837836
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform()
+ * @param predictionCol  Name for column of predicted clusters in `predictions`
+ * @param featuresCol  Name for column of features in `predictions`
+ * @param k  Number of clusters
+ */
+@Experimental
--- End diff --

I'm also ambivalent about this. The reason behind my change is that some 
classes such as ```KMeansSummary``` and ```GaussianMixtureSummary``` were added 
in 2.0. If I put ```@Since("2.1.0")``` here, it doesn't look quite right, but I'm 
not sure whether it's OK. @jkbradley What's your opinion? Thanks.





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15555#discussion_r84798339
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform()
+ * @param predictionCol  Name for column of predicted clusters in `predictions`
+ * @param featuresCol  Name for column of features in `predictions`
+ * @param k  Number of clusters
+ */
+@Experimental
--- End diff --

I'm not entirely certain of the official policy for the `@Since` tags, but 
it seems better to me to put `@Since("2.1.0")` here for the class and the 
methods. It will be correct for some and at least not incorrect for 
others. I'm not positive, though.
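
As a rough sketch of where those tags would go on the new base class (annotation placement only; whether these exact tags were committed is not shown in this thread):

```scala
@Since("2.1.0")
@Experimental
class ClusteringSummary private[clustering] (
    @transient val predictions: DataFrame,
    val predictionCol: String,
    val featuresCol: String,
    val k: Int) extends Serializable {

  /** Cluster centers of the transformed data. */
  @Since("2.1.0")
  @transient lazy val cluster: DataFrame = predictions.select(predictionCol)

  // clusterSizes would carry the same @Since("2.1.0") tag.
}
```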





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15555#discussion_r84792326
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform()
--- End diff --

nit: Add a period to each line.





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15555#discussion_r84792150
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/BisectingKMeans.scala ---
@@ -297,27 +297,13 @@ object BisectingKMeans extends DefaultParamsReadable[BisectingKMeans] {
 @Since("2.1.0")
 @Experimental
 class BisectingKMeansSummary private[clustering] (
-    @Since("2.1.0") @transient val predictions: DataFrame,
-    @Since("2.1.0") val predictionCol: String,
-    @Since("2.1.0") val featuresCol: String,
-    @Since("2.1.0") val k: Int) extends Serializable {
-
-  /**
-   * Cluster centers of the transformed data.
-   */
-  @Since("2.1.0")
-  @transient lazy val cluster: DataFrame = predictions.select(predictionCol)
-
-  /**
-   * Size of (number of data points in) each cluster.
-   */
-  @Since("2.1.0")
-  lazy val clusterSizes: Array[Long] = {
-    val sizes = Array.fill[Long](k)(0)
-    cluster.groupBy(predictionCol).count().select(predictionCol, "count").collect().foreach {
-      case Row(cluster: Int, count: Long) => sizes(cluster) = count
-    }
-    sizes
-  }
-
-}
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
--- End diff --

minor: this can fit on one line like: `k: Int) extends ClusteringSummary(..., ..., ...)`
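
That is, roughly (a formatting sketch only, not the committed code):

```scala
class BisectingKMeansSummary private[clustering] (
    predictions: DataFrame,
    predictionCol: String,
    featuresCol: String,
    k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k)
```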





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-24 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/15555#discussion_r84798486
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
--- End diff --

"Summary of clustering algorithms." ? 





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-19 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15555#discussion_r84204673
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform()
+ * @param predictionCol  Name for column of predicted clusters in `predictions`
+ * @param featuresCol  Name for column of features in `predictions`
+ * @param k  Number of clusters
+ */
+@Experimental
--- End diff --

```ClusteringSummary``` will be extended by summaries that were added in 
different versions, so I think we should not add a since version here. As for the 
issue of a new file, I think ```ClusteringSummary``` is a small class, so we can 
place it here temporarily.
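
In other words, the intended shape is roughly the following, where each subclass keeps the class-level since-version of the release it was introduced in while the shared base carries none (a sketch; the exact committed annotations may differ):

```scala
@Experimental
class ClusteringSummary private[clustering] (   // no class-level @Since: its subclasses
    @transient val predictions: DataFrame,      // were introduced in different releases
    val predictionCol: String,
    val featuresCol: String,
    val k: Int) extends Serializable

@Since("2.0.0")
@Experimental
class KMeansSummary private[clustering] (
    predictions: DataFrame,
    predictionCol: String,
    featuresCol: String,
    k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k)

@Since("2.1.0")
@Experimental
class BisectingKMeansSummary private[clustering] (
    predictions: DataFrame,
    predictionCol: String,
    featuresCol: String,
    k: Int) extends ClusteringSummary(predictions, predictionCol, featuresCol, k)
```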





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-19 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/15555#discussion_r84203200
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -354,21 +354,41 @@ object KMeans extends DefaultParamsReadable[KMeans] {
 @Since("2.0.0")
 @Experimental
 class KMeansSummary private[clustering] (
-    @Since("2.0.0") @transient val predictions: DataFrame,
-    @Since("2.0.0") val predictionCol: String,
-    @Since("2.0.0") val featuresCol: String,
-    @Since("2.0.0") val k: Int) extends Serializable {
+    predictions: DataFrame,
+    predictionCol: String,
+    featuresCol: String,
+    k: Int)
+  extends ClusteringSummary (
+    predictions,
+    predictionCol,
+    featuresCol,
+    k
+  )
+
+/**
+ * :: Experimental ::
+ * Summary of clustering.
+ *
+ * @param predictions  [[DataFrame]] produced by model.transform()
+ * @param predictionCol  Name for column of predicted clusters in `predictions`
+ * @param featuresCol  Name for column of features in `predictions`
+ * @param k  Number of clusters
+ */
+@Experimental
--- End diff --

What about adding `@Since("2.1.0")` here? 
And what about creating a new Scala file named `Clustering.scala` and moving 
`ClusteringSummary` into it?





[GitHub] spark pull request #15555: [Minor][ML] Refactor clustering summary.

2016-10-19 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/15555

[Minor][ML] Refactor clustering summary.

## What changes were proposed in this pull request?
Abstract ```ClusteringSummary``` from ```KMeansSummary```, ```GaussianMixtureSummary``` and ```BisectingKMeansSummary```, and eliminate duplicated pieces of code.
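
For reference, a sketch of the extracted base class, pieced together from the diffs discussed in the review below (the `cluster` and `clusterSizes` members are the ones hoisted out of the individual summaries; the exact `@Since` annotations are debated in the review comments):

```scala
package org.apache.spark.ml.clustering

import org.apache.spark.annotation.Experimental
import org.apache.spark.sql.{DataFrame, Row}

/**
 * :: Experimental ::
 * Summary of clustering algorithms.
 */
@Experimental
class ClusteringSummary private[clustering] (
    @transient val predictions: DataFrame,
    val predictionCol: String,
    val featuresCol: String,
    val k: Int) extends Serializable {

  /** Cluster centers of the transformed data. */
  @transient lazy val cluster: DataFrame = predictions.select(predictionCol)

  /** Size of (number of data points in) each cluster. */
  lazy val clusterSizes: Array[Long] = {
    val sizes = Array.fill[Long](k)(0)
    cluster.groupBy(predictionCol).count().select(predictionCol, "count").collect().foreach {
      case Row(cluster: Int, count: Long) => sizes(cluster) = count
    }
    sizes
  }
}
```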

## How was this patch tested?
Existing tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark clustering-summary

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15555.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15555


commit 31692676e2a6df0651ff35992aa8ab3688c113b8
Author: Yanbo Liang 
Date:   2016-10-19T13:46:20Z

Refactor clustering summary.



