[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-07-12 Thread yanboliang
Github user yanboliang closed the pull request at:

https://github.com/apache/spark/pull/17117


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r105025830
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -300,6 +318,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.2.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
--- End diff --

Can you elaborate this? I don't fully understand why we can not overwrite 
setting in set method. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r105025280
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
 model
   }
 
+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  if (isSet(initialModel)) {
+val initialModelK = $(initialModel).parentModel.k
+if (initialModelK != $(k)) {
+  throw new IllegalArgumentException("The initial model's cluster 
count = " +
+s"$initialModelK, mismatched with k = $k.")
+}
+  } else {
+throw new IllegalArgumentException("Users must set param 
initialModel if you choose " +
+  "'initialModel' as the initialization algorithm.")
+  }
+} else {
+  if (isSet(initialModel)) {
+logWarning(s"Param initialModel will take no effect when initMode 
is $initMode.")
+  }
+}
+  }
+
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+assertInitialModelValid()
--- End diff --

`transformSchema ` is called in `transform` method, and `model.transform` 
is called in computing the summary. I think we should fail it earlier instead 
of checking it in the end. Also, it's implicit that it's being checked when 
computing summary. We should explicitly check it. 

If we have small logic in checking, I'll have those checking code in `fit` 
method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r105002771
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -37,9 +37,9 @@ import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.VersionUtils.majorVersion
 
 /**
- * Common params for KMeans and KMeansModel
+ * Common params for KMeans and KMeansModel.
  */
-private[clustering] trait KMeansParams extends Params with HasMaxIter with 
HasFeaturesCol
+private[clustering] trait KMeansModelParams extends Params with HasMaxIter 
with HasFeaturesCol
   with HasSeed with HasPredictionCol with HasTol {
--- End diff --

Fair enough.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r105002617
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("training with initial model") {
+val kmeans = new KMeans().setK(2).setSeed(1)
+val model1 = kmeans.fit(rData)
+val model2 = 
kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
+model2.clusterCenters.zip(model1.clusterCenters)
+  .foreach { case (center2, center1) => assert(center2 ~== center1 
absTol 1E-8) }
+  }
+
+  test("training with initial model, error cases") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+// Sets initMode with 'initialModel', but does not specify initial 
model.
+intercept[IllegalArgumentException] {
--- End diff --

@sethah +1 on the behavior you purpose. The only thing I would like to add 
on is `setK` should throw `IllegalArgumentException`.

```scala
// initialModel sets k and init mode
assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
assert(km.getK === initialK)
assert(km.getInitialModel.getK === initialK)

// setting k will throw exception.
intercept[IllegalArgumentException] {
  km.setK(initialK + 1)
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-08 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104924893
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("training with initial model") {
+val kmeans = new KMeans().setK(2).setSeed(1)
+val model1 = kmeans.fit(rData)
+val model2 = 
kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
+model2.clusterCenters.zip(model1.clusterCenters)
+  .foreach { case (center2, center1) => assert(center2 ~== center1 
absTol 1E-8) }
+  }
+
+  test("training with initial model, error cases") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+// Sets initMode with 'initialModel', but does not specify initial 
model.
+intercept[IllegalArgumentException] {
--- End diff --

I think we can not override param in any ```set***``` function, see reason 
at [here](https://github.com/apache/spark/pull/17117#discussion_r104922364). 
This is why I didn't follow the idea of previous PR. If we prefer the previous 
way, we must to handle override in ```fit``` function, I'll update following 
this idea tomorrow.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-08 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104922364
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -300,6 +318,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.2.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
--- End diff --

I think we can not override param in any ```set***``` function, since ML 
pipeline API supports other param setting method like:
```
def fit(dataset: Dataset[_], paramMap: ParamMap): M = {
copy(paramMap).fit(dataset)
  }
```
I think the only way to override param is at the start of ```fit``` 
function.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104830238
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
 model
   }
 
+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  if (isSet(initialModel)) {
+val initialModelK = $(initialModel).parentModel.k
+if (initialModelK != $(k)) {
+  throw new IllegalArgumentException("The initial model's cluster 
count = " +
+s"$initialModelK, mismatched with k = $k.")
+}
+  } else {
+throw new IllegalArgumentException("Users must set param 
initialModel if you choose " +
+  "'initialModel' as the initialization algorithm.")
+  }
+} else {
+  if (isSet(initialModel)) {
+logWarning(s"Param initialModel will take no effect when initMode 
is $initMode.")
+  }
+}
+  }
+
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+assertInitialModelValid()
--- End diff --

`transformSchema` will be called in the train method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104830001
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -37,9 +37,9 @@ import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.VersionUtils.majorVersion
 
 /**
- * Common params for KMeans and KMeansModel
+ * Common params for KMeans and KMeansModel.
  */
-private[clustering] trait KMeansParams extends Params with HasMaxIter with 
HasFeaturesCol
+private[clustering] trait KMeansModelParams extends Params with HasMaxIter 
with HasFeaturesCol
   with HasSeed with HasPredictionCol with HasTol {
--- End diff --

Yeah, we decided in the previous discussion to not store the initial model 
in the produced model, for several reasons, including model serialization.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104829816
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("training with initial model") {
+val kmeans = new KMeans().setK(2).setSeed(1)
+val model1 = kmeans.fit(rData)
+val model2 = 
kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
+model2.clusterCenters.zip(model1.clusterCenters)
+  .foreach { case (center2, center1) => assert(center2 ~== center1 
absTol 1E-8) }
+  }
+
+  test("training with initial model, error cases") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+// Sets initMode with 'initialModel', but does not specify initial 
model.
+intercept[IllegalArgumentException] {
--- End diff --

Yeah, I think the general idea laid out in the previous PR is preferable. 
As you and DB say, if you'd like to make the second `.setInitMode` overwrite 
the initial model then that is fine. With that change, then the behavior I 
would prefer is:

scala
  test("params") {
val initialK = 3
val initialEstimator = new KMeans()
  .setK(initialK)
val initialModel = initialEstimator.fit(dataset)

val km = new KMeans()
  .setK(initialK + 1)
  .setInitMode(MLlibKMeans.RANDOM)

assert(km.getK === initialK + 1)
assert(km.getInitMode === MLlibKMeans.RANDOM)

km.setInitialModel(initialModel)

// initialModel sets k and init mode
assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
assert(km.getK === initialK)
assert(km.getInitialModel.getK === initialK)

// setting k is ignored
km.setK(initialK + 1)
assert(km.getK === initialK)

// this should work since we already set initialModel
km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)

// changing initMode clears the initial model
km.setInitMode(MLlibKMeans.RANDOM)
assert(km.getInitMode === MLlibKMeans.RANDOM)
assert(!km.isSet(km.initialModel))
// k is retained from initial model
assert(km.getK === initialK)
// now k can be set
km.setK(initialK + 1)
assert(km.getK === initialK + 1)

// kmeans should throw an error since we shouldn't be allowed to set 
init mode to "initialModel"
intercept[IllegalArgumentException] {
  km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)
}
  }



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104820054
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
 model
   }
 
+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  if (isSet(initialModel)) {
+val initialModelK = $(initialModel).parentModel.k
+if (initialModelK != $(k)) {
+  throw new IllegalArgumentException("The initial model's cluster 
count = " +
+s"$initialModelK, mismatched with k = $k.")
+}
+  } else {
+throw new IllegalArgumentException("Users must set param 
initialModel if you choose " +
+  "'initialModel' as the initialization algorithm.")
+  }
+} else {
+  if (isSet(initialModel)) {
--- End diff --

Also, this is not needed if we do the overwriting work in `setInitialModel`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104819439
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
 model
   }
 
+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  if (isSet(initialModel)) {
+val initialModelK = $(initialModel).parentModel.k
+if (initialModelK != $(k)) {
+  throw new IllegalArgumentException("The initial model's cluster 
count = " +
+s"$initialModelK, mismatched with k = $k.")
+}
+  } else {
+throw new IllegalArgumentException("Users must set param 
initialModel if you choose " +
+  "'initialModel' as the initialization algorithm.")
+  }
+} else {
+  if (isSet(initialModel)) {
+logWarning(s"Param initialModel will take no effect when initMode 
is $initMode.")
+  }
+}
+  }
+
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+assertInitialModelValid()
--- End diff --

Why this is not checked in `fit`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104821332
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
 model
   }
 
+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
--- End diff --

I think with overwriting above, the only thing we need to check will be 

```scala
if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL && 
!isSet(initialModel)) {
  throw new IllegalArgumentException("Users must set param initialModel if 
you choose " +
   "'initialModel' as the initialization.")
}
```

we can just have it in the body of `fit` method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104803180
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -37,9 +37,9 @@ import org.apache.spark.storage.StorageLevel
 import org.apache.spark.util.VersionUtils.majorVersion
 
 /**
- * Common params for KMeans and KMeansModel
+ * Common params for KMeans and KMeansModel.
  */
-private[clustering] trait KMeansParams extends Params with HasMaxIter with 
HasFeaturesCol
+private[clustering] trait KMeansModelParams extends Params with HasMaxIter 
with HasFeaturesCol
   with HasSeed with HasPredictionCol with HasTol {
--- End diff --

Now, `KMeansModel` mixes `KMeansModelParams`, does it mean in the model 
level, we can not get the information of the `initiModel`? Also, in the model, 
why do we need to mix the seed in?  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104819810
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -335,17 +369,70 @@ class KMeans @Since("1.5.0") (
 model
   }
 
+  /**
+   * Check validity for interactions between parameters.
+   */
+  private def assertInitialModelValid(): Unit = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  if (isSet(initialModel)) {
+val initialModelK = $(initialModel).parentModel.k
+if (initialModelK != $(k)) {
--- End diff --

I don't think this check is needed if we overwrite `k` when `initialModel` 
is set.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104815453
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -300,6 +318,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.2.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
--- End diff --

How about
```scala
def setInitialModel(value: KMeansModel): this.type = {
  if (getK ~= value.getK) {
log the warning
set(k, value)
  }
  set(initMode, MLlibKMeans.K_MEANS_INITIAL_MODEL) // We may log, but I 
don't really care for this one.
  set(initialModel, value)
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104800325
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("training with initial model") {
+val kmeans = new KMeans().setK(2).setSeed(1)
+val model1 = kmeans.fit(rData)
+val model2 = 
kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
+model2.clusterCenters.zip(model1.clusterCenters)
+  .foreach { case (center2, center1) => assert(center2 ~== center1 
absTol 1E-8) }
+  }
+
+  test("training with initial model, error cases") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+// Sets initMode with 'initialModel', but does not specify initial 
model.
+intercept[IllegalArgumentException] {
--- End diff --

My 2cents is the latter configuration should be able to overwrite the 
former settings and related settings with warning messages. 

In your example, when `kmeans.setInitMode("k-means||")` is performed, the 
first `setInitialModel` should be ignored with warning message.

 Even we do `setK(k =3)`, and later we do `.setInitialModel(initialModel)`, 
we should ignore the first `setK(k =3)` with warning. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104170315
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -111,12 +113,20 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 val estimator2 = testDefaultReadWrite(estimator)
 testParams.foreach { case (p, v) =>
   val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+  if (param.name == "initialModel") {
+// Estimator's `initialModel` has same type as the model produced 
by this estimator.
--- End diff --

Let's merged #17151 firstly, then I will update this accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104170135
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -182,6 +224,7 @@ object KMeansSuite {
 "predictionCol" -> "myPrediction",
 "k" -> 3,
 "maxIter" -> 2,
-"tol" -> 0.01
+"tol" -> 0.01,
+"initialModel" -> generateRandomKMeansModel(3, 3)
--- End diff --

Agree, I sent #17151 and feel free to comment it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104168600
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("training with initial model") {
+val kmeans = new KMeans().setK(2).setSeed(1)
+val model1 = kmeans.fit(rData)
+val model2 = 
kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
+model2.clusterCenters.zip(model1.clusterCenters)
+  .foreach { case (center2, center1) => assert(center2 ~== center1 
absTol 1E-8) }
+  }
+
+  test("training with initial model, error cases") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+// Sets initMode with 'initialModel', but does not specify initial 
model.
+intercept[IllegalArgumentException] {
--- End diff --

I disagree the way in the other PR, the reason is:
In that PR, if users ```setInitialModel(model)```, it will call 
```set(initMode, "initialModel")```.  Take the following scenarios:
```
val kmeans = new KMeans().setInitialModel(initialModel) // Users want to 
start with an initial model.
val model1 = kmeans.fit(dataset) // The model was fitted by warm start.
// Then they want to try another starting way, for example, starting with 
"k-means||".
val model2 = kmeans.setInitMode("k-means||") // But in #9 's code 
route, it will still starts with initial model. Though we can change this by 
modify the code in mllib.clustering.KMeans, but I think it's confused. 
```

Another scenario is users set ```initialModel``` by mistake, but they still 
want to start with ```random``` mode, they will confused what happened. 
So I'm more prefer to let users set ```initMode``` to ```initialModel``` 
explicitly, and set ```initialModel``` to corresponding model. Otherwise, we 
just throw exceptions to let users correct their setting. I'm OK to add a test 
for the first case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104165327
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -22,22 +22,28 @@ import scala.util.Random
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.param.ParamMap
-import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
-import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, 
MLTestingUtils}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
 
 private[clustering] case class TestRow(features: Vector)
 
 class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
 
+  import testImplicits._
+
   final val k = 5
   @transient var dataset: Dataset[_] = _
+  @transient var rData: Dataset[_] = _
 
   override def beforeAll(): Unit = {
 super.beforeAll()
 
 dataset = KMeansSuite.generateKMeansData(spark, 50, 3, k)
+rData = 
GaussianMixtureSuite.rData.map(GaussianMixtureSuite.FeatureData).toDF()
--- End diff --

Updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104165225
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") (
 
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  if (isSet(initialModel)) {
+val initialModelK = $(initialModel).parentModel.k
+if (initialModelK != $(k)) {
+  throw new IllegalArgumentException("The initial model's cluster 
count = " +
+s"$initialModelK, mismatched with k = $k.")
+}
+  } else {
+throw new IllegalArgumentException("Users must set param 
initialModel if you choose " +
+  "'initialModel' as the initialization algorithm.")
+  }
+} else {
+  if (isSet(initialModel)) {
+logWarning(s"Param initialModel will take no effect when initMode 
is $initMode.")
+  }
+}
 validateAndTransformSchema(schema)
   }
+
+  @Since("2.2.0")
+  override def write: MLWriter = new KMeans.KMeansWriter(this)
 }
 
 @Since("1.6.0")
-object KMeans extends DefaultParamsReadable[KMeans] {
+object KMeans extends MLReadable[KMeans] {
 
   @Since("1.6.0")
   override def load(path: String): KMeans = super.load(path)
+
+  @Since("2.2.0")
+  override def read: MLReader[KMeans] = new KMeansReader
+
+  /** [[MLWriter]] instance for [[KMeans]] */
+  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  DefaultParamsWriter.saveInitialModel(instance, path)
+  DefaultParamsWriter.saveMetadata(instance, path, sc)
+}
+  }
+
+  private class KMeansReader extends MLReader[KMeans] {
+
+override def load(path: String): KMeans = {
+  val metadata = DefaultParamsReader.loadMetadata(path, sc, 
classOf[KMeans].getName)
+  val instance = new KMeans(metadata.uid)
+
+  DefaultParamsReader.getAndSetParams(instance, metadata)
+  DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match {
--- End diff --

Yeah, your suggestion can work well, but I'm more prefer to my way, since 
it's more clear for developer to understand what happened.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104164797
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -253,7 +255,18 @@ object KMeansModel extends MLReadable[KMeansModel] {
 @Since("1.5.0")
 class KMeans @Since("1.5.0") (
 @Since("1.5.0") override val uid: String)
-  extends Estimator[KMeansModel] with KMeansParams with 
DefaultParamsWritable {
+  extends Estimator[KMeansModel]
+with KMeansParams with HasInitialModel[KMeansModel] with MLWritable {
+
+  /**
+   * A KMeansModel to use for warm start.
+   * Note the cluster count of initial model must be equal with [[k]],
+   * otherwise, throws IllegalArgumentException.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val initialModel: Param[KMeansModel] =
--- End diff --

Make sense, I refactored them as ```ALSParams``` and ```ALSModelParams```.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104084997
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -253,7 +255,18 @@ object KMeansModel extends MLReadable[KMeansModel] {
 @Since("1.5.0")
 class KMeans @Since("1.5.0") (
 @Since("1.5.0") override val uid: String)
-  extends Estimator[KMeansModel] with KMeansParams with 
DefaultParamsWritable {
+  extends Estimator[KMeansModel]
+with KMeansParams with HasInitialModel[KMeansModel] with MLWritable {
+
+  /**
+   * A KMeansModel to use for warm start.
+   * Note the cluster count of initial model must be equal with [[k]],
+   * otherwise, throws IllegalArgumentException.
+   * @group param
+   */
+  @Since("2.2.0")
+  final val initialModel: Param[KMeansModel] =
--- End diff --

I prefer doing this in the same way that ALS does it. By having separate 
param traits `KMeansParams extends KMeansModelParams with HasInitialModel`. 
It's more explicit since now our `KMeans` class would have extra params on top 
of `KMeansParams`. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104084877
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -123,7 +126,8 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
+val tmpParent: MLlibKMeansModel = parentModel
--- End diff --

Can we change it to `localParent`? That's the convention we have taken 
elsewhere when we want to get a separate pointer to a class member.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104095197
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -182,6 +224,7 @@ object KMeansSuite {
 "predictionCol" -> "myPrediction",
 "k" -> 3,
 "maxIter" -> 2,
-"tol" -> 0.01
+"tol" -> 0.01,
+"initialModel" -> generateRandomKMeansModel(3, 3)
--- End diff --

It would be nicer to change `testEstimatorAndModelReadWrite` to accept 
`estimatorTestParams` and `modelTestParams` separately so we don't have to hard 
code certain params to be filtered out inside that method. Though we wouldn't 
have to that in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104091867
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -418,6 +418,8 @@ object KMeans {
   val RANDOM = "random"
   @Since("0.8.0")
   val K_MEANS_PARALLEL = "k-means||"
+  @Since("2.2.0")
+  val K_MEANS_INITIAL_MODEL = "initialModel"
--- End diff --

It can be private I think. That, or we should update the valid options for 
the `setInitializationMode` doc. But I think it's best to make it private.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104092158
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -22,22 +22,28 @@ import scala.util.Random
 import org.apache.spark.SparkFunSuite
 import org.apache.spark.ml.linalg.{Vector, Vectors}
 import org.apache.spark.ml.param.ParamMap
-import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTestingUtils}
-import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable, 
MLTestingUtils}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
 
 private[clustering] case class TestRow(features: Vector)
 
 class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
 
+  import testImplicits._
+
   final val k = 5
   @transient var dataset: Dataset[_] = _
+  @transient var rData: Dataset[_] = _
 
   override def beforeAll(): Unit = {
 super.beforeAll()
 
 dataset = KMeansSuite.generateKMeansData(spark, 50, 3, k)
+rData = 
GaussianMixtureSuite.rData.map(GaussianMixtureSuite.FeatureData).toDF()
--- End diff --

`GaussianMixtureSuite.rData.map(Tuple1.apply).toDF()` ? Mapping the dummy 
case class from another test suite is less clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104090529
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") (
 
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  if (isSet(initialModel)) {
+val initialModelK = $(initialModel).parentModel.k
+if (initialModelK != $(k)) {
+  throw new IllegalArgumentException("The initial model's cluster 
count = " +
+s"$initialModelK, mismatched with k = $k.")
+}
+  } else {
+throw new IllegalArgumentException("Users must set param 
initialModel if you choose " +
+  "'initialModel' as the initialization algorithm.")
+  }
+} else {
+  if (isSet(initialModel)) {
+logWarning(s"Param initialModel will take no effect when initMode 
is $initMode.")
+  }
+}
 validateAndTransformSchema(schema)
   }
+
+  @Since("2.2.0")
+  override def write: MLWriter = new KMeans.KMeansWriter(this)
 }
 
 @Since("1.6.0")
-object KMeans extends DefaultParamsReadable[KMeans] {
+object KMeans extends MLReadable[KMeans] {
 
   @Since("1.6.0")
   override def load(path: String): KMeans = super.load(path)
+
+  @Since("2.2.0")
+  override def read: MLReader[KMeans] = new KMeansReader
+
+  /** [[MLWriter]] instance for [[KMeans]] */
+  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  DefaultParamsWriter.saveInitialModel(instance, path)
+  DefaultParamsWriter.saveMetadata(instance, path, sc)
+}
+  }
+
+  private class KMeansReader extends MLReader[KMeans] {
+
+override def load(path: String): KMeans = {
+  val metadata = DefaultParamsReader.loadMetadata(path, sc, 
classOf[KMeans].getName)
+  val instance = new KMeans(metadata.uid)
+
+  DefaultParamsReader.getAndSetParams(instance, metadata)
+  DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match {
--- End diff --

This can be done as:

scala
 DefaultParamsReader.loadInitialModel[KMeansModel](path, 
sc).foreach(instance.setInitialModel)


I think it's nicer, but I'm not sure if there is a universal preference for 
side effects with options in Spark, so I'll leave it to you to decide.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104094526
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -111,12 +113,20 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 val estimator2 = testDefaultReadWrite(estimator)
 testParams.foreach { case (p, v) =>
   val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+  if (param.name == "initialModel") {
+// Estimator's `initialModel` has same type as the model produced 
by this estimator.
--- End diff --

This is an assumption, and is not enforced by the compiler. There is 
nothing in the trait `HasInitialModel[T <: Model[T]]`that prevents us from 
creating an estimator with an initialModel type that is not the same type of 
the model that the estimator produces. We can discuss whether or not we'd like 
to enforce this assumption, but if we do not then this method should probably 
be changed. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104090273
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") (
 
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
--- End diff --

It might be nice to factor this logic out into a method like 
`assertInitialModelValid` or something similar. Actually, we could add an 
abstract method to the `HasInitialModel` trait that each subclass can implement 
differently. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-02 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r104092773
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -152,6 +158,35 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("training with initial model") {
+val kmeans = new KMeans().setK(2).setSeed(1)
+val model1 = kmeans.fit(rData)
+val model2 = 
kmeans.setInitMode("initialModel").setInitialModel(model1).fit(rData)
+model2.clusterCenters.zip(model1.clusterCenters)
+  .foreach { case (center2, center1) => assert(center2 ~== center1 
absTol 1E-8) }
+  }
+
+  test("training with initial model, error cases") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+// Sets initMode with 'initialModel', but does not specify initial 
model.
+intercept[IllegalArgumentException] {
--- End diff --

I'm not sure I agree with the behavior. We discussed it quite a bit in the 
other PR - maybe you can summarize the reason you went away from the previous 
decisions? At any rate, it seems currently we have the following behavior:

| k  | initMode   | initialModel  | result |
--- | --- | --- | ---
| ?| not set | set | ignore InitialModel |
| ?| set | not set | error |
|  set (k != initialModelK)   | set | set | error |
|  set (k == initialModelK)   | set | set | use initialModel |

If we keep this behavior, we should add a test for the first case. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17117#discussion_r103675288
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -337,15 +366,61 @@ class KMeans @Since("1.5.0") (
 
   @Since("1.5.0")
   override def transformSchema(schema: StructType): StructType = {
+if ($(initMode) == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  if (isSet(initialModel)) {
+val initialModelK = $(initialModel).parentModel.k
+if (initialModelK != $(k)) {
+  throw new IllegalArgumentException("The initial model's cluster 
count = " +
+s"$initialModelK, mismatched with k = $k.")
+}
+  } else {
+throw new IllegalArgumentException("Users must set param 
initialModel if you choose " +
+  "'initialModel' as the initialization algorithm.")
+  }
+} else {
+  if (isSet(initialModel)) {
+logWarning(s"Param initialModel will take no effect when initMode 
is $initMode.")
+  }
+}
 validateAndTransformSchema(schema)
   }
+
+  @Since("2.2.0")
+  override def write: MLWriter = new KMeans.KMeansWriter(this)
 }
 
 @Since("1.6.0")
-object KMeans extends DefaultParamsReadable[KMeans] {
+object KMeans extends MLReadable[KMeans] {
 
   @Since("1.6.0")
   override def load(path: String): KMeans = super.load(path)
+
+  @Since("2.2.0")
+  override def read: MLReader[KMeans] = new KMeansReader
+
+  /** [[MLWriter]] instance for [[KMeans]] */
+  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
+
+override protected def saveImpl(path: String): Unit = {
+  DefaultParamsWriter.saveInitialModel(instance, path)
+  DefaultParamsWriter.saveMetadata(instance, path, sc)
+}
--- End diff --

I was trying to move ```saveInitialModel``` into ```saveMetadata``` and 
making it more succinct. We can do this for ```MLWriter```, but it's hard for 
```MLReader[T]```. Since we need to explicitly pass the type of 
```initialModel``` as well, so we need refactor ```MLReader[T]``` to 
```MLReader[T, M]```. However, I think lots of estimators/transformers will not 
use ```initialModel```, so the extra type ```[M]``` does not make sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17117: [SPARK-10780][ML] Support initial model for KMean...

2017-03-01 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/17117

[SPARK-10780][ML] Support initial model for KMeans.

## What changes were proposed in this pull request?
Support initial model for ```KMeans```.

## How was this patch tested?
Add unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-10780

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17117.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17117


commit ddd8d86b8a8fca89705a438bc6336e589fa27c8b
Author: Yanbo Liang 
Date:   2017-03-01T09:29:12Z

Support initial model for KMeans.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org