[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2017-05-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88725427
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf}
 import org.apache.spark.sql.types.{IntegerType, StructType}
 
 /**
- * Common params for KMeans and KMeansModel
+ * Params for KMeans
+ */
+
+private[clustering] trait KMeansInitialModelParams extends 
HasInitialModel[KMeansModel] {
+  /**
+   * Param for KMeansModel to use for warm start.
+   * Whenever initialModel is set:
+   *   1. the initialModel k will override the param k;
+   *   2. the param initMode is set to initialModel and manually set is 
ignored;
+   *   3. other params are untouched.
+   * @group param
+   */
+  final val initialModel: Param[KMeansModel] =
--- End diff --

`override final val`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88724978
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf}
 import org.apache.spark.sql.types.{IntegerType, StructType}
 
 /**
- * Common params for KMeans and KMeansModel
+ * Params for KMeans
+ */
+
+private[clustering] trait KMeansInitialModelParams extends 
HasInitialModel[KMeansModel] {
+  /**
+   * Param for KMeansModel to use for warm start.
+   * Whenever initialModel is set:
+   *   1. the initialModel k will override the param k;
+   *   2. the param initMode is set to initialModel and manually set is 
ignored;
--- End diff --

2. the param initMode is set to "initialModel" and manually setting 
initMode will be ignored

nit: Let's just remove the punctuation from the numbered list


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88714626
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -124,7 +147,8 @@ class KMeansModel private[ml] (
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
-val predictUDF = udf((vector: Vector) => predict(vector))
+val tmpParent: MLlibKMeansModel = parentModel
--- End diff --

maybe a comment would be useful? `// avoid encapsulating the entire model 
in the closure`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88713108
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -414,6 +414,8 @@ object KMeans {
   val RANDOM = "random"
   @Since("0.8.0")
   val K_MEANS_PARALLEL = "k-means||"
+  @Since("2.1.0")
+  val K_MEANS_INITIAL_MODEL = "initialModel"
--- End diff --

Does it need to be public? This only serves a purpose when used with ML I 
think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88722547
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -145,18 +150,67 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+assert(oneMoreIterModel.getK === k)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model") {
--- End diff --

So, the behavior is getting a bit confusing because we now have three 
params which are intertwined.  For that reason, we should be very thorough on 
the tests. With my understanding of the behavior we decided on, the following 
tests should all pass. Can you tell me if it looks right to you?:

scala
test("initialModel params") {
val initialK = 3
val initialEstimator = new KMeans()
  .setK(initialK)
val initialModel = initialEstimator.fit(dataset)

val km = new KMeans()
  .setK(initialK + 1)
  .setInitialModel(initialModel)

// intialModel sets k and init mode
assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)
assert(km.getK === initialK)
assert(km.getInitialModel.getK === initialK)

// setting k is ignored
km.setK(initialK + 1)
assert(km.getK === initialK)

// this should work since we already set initialModel
km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)

// this is ignored because initialModel is set
km.setInitMode(MLlibKMeans.RANDOM)
assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL)

km.clear(km.initialModel)
// kmeans now accepts init mode
km.setInitMode(MLlibKMeans.RANDOM)
assert(km.getInitMode === MLlibKMeans.RANDOM)
// kmeans should throw an error since we shouldn't be allowed to set 
init mode to "initialModel"
intercept[IllegalArgumentException] {
  km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL)
}
}



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88713359
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -284,11 +309,26 @@ class KMeans @Since("1.5.0") (
 
   /** @group setParam */
   @Since("1.5.0")
-  def setK(value: Int): this.type = set(k, value)
+  def setK(value: Int): this.type = {
+if (isSet(initialModel)) {
+  logWarning("initialModel is set, so k will be ignored. Clear 
initialModel first.")
+  this
+} else {
+  set(k, value)
+}
+  }
 
   /** @group expertSetParam */
   @Since("1.5.0")
-  def setInitMode(value: String): this.type = set(initMode, value)
+  def setInitMode(value: String): this.type = {
+if (isSet(initialModel)) {
+  logWarning(s"initialModel is set, so initMode will be ignored. Clear 
initialModel first.")
+}
+if (value == MLlibKMeans.K_MEANS_INITIAL_MODEL) {
+  logWarning(s"initMode of $value is not supported here, please use 
setInitialModel.")
--- End diff --

From the discussion, I think we decided to throw an error for 
`setInitMode("initialModel")` if initialModel wasn't already set. If 
initialModel has been set, then we'd just update the initMode as normal.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88715322
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -306,6 +346,25 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = {
+val kOfInitialModel = value.parentModel.clusterCenters.length
+if (isSet(k)) {
+  if ($(k) != kOfInitialModel) {
+val previousK = $(k)
+set(k, kOfInitialModel)
+logWarning(s"Param K is set to $kOfInitialModel by the 
initialModel." +
--- End diff --

nit: Maybe `s"Param k was changed from $previousK to $kOfInitialModel to 
match the initialModel"`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88725396
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf}
 import org.apache.spark.sql.types.{IntegerType, StructType}
 
 /**
- * Common params for KMeans and KMeansModel
+ * Params for KMeans
+ */
+
+private[clustering] trait KMeansInitialModelParams extends 
HasInitialModel[KMeansModel] {
--- End diff --

If we follow the convention in ALS, then we should have `KMeansModelParams` 
and `KMeansParams extends KMeansModelParams with ...`. I think it would be good 
to do the same here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-11-18 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r88713635
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -284,11 +309,26 @@ class KMeans @Since("1.5.0") (
 
   /** @group setParam */
   @Since("1.5.0")
-  def setK(value: Int): this.type = set(k, value)
+  def setK(value: Int): this.type = {
+if (isSet(initialModel)) {
+  logWarning("initialModel is set, so k will be ignored. Clear 
initialModel first.")
+  this
+} else {
+  set(k, value)
+}
+  }
 
   /** @group expertSetParam */
   @Since("1.5.0")
-  def setInitMode(value: String): this.type = set(initMode, value)
+  def setInitMode(value: String): this.type = {
+if (isSet(initialModel)) {
+  logWarning(s"initialModel is set, so initMode will be ignored. Clear 
initialModel first.")
--- End diff --

We say it will be ignored, but then still set it below. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r84731896
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -320,6 +368,23 @@ class KMeans @Since("1.5.0") (
   .setMaxIterations($(maxIter))
   .setSeed($(seed))
   .setEpsilon($(tol))
+
+if (isDefined(initialModel)) {
+  // Check that the feature dimensions are equal
+  val dimOfData = rdd.first().size
+  val dimOfInitialModel = $(initialModel).clusterCenters.head.size
+  require(dimOfData == dimOfInitialModel,
+s"mismatched dimension, $dimOfData in data while 
$dimOfInitialModel in the initial model.")
+
+  // Check that the number of clusters are equal
+  val kOfInitialModel = 
$(initialModel).parentModel.clusterCenters.length
+  require(kOfInitialModel == $(k),
--- End diff --

I'd recommend that this log a warning instead of causing a failure.  If we 
use CrossValidator to select amongst initial models, then 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83764721
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -252,8 +254,10 @@ object KMeansModel extends MLReadable[KMeansModel] {
   }
   val model = new KMeansModel(metadata.uid, new 
MLlibKMeansModel(clusterCenters))
   DefaultParamsReader.getAndSetParams(model, metadata)
-  DefaultParamsReader.loadInitialModel[KMeansModel](path, sc)
-.foreach(v => model.set(model.initialModel, v))
+  DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match {
+case Success(v) => model.set(model.initialModel, v)
+case Failure(e) => if (!e.isInstanceOf[InvalidInputException]) 
throw e
--- End diff --

@MLnick Is it good to ignore only `InvalidInputException` while keep other 
exceptions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83719858
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -81,6 +81,13 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)
 
   /**
+   * Param for KMeansModel to use for warm start.
--- End diff --

Actually, we also need to be clear that the `initMode` is entirely ignored 
when setting an initial model. Let's add it to the doc of the `initMode` param


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83703876
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -446,6 +459,11 @@ private[ml] object DefaultParamsReader {
 val cls = Utils.classForName(metadata.className)
 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
   }
+
+  def loadInitialModel[M <: Model[M]](path: String, sc: SparkContext): 
Try[M] = {
+val initialModelPath = new Path(path, "initialModel").toString
+Try(loadParamsInstance[M](initialModelPath, sc))
--- End diff --

Also, this is not a critical path, so I think the overhead is okay.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83693443
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -81,6 +83,14 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)
 
   /**
+   * Param for KMeansModel to use for warm start.
+   * Whenever initialModel is set, the initialModel k will override the 
param k.
--- End diff --

I think we should also be clear that other params won't be overridden.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83694799
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -446,6 +459,11 @@ private[ml] object DefaultParamsReader {
 val cls = Utils.classForName(metadata.className)
 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
   }
+
+  def loadInitialModel[M <: Model[M]](path: String, sc: SparkContext): 
Try[M] = {
+val initialModelPath = new Path(path, "initialModel").toString
+Try(loadParamsInstance[M](initialModelPath, sc))
--- End diff --

I don't think the overhead is that large, and if there is no model then it 
should fail fast as the directory won't exist.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83694228
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
--- End diff --

I don't feel really strongly about it. But I think it is not a critical 
failure, so does not really require a hard stop of an exception. Since we make 
it clear in the doc that initial model `k` takes precedence, if a user sets a 
different `k` anyway we can still proceed, just let them know that we are 
ignoring their setting.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83694961
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -333,13 +389,45 @@ class KMeans @Since("1.5.0") (
   override def transformSchema(schema: StructType): StructType = {
 validateAndTransformSchema(schema)
   }
+
+  @Since("2.1.0")
+  override def write: MLWriter = new KMeans.KMeansWriter(this)
 }
 
 @Since("1.6.0")
 object KMeans extends DefaultParamsReadable[KMeans] {
 
+  // TODO: [SPARK-17784]: Add a fromCenters method
+
   @Since("1.6.0")
   override def load(path: String): KMeans = super.load(path)
+
+  @Since("1.6.0")
+  override def read: MLReader[KMeans] = new KMeansReader
+
+  /** [[MLWriter]] instance for [[KMeans]] */
+  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
+override protected def saveImpl(path: String): Unit = {
+  DefaultParamsWriter.saveInitialModel(instance, path)
+  DefaultParamsWriter.saveMetadata(instance, path, sc)
+}
+  }
+
+  private class KMeansReader extends MLReader[KMeans] {
+
+/** Checked against metadata when loading estimator */
+private val className = classOf[KMeans].getName
+
+override def load(path: String): KMeans = {
+  val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+  val instance = new KMeans(metadata.uid)
+
+  DefaultParamsReader.getAndSetParams(instance, metadata)
+  DefaultParamsReader.loadInitialModel[KMeansModel](path, sc)
+.foreach(v => instance.setInitialModel(v))
--- End diff --

Here I was thinking more along the lines that we could explicitly ignore 
the error we expect. So if the model load fails for the reason that the 
"initialModel" directory is empty, then we can ignore (that's basically the 
only case). Otherwise, it may be a legitimate failure to load the model (say 
some future load incompatibility or bug, or some other issue with loading an 
initial model that is there) and we will just swallow it without the user 
knowing there is a problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83695311
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter {
   paramMap: Option[JValue] = None): String = {
 val uid = instance.uid
 val cls = instance.getClass.getName
-val params = 
instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
+val params = instance.extractParamMap().toSeq
+  .filter(_.param.name != 
"initialModel").asInstanceOf[Seq[ParamPair[Any]]]
--- End diff --

Fair enough


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83662587
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -17,6 +17,8 @@
 
 package org.apache.spark.ml.clustering
 
+import scala.util.{Failure, Success}
--- End diff --

please revert these


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83600176
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
--- End diff --

My personally preference is throwing an exception to make it clear for 
users; but I don't have strong opinion about this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-17 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83599465
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
--- End diff --

Agree with @sethah on this - initial model should take precedence - 
essentially ignoring any `setK` call whether before or after setting initial 
model.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-16 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83579367
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,67 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+assert(oneMoreIterModel.getK === k)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model") {
+val kmeans = new KMeans().setK(5)
+val testNewK = 10
+val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK)
+assert(kmeans.setInitialModel(randomModel).getK === testNewK)
+  }
+
+  test("Ignore k if initialModel is set") {
+val kmeans = new KMeans()
+
+val m1 = KMeansSuite.generateRandomKMeansModel(dim, k)
--- End diff --

minor: call it `initialModel` instead?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-16 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83579289
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -320,6 +362,23 @@ class KMeans @Since("1.5.0") (
   .setMaxIterations($(maxIter))
   .setSeed($(seed))
   .setEpsilon($(tol))
+
+if (isDefined(initialModel)) {
+  // Check the equal of dimension
--- End diff --

very minor, but this doesn't make sense. Maybe `// Check that the feature 
dimensions are equal`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-16 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83579235
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -241,6 +252,12 @@ object KMeansModel extends MLReadable[KMeansModel] {
   }
   val model = new KMeansModel(metadata.uid, new 
MLlibKMeansModel(clusterCenters))
   DefaultParamsReader.getAndSetParams(model, metadata)
+  DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match {
--- End diff --

minor: If we are going to use a pattern match here, let's use it in the 
`KMeansReader` too. I think either is fine, but let's use a single convention. 
Also, if you eliminate the pattern match entirely, then the imports for Success 
and Failure can be removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-16 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83579306
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -320,6 +362,23 @@ class KMeans @Since("1.5.0") (
   .setMaxIterations($(maxIter))
   .setSeed($(seed))
   .setEpsilon($(tol))
+
+if (isDefined(initialModel)) {
+  // Check the equal of dimension
+  val dimOfData = rdd.first().size
+  val dimOfInitialModel = $(initialModel).clusterCenters.head.size
+  require(dimOfData == dimOfInitialModel,
+s"mismatched dimension, $dimOfData in data while 
$dimOfInitialModel in the initial model.")
+
+  // Check the equal of number of clusters
--- End diff --

very minor, but this doesn't make sense. Maybe `// Check that the number of 
clusters are equal`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83521035
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -446,6 +459,11 @@ private[ml] object DefaultParamsReader {
 val cls = Utils.classForName(metadata.className)
 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
   }
+
+  def loadInitialModel[M <: Model[M]](path: String, sc: SparkContext): 
Try[M] = {
+val initialModelPath = new Path(path, "initialModel").toString
+Try(loadParamsInstance[M](initialModelPath, sc))
--- End diff --

Note, now we always try to load the initial model, which I would guess adds 
some overhead in the cases when there is none, but we have to fail and catch 
the exception. Let's leave this for now since it's cleaner, and only change it 
if it becomes a problem.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83514937
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -81,6 +81,14 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)
 
   /**
+   * Param for KMeansModel to use for warm start.
+   * The setting of initialModel takes precedence of param K.
--- End diff --

Let's be more clear. "Whenever initialModel is set, the initialModel `k` 
will override the param `k`" or something similar.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83514617
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +313,23 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = {
+val kOfInitialModel = value.parentModel.clusterCenters.length
+if (isSet(k)) {
+  if ($(k) != kOfInitialModel) {
+set(k, kOfInitialModel)
+logWarning(s"Param K is set to $kOfInitialModel by the 
initialModel." +
--- End diff --

The previous value is lost because you set it before logging this warning. 
Need to reorder the code.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83521481
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,69 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+assert(oneMoreIterModel.getK === k)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model") {
+val kmeans = new KMeans().setK(5)
+val testNewK = 10
+val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK)
+assert(kmeans.setInitialModel(randomModel).getK === testNewK)
+
+val differentKRandomModel = KMeansSuite.generateRandomKMeansModel(dim, 
testNewK + 1)
+assert(kmeans.setInitialModel(differentKRandomModel).getK === testNewK 
+ 1)
--- End diff --

I don't see what this does. Seems to test exactly the same thing as above. 
I think we should add a test case to first set `k` then set `initialModel` with 
a different `k`, then check if `getK === initialModelK`. To make sure we 
override the first one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83520955
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -333,13 +384,48 @@ class KMeans @Since("1.5.0") (
   override def transformSchema(schema: StructType): StructType = {
 validateAndTransformSchema(schema)
   }
+
+  @Since("2.1.0")
+  override def write: MLWriter = new KMeans.KMeansWriter(this)
 }
 
 @Since("1.6.0")
 object KMeans extends DefaultParamsReadable[KMeans] {
 
+  // TODO: [SPARK-17784]: Add a fromCenters method
+
   @Since("1.6.0")
   override def load(path: String): KMeans = super.load(path)
+
+  @Since("1.6.0")
+  override def read: MLReader[KMeans] = new KMeansReader
+
+  /** [[MLWriter]] instance for [[KMeans]] */
+  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
+override protected def saveImpl(path: String): Unit = {
+  DefaultParamsWriter.saveInitialModel(instance, path)
+  DefaultParamsWriter.saveMetadata(instance, path, sc)
+}
+  }
+
+  private class KMeansReader extends MLReader[KMeans] {
+
+/** Checked against metadata when loading estimator */
+private val className = classOf[KMeans].getName
+
+override def load(path: String): KMeans = {
+  val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+  val instance = new KMeans(metadata.uid)
+
+  DefaultParamsReader.getAndSetParams(instance, metadata)
+  DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match {
+case Success(v) =>
+  instance.setInitialModel(v)
+case Failure(e) =>  // Fail to load initial model
--- End diff --

nit/minor: `case Failure(_) =>`

`foreach` on the `Try` is a bit more concise, but I guess it's less clear 
to non-scala developers


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83514641
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
--- End diff --

Yeah, so I think myself and Nick were under the impression that we would 
simply ignore any calls to `setK` if initialModel is set. So, 

scala
  def setK(value: Int): this.type = {
if (isSet(initialModel)) {
  logWarning("initialModel is set, so k will be ignored. Clear 
initialModel first.")
  this
} else {
  set(k, value)
}
  }

@MLnick @dbtsai does that seem correct?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83520888
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -320,6 +354,23 @@ class KMeans @Since("1.5.0") (
   .setMaxIterations($(maxIter))
   .setSeed($(seed))
   .setEpsilon($(tol))
+
+if (isDefined(initialModel)) {
+  // Check the equal of number of dimension
--- End diff --

nit/minor: "Check the dimension" and "Check the number of clusters"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83515271
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter {
   paramMap: Option[JValue] = None): String = {
 val uid = instance.uid
 val cls = instance.getClass.getName
-val params = 
instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
+val params = instance.extractParamMap().toSeq
+  .filter(_.param.name != 
"initialModel").asInstanceOf[Seq[ParamPair[Any]]]
--- End diff --

It's also possible that we introduce other params that can be of type model 
in the future. Thus, being an instance of `Param[Mode[_]]` may not be 
sufficient to determine if it's the initialModel.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83521550
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,69 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+assert(oneMoreIterModel.getK === k)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model") {
+val kmeans = new KMeans().setK(5)
+val testNewK = 10
+val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK)
+assert(kmeans.setInitialModel(randomModel).getK === testNewK)
+
+val differentKRandomModel = KMeansSuite.generateRandomKMeansModel(dim, 
testNewK + 1)
+assert(kmeans.setInitialModel(differentKRandomModel).getK === testNewK 
+ 1)
+  }
+
+  test("Reset K after setting initial model") {
--- End diff --

If we do change the behavior of just ignoring `k` as noted above, then this 
test will be invalid. In that case, we should just check that `k` is properly 
ignored, like:

scala
  test("Ignore k if initialModel is set") {
val kmeans = new KMeans()

val m1 = KMeansSuite.generateRandomKMeansModel(dim, k)
// ignore k if initialModel is set
assert(kmeans.setInitialModel(m1).setK(k - 1).getK === k)
kmeans.clear(kmeans.initialModel)
// k is not ignored after initialModel is cleared
assert(kmeans.setK(k - 1).getK === k - 1)
}



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83510194
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter {
   paramMap: Option[JValue] = None): String = {
 val uid = instance.uid
 val cls = instance.getClass.getName
-val params = 
instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
+val params = instance.extractParamMap().toSeq
+  .filter(_.param.name != 
"initialModel").asInstanceOf[Seq[ParamPair[Any]]]
--- End diff --

I tried `_.isInsance[Param[Model[_]]]` and it failed because of the erasing 
of `Model[_]` in Java runtime.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-14 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r83509799
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model if K is unset") {
+val kmeans = new KMeans()
+val testNewK = 10
+val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK)
+assert(kmeans.setInitialModel(randomModel).getK === testNewK)
+  }
+
+  test("Initialize using a model with wrong K if K is set") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1)
+val wrongKModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongKModel).fit(dataset)
--- End diff --

According to discussions below, `setInitialModel` only overwrites param k.

1. If we use `km.setInitialModel(m1)` or `km.setK(k).setInitialModel(m1)`, 
then km with `k=m1.getK`;

2. If we use `km.setInitialModel(m1).setInitialModel(m2)`, then km with `k 
= m2.getK`;

3. If we use `km.setInitialModel(m1).setK(differentK)`, then an exception 
will be thrown out when `km.fit(data)`.

I think the case 3 is a wrong use case, so it's reasonable to throw an 
exception. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-11 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82751668
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -446,6 +463,20 @@ private[ml] object DefaultParamsReader {
 val cls = Utils.classForName(metadata.className)
 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
   }
+
+  def loadAndSetInitialModel[M <: Model[M]](
+  instance: HasInitialModel[M], metadata: Metadata, path: String, sc: 
SparkContext): Unit = {
+implicit val format = DefaultFormats
+// Try to load the initial model
--- End diff --

What about just wrapping the load initial model in a `Try`? That way there 
is no need to rely on metadata. It will fail if the "initialModel" path doesn't 
exist. We can ignore that exception but throw (or log) if we encounter an error 
in the actual loading.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-11 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82753706
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter {
   paramMap: Option[JValue] = None): String = {
 val uid = instance.uid
 val cls = instance.getClass.getName
-val params = 
instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
+val params = instance.extractParamMap().toSeq
+  .filter(_.param.name != 
"initialModel").asInstanceOf[Seq[ParamPair[Any]]]
--- End diff --

Is it not possible to check if the param is an instance of 
`Param[Model[_]]` rather than the name?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-11 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82752375
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model if K is unset") {
+val kmeans = new KMeans()
+val testNewK = 10
+val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK)
+assert(kmeans.setInitialModel(randomModel).getK === testNewK)
+  }
+
+  test("Initialize using a model with wrong K if K is set") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1)
--- End diff --

`3` can be `dim` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-11 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82750575
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -333,13 +372,44 @@ class KMeans @Since("1.5.0") (
   override def transformSchema(schema: StructType): StructType = {
 validateAndTransformSchema(schema)
   }
+
+  @Since("2.1.0")
+  override def write: MLWriter = new KMeans.KMeansWriter(this)
 }
 
 @Since("1.6.0")
 object KMeans extends DefaultParamsReadable[KMeans] {
 
+  // TODO: [SPARK-17784]: Add a fromCenters method
+
   @Since("1.6.0")
   override def load(path: String): KMeans = super.load(path)
+
+  @Since("1.6.0")
+  override def read: MLReader[KMeans] = new KMeansReader
+
+  /** [[MLWriter]] instance for [[KMeans]] */
+  private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter {
+override protected def saveImpl(path: String): Unit = {
+  DefaultParamsWriter.saveInitialModel(instance, path)
+  DefaultParamsWriter.saveMetadata(instance, path, sc)
+}
+  }
+
+  private class KMeansReader extends MLReader[KMeans] {
+
+/** Checked against metadata when loading model */
--- End diff --

nit - should this be "... when loading estimator"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-11 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82749390
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
--- End diff --

Above, we may want to also log warning in `setK` if `initialModel` has 
already been set?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-11 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82749322
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = {
+val kOfInitialModel = value.parentModel.clusterCenters.length
+if (isSet(k)) {
+  require(kOfInitialModel == $(k),
--- End diff --

As discussed elsewhere, at set time I think we can log a warning and let 
`initialModel` take precedence.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-11 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82749661
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -81,6 +81,13 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)
 
   /**
+   * Param for KMeansModel to use for warm start.
--- End diff --

We need more detail here to document the behavior that `initialModel` param 
settings take precedence.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-08 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82497437
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model if K is unset") {
+val kmeans = new KMeans()
+val testNewK = 10
+val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK)
+assert(kmeans.setInitialModel(randomModel).getK === testNewK)
+  }
+
+  test("Initialize using a model with wrong K if K is set") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1)
+val wrongKModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongKModel).fit(dataset)
--- End diff --

I think the behavior could be that `setInitialModel` should overwrite the 
configuration including # of iterations and k, etc. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-07 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82489783
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model if K is unset") {
+val kmeans = new KMeans()
+val testNewK = 10
+val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK)
+assert(kmeans.setInitialModel(randomModel).getK === testNewK)
+  }
+
+  test("Initialize using a model with wrong K if K is set") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1)
+val wrongKModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongKModel).fit(dataset)
--- End diff --

How about a user trains a model with inititalModel1, and then want to train 
another one with initalModel2 with different k? For the first case, do we 
really need to throw an error instead of warning?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82474666
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using a model with wrong dimension of cluster centers") 
{
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
+
+  test("Infer K from an initial model if K is unset") {
+val kmeans = new KMeans()
+val testNewK = 10
+val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK)
+assert(kmeans.setInitialModel(randomModel).getK === testNewK)
+  }
+
+  test("Initialize using a model with wrong K if K is set") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+
+val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1)
+val wrongKModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongKModel).fit(dataset)
--- End diff --

This fails on set, so fit should do nothing here. Let's test the following 
cases:

1. `new KMeans().setK(5).setInitialModel(wrongKModel)` - throws an error in 
the setter
2. `new KMeans().setInitialModel(kEquals6Model).setK(5).fit(df)` - throws 
an error in the fit method


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82474849
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using a trained model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
--- End diff --

`assert(oneMoreIterModel.getK === k)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-06 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82230001
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
--- End diff --

+1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-06 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82229991
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
--- End diff --

+1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-06 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82215606
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
--- End diff --

There was some discussion on this in this PR (it was in March :). IF the 
above is the desired behavior, we still need to check that `k` and the initial 
model line up since you can set the initial model, and then set `k`. I tested 
it and an error still gets thrown, but it's thrown by the mllib KMeans instead. 
We should check it in ML explicitly. I prefer the following behavior:

* If `isSet(initialModel && isSet(k)` then check that they are equal at 
train time and throw an error if not
* if `isSet(initialModel) && !isSet(k)` then set k to the initial model k 
at train time (can log a warning maybe)

Actually, the current behavior is essentially equivalent. But, we still 
need a test to check that an error is thrown when the two mismatch, and we need 
to check that case inside of the train method still.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-05 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82105944
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +142,53 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using given cluster centers") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using wrong model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10)
+
+val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1)
+val wrongKModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongKModel).fit(dataset)
+}
+assert(wrongKModelThrown.getMessage.contains("mismatched cluster 
count"))
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
   }
 }
 
--- End diff --

Let's have a test case that without explicitly setting k, k can be inferred 
from initial model.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-05 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82105608
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
--- End diff --

Do we need to 
```scala
def setInitialModel(value: KMeansModel): this.type = 
  set(k, initialModel.parentModel.clusterCenters.length).set(initialModel, 
value)
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-10-05 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r82050056
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -137,18 +142,53 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
   assert(model.clusterCenters === model2.clusterCenters)
 }
 val kmeans = new KMeans()
-testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
+testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData,
+  Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => 
Unit]))
+  }
+
+  test("Initialize using given cluster centers") {
--- End diff --

rename this to "Initialize using trained model"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-29 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r81264106
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 }
 val model = estimator.fit(dataset)
 
-// Test Estimator save/load
-val estimator2 = testDefaultReadWrite(estimator)
-testParams.foreach { case (p, v) =>
-  val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+val testFunctions = if (testParams.contains("initialModel")) {
+  Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => 
Unit]))
--- End diff --

Won't model equality depend on algorithm specific details? Each model has 
different parameters and we really need to use those to check equality...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-29 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r81263907
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter {
   paramMap: Option[JValue] = None): String = {
 val uid = instance.uid
 val cls = instance.getClass.getName
-val params = 
instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
+val params = instance.extractParamMap().toSeq
+  .filter(_.param.name != 
"initialModel").asInstanceOf[Seq[ParamPair[Any]]]
 val jsonParams = paramMap.getOrElse(render(params.map { case 
ParamPair(p, v) =>
   p.name -> parse(p.jsonEncode(v))
 }.toList))
+// If the instance has an "initialModel" param and the param is 
defined, then the initial model
+// will be saved along with the instance.
+val initialModelFlag =
+  instance.hasParam("initialModel") && 
instance.isDefined(instance.getParam("initialModel"))
 val basicMetadata = ("class" -> cls) ~
   ("timestamp" -> System.currentTimeMillis()) ~
   ("sparkVersion" -> sc.version) ~
   ("uid" -> uid) ~
-  ("paramMap" -> jsonParams)
+  ("paramMap" -> jsonParams) ~
+  // TODO: Figure out more robust way to detect the existing of the 
initialModel.
--- End diff --

Hmm, the comment seems to have been placed in the wrong spot. I meant that 
in the test suite, there is a test called "Initialize using given cluster 
centers" when we are really initializing it by training a model and using it as 
the initial model. Sorry for the confusion.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-29 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r81262768
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter {
   paramMap: Option[JValue] = None): String = {
 val uid = instance.uid
 val cls = instance.getClass.getName
-val params = 
instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
+val params = instance.extractParamMap().toSeq
+  .filter(_.param.name != 
"initialModel").asInstanceOf[Seq[ParamPair[Any]]]
 val jsonParams = paramMap.getOrElse(render(params.map { case 
ParamPair(p, v) =>
   p.name -> parse(p.jsonEncode(v))
 }.toList))
+// If the instance has an "initialModel" param and the param is 
defined, then the initial model
+// will be saved along with the instance.
+val initialModelFlag =
+  instance.hasParam("initialModel") && 
instance.isDefined(instance.getParam("initialModel"))
 val basicMetadata = ("class" -> cls) ~
   ("timestamp" -> System.currentTimeMillis()) ~
   ("sparkVersion" -> sc.version) ~
   ("uid" -> uid) ~
-  ("paramMap" -> jsonParams)
+  ("paramMap" -> jsonParams) ~
+  // TODO: Figure out more robust way to detect the existing of the 
initialModel.
--- End diff --

@sethah I don't quite get this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-29 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r81262198
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 }
 val model = estimator.fit(dataset)
 
-// Test Estimator save/load
-val estimator2 = testDefaultReadWrite(estimator)
-testParams.foreach { case (p, v) =>
-  val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+val testFunctions = if (testParams.contains("initialModel")) {
+  Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => 
Unit]))
--- End diff --

One possible solution on this is overriding the equal method for the base 
Model class so that we needn't override the equal method to all Models.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80834369
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter {
   paramMap: Option[JValue] = None): String = {
 val uid = instance.uid
 val cls = instance.getClass.getName
-val params = 
instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
+val params = instance.extractParamMap().toSeq
+  .filter(_.param.name != 
"initialModel").asInstanceOf[Seq[ParamPair[Any]]]
 val jsonParams = paramMap.getOrElse(render(params.map { case 
ParamPair(p, v) =>
   p.name -> parse(p.jsonEncode(v))
 }.toList))
+// If the instance has an "initialModel" param and the param is 
defined, then the initial model
+// will be saved along with the instance.
+val initialModelFlag =
+  instance.hasParam("initialModel") && 
instance.isDefined(instance.getParam("initialModel"))
 val basicMetadata = ("class" -> cls) ~
   ("timestamp" -> System.currentTimeMillis()) ~
   ("sparkVersion" -> sc.version) ~
   ("uid" -> uid) ~
-  ("paramMap" -> jsonParams)
+  ("paramMap" -> jsonParams) ~
+  // TODO: Figure out more robust way to detect the existing of the 
initialModel.
--- End diff --

This test should probably be called "Initialize using a trained model"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80835092
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -17,12 +17,19 @@
 
 package org.apache.spark.ml.clustering
 
+import scala.util.Random
+
 import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.Model
--- End diff --

unused.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80835086
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -17,12 +17,19 @@
 
 package org.apache.spark.ml.clustering
 
+import scala.util.Random
+
 import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.Model
 import org.apache.spark.ml.linalg.{Vector, Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
-import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable}
+import org.apache.spark.ml.util.TestingUtils._
+import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, 
KMeansModel => MLlibKMeansModel}
+import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
+import org.apache.spark.sql.types.StructType
--- End diff --

unused.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80835233
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -17,12 +17,19 @@
 
 package org.apache.spark.ml.clustering
 
+import scala.util.Random
+
 import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.Model
 import org.apache.spark.ml.linalg.{Vector, Vectors}
-import org.apache.spark.ml.util.DefaultReadWriteTest
-import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans}
+import org.apache.spark.ml.param.{ParamMap, ParamPair}
+import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable}
--- End diff --

unused.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80834115
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
--- End diff --

It looks like the `fromCenters` method never got implemented. I think this 
is important since otherwise users can only set an initial model if they 
already have a trained KMeans model.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80833730
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter {
   paramMap: Option[JValue] = None): String = {
 val uid = instance.uid
 val cls = instance.getClass.getName
-val params = 
instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]]
+val params = instance.extractParamMap().toSeq
+  .filter(_.param.name != 
"initialModel").asInstanceOf[Seq[ParamPair[Any]]]
 val jsonParams = paramMap.getOrElse(render(params.map { case 
ParamPair(p, v) =>
   p.name -> parse(p.jsonEncode(v))
 }.toList))
+// If the instance has an "initialModel" param and the param is 
defined, then the initial model
+// will be saved along with the instance.
+val initialModelFlag =
+  instance.hasParam("initialModel") && 
instance.isDefined(instance.getParam("initialModel"))
 val basicMetadata = ("class" -> cls) ~
   ("timestamp" -> System.currentTimeMillis()) ~
   ("sparkVersion" -> sc.version) ~
   ("uid" -> uid) ~
-  ("paramMap" -> jsonParams)
+  ("paramMap" -> jsonParams) ~
+  // TODO: Figure out more robust way to detect the existing of the 
initialModel.
--- End diff --

If we're going to leave it as a TODO, let's have a Jira for it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80833366
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 }
 val model = estimator.fit(dataset)
 
-// Test Estimator save/load
-val estimator2 = testDefaultReadWrite(estimator)
-testParams.foreach { case (p, v) =>
-  val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+val testFunctions = if (testParams.contains("initialModel")) {
+  Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => 
Unit]))
--- End diff --

I don't think it's a good idea. The code is technically incorrect. At the 
very least, we could pass in a map of functions for each param that has a 
special check. Like adding an extra method argument `checkParamsFunctions: 
Map[String, (Any, Any) => Unit] = Map.empty`. I don't think it's an elegant 
solution but it should at least apply generically. I think it might be 
reasonable to override the equals method, but we should check with @jkbradley.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80630417
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -81,11 +81,23 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)
 
   /**
+   * Param for KMeansModel to use for warm start.
+   * @group param
+   */
+  final val initialModel: Param[KMeansModel] =
+new Param[KMeansModel](this, "initialModel", "A KMeansModel for warm 
start.")
+
+  /**
* Validates and transforms the input schema.
* @param schema input schema
* @return output schema
*/
   protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+if (isDefined(initialModel)) {
+  val kOfInitialModel = 
$(initialModel).parentModel.clusterCenters.length
+  require(kOfInitialModel == $(k),
+s"mismatched cluster count, ${$(k)} cluster centers required but 
$kOfInitialModel found.")
+}
--- End diff --

Good catch, I move the check to fit()


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80630435
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -39,7 +40,7 @@ trait DefaultReadWriteTest extends TempDirectory { self: 
Suite =>
* @tparam T ML instance type
* @return  Instance loaded from file
*/
-  def testDefaultReadWrite[T <: Params with MLWritable](
+  def testDefaultReadWrite[T <: Params with MLWritable, M <: Model[M]](
--- End diff --

deleted


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80630450
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 }
 val model = estimator.fit(dataset)
 
-// Test Estimator save/load
-val estimator2 = testDefaultReadWrite(estimator)
-testParams.foreach { case (p, v) =>
-  val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+val testFunctions = if (testParams.contains("initialModel")) {
+  Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => 
Unit]))
+} else {
+  Map.empty[String, (Any, Any) => Unit]
 }
 
+// Test Estimator save/load
+val estimator2 = testDefaultReadWrite(estimator, testParams = false)
+compareParamsWithComplexTypes(estimator, estimator2, testParams, 
testFunctions)
+
 // Test Model save/load
-val model2 = testDefaultReadWrite(model)
-testParams.foreach { case (p, v) =>
-  val param = model.getParam(p)
-  assert(model.get(param).get === model2.get(param).get)
-}
+val model2 = testDefaultReadWrite(model, testParams = false)
+compareParamsWithComplexTypes(model, model2, testParams, testFunctions)
 
 checkModelData(model, model2)
   }
 }
 
+class MyModel extends Model[MyModel] {
--- End diff --

deleted


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-27 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80630372
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 }
 val model = estimator.fit(dataset)
 
-// Test Estimator save/load
-val estimator2 = testDefaultReadWrite(estimator)
-testParams.foreach { case (p, v) =>
-  val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+val testFunctions = if (testParams.contains("initialModel")) {
+  Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => 
Unit]))
--- End diff --

I put a TODO here. For the current situation, I think it's OK. Let's figure 
it out later?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r80565861
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 }
 val model = estimator.fit(dataset)
 
-// Test Estimator save/load
-val estimator2 = testDefaultReadWrite(estimator)
-testParams.foreach { case (p, v) =>
-  val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+val testFunctions = if (testParams.contains("initialModel")) {
+  Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => 
Unit]))
--- End diff --

On second thought, it really might actually be easiest to override the 
equals method. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-21 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r79957341
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -81,11 +81,23 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)
 
   /**
+   * Param for KMeansModel to use for warm start.
+   * @group param
+   */
+  final val initialModel: Param[KMeansModel] =
+new Param[KMeansModel](this, "initialModel", "A KMeansModel for warm 
start.")
+
+  /**
* Validates and transforms the input schema.
* @param schema input schema
* @return output schema
*/
   protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+if (isDefined(initialModel)) {
+  val kOfInitialModel = 
$(initialModel).parentModel.clusterCenters.length
+  require(kOfInitialModel == $(k),
+s"mismatched cluster count, ${$(k)} cluster centers required but 
$kOfInitialModel found.")
+}
--- End diff --

This method is called at the start of training, and right now this is the 
only place we check `k`. TBH, I think it would be cleaner to implement a 
isInitialModelValid method in the companion object which checks the k and the 
feature dimensions all in one place. Checking the k parameter here is a bit odd 
since it's not exactly schema related. Seems cleaner to me to check everything 
initialModel related in one place. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-21 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r79958291
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -139,16 +145,32 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
--- End diff --

Just want to note here, that we don't need to verify that setting the wrong 
type of model throws an error anymore, since it won't compile :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-21 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r79955766
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -139,16 +146,61 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("Initialize using given cluster centers") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using wrong model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10)
+
+val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1)
+val wrongKModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongKModel).fit(dataset)
+}
+assert(wrongKModelThrown.getMessage.contains("mismatched cluster 
count"))
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+val wrongDimModelThrown = intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+assert(wrongDimModelThrown.getMessage.contains("mismatched dimension"))
+  }
 }
 
 object KMeansSuite {
+
+  class MockModel(override val uid: String) extends Model[MockModel] {
--- End diff --

never used?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-21 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r79955656
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 }
 val model = estimator.fit(dataset)
 
-// Test Estimator save/load
-val estimator2 = testDefaultReadWrite(estimator)
-testParams.foreach { case (p, v) =>
-  val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+val testFunctions = if (testParams.contains("initialModel")) {
+  Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => 
Unit]))
--- End diff --

There is an assumption here that checkModelData will work on the initial 
models, which is not always true. In the `HasInitialModel` trait we do not 
require the initial model to be of the same type of the model which inherits 
it. So, if we ever use a model that has an initial model other than itself, we 
will get runtime errors here. We either need to update this to handle arbitrary 
initial models, or specify somehow that the initial model must be the same type 
of the model which inherits it. 

I prefer updating the tests, but I spent some time tinkering with it and 
didn't come up with anything that wasn't really clunky.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-21 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r79958476
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -22,10 +22,11 @@ import java.io.{File, IOException}
 import org.scalatest.Suite
 
 import org.apache.spark.SparkFunSuite
-import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.{Estimator, Model, PipelineStage}
 import org.apache.spark.ml.param._
 import org.apache.spark.mllib.util.MLlibTestSparkContext
-import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.{DataFrame, Dataset}
+import org.apache.spark.sql.types.StructType
--- End diff --

unused imports after `MyModel` is removed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-21 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r79958909
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -39,7 +40,7 @@ trait DefaultReadWriteTest extends TempDirectory { self: 
Suite =>
* @tparam T ML instance type
* @return  Instance loaded from file
*/
-  def testDefaultReadWrite[T <: Params with MLWritable](
+  def testDefaultReadWrite[T <: Params with MLWritable, M <: Model[M]](
--- End diff --

what is this for?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-21 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r79947128
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala ---
@@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { 
self: Suite =>
 }
 val model = estimator.fit(dataset)
 
-// Test Estimator save/load
-val estimator2 = testDefaultReadWrite(estimator)
-testParams.foreach { case (p, v) =>
-  val param = estimator.getParam(p)
-  assert(estimator.get(param).get === estimator2.get(param).get)
+val testFunctions = if (testParams.contains("initialModel")) {
+  Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => 
Unit]))
+} else {
+  Map.empty[String, (Any, Any) => Unit]
 }
 
+// Test Estimator save/load
+val estimator2 = testDefaultReadWrite(estimator, testParams = false)
+compareParamsWithComplexTypes(estimator, estimator2, testParams, 
testFunctions)
+
 // Test Model save/load
-val model2 = testDefaultReadWrite(model)
-testParams.foreach { case (p, v) =>
-  val param = model.getParam(p)
-  assert(model.get(param).get === model2.get(param).get)
-}
+val model2 = testDefaultReadWrite(model, testParams = false)
+compareParamsWithComplexTypes(model, model2, testParams, testFunctions)
 
 checkModelData(model, model2)
   }
 }
 
+class MyModel extends Model[MyModel] {
--- End diff --

This is never used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r79936136
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -81,11 +81,23 @@ private[clustering] trait KMeansParams extends Params 
with HasMaxIter with HasFe
   def getInitSteps: Int = $(initSteps)
 
   /**
+   * Param for KMeansModel to use for warm start.
+   * @group param
+   */
+  final val initialModel: Param[KMeansModel] =
+new Param[KMeansModel](this, "initialModel", "A KMeansModel for warm 
start.")
+
+  /**
* Validates and transforms the input schema.
* @param schema input schema
* @return output schema
*/
   protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+if (isDefined(initialModel)) {
+  val kOfInitialModel = 
$(initialModel).parentModel.clusterCenters.length
+  require(kOfInitialModel == $(k),
+s"mismatched cluster count, ${$(k)} cluster centers required but 
$kOfInitialModel found.")
+}
--- End diff --

Why do you need to check this? This should already be checked in training.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r78690235
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -139,16 +146,72 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("Initialize using given cluster centers") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using wrong model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10)
+val wrongTypeModel = new KMeansSuite.MockModel()
+
+withClue("The type of an initial model should only be a KMeansModel.") 
{
+  intercept[IllegalArgumentException] {
+kmeans.setInitialModel(wrongTypeModel).fit(dataset)
+  }
+}
+
+val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1)
+withClue("The number of clusters set in the given model should be the 
same with the one set" +
+  " in the KMeans estimator.") {
+  intercept[IllegalArgumentException] {
+kmeans.setInitialModel(wrongKModel).fit(dataset)
+  }
+}
+
+val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k)
+withClue("The dimension of points in the model should be the same with 
the dimension of the" +
--- End diff --

`withClue` just prepends the provided message to the error message when the 
exception is not thrown as expected. I think we should check that the correct 
exception is thrown:

scala
val thrown = intercept[IllegalArgumentException] {
  ...
}
assert(thrown.getMessage.contains("mismatched dimension"))



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r78689714
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: Model[_]): this.type = {
--- End diff --

As a follow on, we could eliminate the setter `def setInitialModel(value: 
Model[_])`. To have better documentation, we could leave the param as abstract 
in the `HasInitialModel` trait:

scala
def hasInitialModel: Param[T]


Then, when we add this to new models, we implement the param there. So, in 
KMeansParams:

scala
/**
 * Param for KMeansModel to use for warm start".
 * @group param
 */
final val hasInitialModel: Param[KMeansModel] = new 
Param[KMeansModel](this, "initialModel", 
  "A KMeansModel to use for warm start")


That way the params are explicit in what type of model is used for initial 
model and the documentation is more clear.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r78682701
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: Model[_]): this.type = {
--- End diff --

+1 on using `KMeansModel.fromCenters(centers)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r78669003
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: Model[_]): this.type = {
--- End diff --

Is the reason we provide this method just so we can throw a better error 
message? I am concerned about providing _three different_ setter methods for 
this param, particularly, are we going to have to do this every time? There are 
ways to provide smarter error messages and more specific param docs, which may 
be better than adding extra setters.

I see why we need to provide a way to set the initial model with just 
cluster centers, but I think we should limit the "convenience" methods we add, 
since we plan to extend this design to many other models. For instance, in 
logistic regression do we add:

scala
def setInitialModel(coefficients: Vector, intercept: Double) = ...
def setInitialModel(coefficients: Vector) = setInitialModel(coefficients, 
0.0)
def setInitialModel(coefficients: Matrix, intercept: Vector) = ...
def setInitialModel(coefficients: Matrix) = ...


I think we can get away with only specifying one setter method, as we do in 
the other params. To allow users to specify the model from centers we could add 
a method like `KMeansModel.fromCenters(centers)` and users can use that. I 
appreciate others' thoughts on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-13 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r78610830
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -139,16 +145,32 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
--- End diff --

+1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r78604875
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -203,10 +219,13 @@ object KMeansModel extends MLReadable[KMeansModel] {
 
   /** [[MLWriter]] instance for [[KMeansModel]] */
   private[KMeansModel] class KMeansModelWriter(instance: KMeansModel) 
extends MLWriter {
+import org.json4s.JsonDSL._
--- End diff --

Not used.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-07 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77834310
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -139,16 +146,61 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("Initialize using given cluster centers") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using wrong model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10)
+val wrongTypeModel = new KMeansSuite.MockModel()
+
assert(!kmeans.setInitialModel(wrongTypeModel).isSet(kmeans.initialModel))
+
+val wrongKModel = KMeansSuite.generateKMeansModel(3, k + 1)
+intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongKModel).fit(dataset)
+}
+
+val wrongDimModel = KMeansSuite.generateKMeansModel(4, k)
+intercept[IllegalArgumentException] {
+  kmeans.setInitialModel(wrongDimModel).fit(dataset)
+}
+  }
 }
 
 object KMeansSuite {
+
+  class MockModel(override val uid: String) extends Model[MockModel] {
+
+def this() = this(Identifiable.randomUID("mockModel"))
+
+override def copy(extra: ParamMap): MockModel = throw new 
NotImplementedError()
+
+override def transform(dataset: Dataset[_]): DataFrame = throw new 
NotImplementedError()
+
+override def transformSchema(schema: StructType): StructType = throw 
new NotImplementedError()
+  }
+
   def generateKMeansData(spark: SparkSession, rows: Int, dim: Int, k: 
Int): DataFrame = {
 val sc = spark.sparkContext
 val rdd = sc.parallelize(1 to rows).map(i => 
Vectors.dense(Array.fill(dim)((i % k).toDouble)))
-  .map(v => new TestRow(v))
+  .map(v => TestRow(v))
 spark.createDataFrame(rdd)
   }
 
+  def generateKMeansModel(dim: Int, k: Int, seed: Int = 42): KMeansModel = 
{
--- End diff --

can we call it `generateRandomKMeansModel`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-06 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77758918
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -137,6 +138,17 @@ class KMeansModel private[ml] (
   @Since("1.6.0")
   override def write: MLWriter = new KMeansModel.KMeansModelWriter(this)
 
+  override def hashCode(): Int = {
+(Array(this.getClass, uid) ++ clusterCenters)
--- End diff --

@yinxusen Correct me if I'm wrong, but I believe you override the equals 
method is because the params are checked for equality in the read/write tests. 
Just thinking ahead, we will have to do this for every model we use as an 
initial model. We can avoid this by adding some handling inside the read/write 
params test, and then checking the initial model equality for read/write inside 
the `checkModelData` method. I guess I'd prefer not to randomly overwrite some 
models equals methods, and not others, especially since the reasoning behind 
won't be clear. What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-06 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77737542
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -446,6 +463,20 @@ private[ml] object DefaultParamsReader {
 val cls = Utils.classForName(metadata.className)
 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
   }
+
+  def loadAndSetInitialModel[M <: Model[M]](
+  instance: HasInitialModel[M], metadata: Metadata, path: String, sc: 
SparkContext): Unit = {
+implicit val format = DefaultFormats
+// Try to load the initial model
+if (metadata.metadata \ "initialModel" != JNothing) {
+  val hasInitialModel = (metadata.metadata \ 
"initialModel").extract[Boolean]
+  if (hasInitialModel) {
+val initialModelPath = new Path(path, "initialModel").toString
+val initialModel = loadParamsInstance[Model[M]](initialModelPath, 
sc)
--- End diff --

why not `loadParamsInstance[M]`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-06 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77737428
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -446,6 +463,20 @@ private[ml] object DefaultParamsReader {
 val cls = Utils.classForName(metadata.className)
 cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path)
   }
+
+  def loadAndSetInitialModel[M <: Model[M]](
+  instance: HasInitialModel[M], metadata: Metadata, path: String, sc: 
SparkContext): Unit = {
+implicit val format = DefaultFormats
+// Try to load the initial model
+if (metadata.metadata \ "initialModel" != JNothing) {
+  val hasInitialModel = (metadata.metadata \ 
"initialModel").extract[Boolean]
+  if (hasInitialModel) {
+val initialModelPath = new Path(path, "initialModel").toString
+val initialModel = loadParamsInstance[Model[M]](initialModelPath, 
sc)
+instance.set(instance.getParam("initialModel"), initialModel)
--- End diff --

can access the param directly here instead of `getParam`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-06 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77737335
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -139,16 +146,61 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("Initialize using given cluster centers") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1)
+val oneIterModel = kmeans.fit(dataset)
+val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 
2))).fit(dataset)
+val oneMoreIterModel = 
kmeans.setInitialModel(oneIterModel).fit(dataset)
+
+twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters)
+  .foreach { case (center1, center2) => assert(center1 ~== center2 
absTol 1E-8) }
+  }
+
+  test("Initialize using wrong model") {
+val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10)
+val wrongTypeModel = new KMeansSuite.MockModel()
+
assert(!kmeans.setInitialModel(wrongTypeModel).isSet(kmeans.initialModel))
+
+val wrongKModel = KMeansSuite.generateKMeansModel(3, k + 1)
+intercept[IllegalArgumentException] {
--- End diff --

In other places with similar checks in ML, we typically check that the 
error message contains the expected message. I'd prefer to keep with that 
pattern here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-06 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77737026
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") (
   @Since("1.5.0")
   def setSeed(value: Long): this.type = set(seed, value)
 
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: KMeansModel): this.type = set(initialModel, 
value)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setInitialModel(value: Model[_]): this.type = {
+value match {
+  case m: KMeansModel => setInitialModel(m)
+  case other =>
+logWarning(s"KMeansModel required but 
${other.getClass.getSimpleName} found.")
--- End diff --

I prefer an error here. Other parameters throw errors when invalid values 
are set, so I don't see any reason to deviate from that behavior here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-06 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77734912
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -318,6 +327,14 @@ private[ml] object DefaultParamsWriter {
 val metadataJson: String = compact(render(metadata))
 metadataJson
   }
+
+  def saveInitialModel[T <: HasInitialModel[_ <: MLWritable]](instance: T, 
path: String): Unit = {
+if (instance.isDefined(instance.getParam("initialModel"))) {
+  val initialModelPath = new Path(path, "initialModel").toString
+  val initialModel = 
instance.getOrDefault(instance.getParam("initialModel"))
+  initialModel.asInstanceOf[MLWritable].save(initialModelPath)
--- End diff --

No need for the `asInstanceOf` anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-06 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77734849
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -318,6 +327,14 @@ private[ml] object DefaultParamsWriter {
 val metadataJson: String = compact(render(metadata))
 metadataJson
   }
+
+  def saveInitialModel[T <: HasInitialModel[_ <: MLWritable]](instance: T, 
path: String): Unit = {
+if (instance.isDefined(instance.getParam("initialModel"))) {
--- End diff --

Since we specify the type parameters, you can access the param directly 
`instance.isDefined(instance.initialModel)`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans

2016-09-02 Thread yinxusen
Github user yinxusen commented on a diff in the pull request:

https://github.com/apache/spark/pull/9#discussion_r77414688
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala ---
@@ -139,16 +145,32 @@ class KMeansSuite extends SparkFunSuite with 
MLlibTestSparkContext with DefaultR
 val kmeans = new KMeans()
 testEstimatorAndModelReadWrite(kmeans, dataset, 
KMeansSuite.allParamSettings, checkModelData)
   }
+
+  test("Initialize using given cluster centers") {
--- End diff --

I think the current test is OK to assert the right behavior of 
initialModel. And it's more economic to test with only one or two iterations.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org