[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/9 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88725427 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf} import org.apache.spark.sql.types.{IntegerType, StructType} /** - * Common params for KMeans and KMeansModel + * Params for KMeans + */ + +private[clustering] trait KMeansInitialModelParams extends HasInitialModel[KMeansModel] { + /** + * Param for KMeansModel to use for warm start. + * Whenever initialModel is set: + * 1. the initialModel k will override the param k; + * 2. the param initMode is set to initialModel and manually set is ignored; + * 3. other params are untouched. + * @group param + */ + final val initialModel: Param[KMeansModel] = --- End diff -- `override final val` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88724978 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf} import org.apache.spark.sql.types.{IntegerType, StructType} /** - * Common params for KMeans and KMeansModel + * Params for KMeans + */ + +private[clustering] trait KMeansInitialModelParams extends HasInitialModel[KMeansModel] { + /** + * Param for KMeansModel to use for warm start. + * Whenever initialModel is set: + * 1. the initialModel k will override the param k; + * 2. the param initMode is set to initialModel and manually set is ignored; --- End diff -- 2. the param initMode is set to "initialModel" and manually setting initMode will be ignored nit: Let's just remove the punctuation from the numbered list --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88714626 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -124,7 +147,8 @@ class KMeansModel private[ml] ( @Since("2.0.0") override def transform(dataset: Dataset[_]): DataFrame = { transformSchema(dataset.schema, logging = true) -val predictUDF = udf((vector: Vector) => predict(vector)) +val tmpParent: MLlibKMeansModel = parentModel --- End diff -- maybe a comment would be useful? `// avoid encapsulating the entire model in the closure` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88713108 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala --- @@ -414,6 +414,8 @@ object KMeans { val RANDOM = "random" @Since("0.8.0") val K_MEANS_PARALLEL = "k-means||" + @Since("2.1.0") + val K_MEANS_INITIAL_MODEL = "initialModel" --- End diff -- Does it need to be public? This only serves a purpose when used with ML I think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88722547 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -145,18 +150,67 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +assert(oneMoreIterModel.getK === k) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model") { --- End diff -- So, the behavior is getting a bit confusing because we now have three params which are intertwined. For that reason, we should be very thorough on the tests. With my understanding of the behavior we decided on, the following tests should all pass. Can you tell me if it looks right to you?: scala test("initialModel params") { val initialK = 3 val initialEstimator = new KMeans() .setK(initialK) val initialModel = initialEstimator.fit(dataset) val km = new KMeans() .setK(initialK + 1) .setInitialModel(initialModel) // intialModel sets k and init mode assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL) assert(km.getK === initialK) assert(km.getInitialModel.getK === initialK) // setting k is ignored km.setK(initialK + 1) assert(km.getK === initialK) // this should work since we already set initialModel km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL) // this is ignored because initialModel is set km.setInitMode(MLlibKMeans.RANDOM) assert(km.getInitMode === MLlibKMeans.K_MEANS_INITIAL_MODEL) km.clear(km.initialModel) // kmeans now accepts init mode km.setInitMode(MLlibKMeans.RANDOM) assert(km.getInitMode === MLlibKMeans.RANDOM) // kmeans should throw an error since we shouldn't be allowed to set init mode to "initialModel" intercept[IllegalArgumentException] { km.setInitMode(MLlibKMeans.K_MEANS_INITIAL_MODEL) } } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88713359 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -284,11 +309,26 @@ class KMeans @Since("1.5.0") ( /** @group setParam */ @Since("1.5.0") - def setK(value: Int): this.type = set(k, value) + def setK(value: Int): this.type = { +if (isSet(initialModel)) { + logWarning("initialModel is set, so k will be ignored. Clear initialModel first.") + this +} else { + set(k, value) +} + } /** @group expertSetParam */ @Since("1.5.0") - def setInitMode(value: String): this.type = set(initMode, value) + def setInitMode(value: String): this.type = { +if (isSet(initialModel)) { + logWarning(s"initialModel is set, so initMode will be ignored. Clear initialModel first.") +} +if (value == MLlibKMeans.K_MEANS_INITIAL_MODEL) { + logWarning(s"initMode of $value is not supported here, please use setInitialModel.") --- End diff -- From the discussion, I think we decided to throw an error for `setInitMode("initialModel")` if initialModel wasn't already set. If initialModel has been set, then we'd just update the initMode as normal. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88715322 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -306,6 +346,25 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = { +val kOfInitialModel = value.parentModel.clusterCenters.length +if (isSet(k)) { + if ($(k) != kOfInitialModel) { +val previousK = $(k) +set(k, kOfInitialModel) +logWarning(s"Param K is set to $kOfInitialModel by the initialModel." + --- End diff -- nit: Maybe `s"Param k was changed from $previousK to $kOfInitialModel to match the initialModel"` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88725396 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -35,7 +38,25 @@ import org.apache.spark.sql.functions.{col, udf} import org.apache.spark.sql.types.{IntegerType, StructType} /** - * Common params for KMeans and KMeansModel + * Params for KMeans + */ + +private[clustering] trait KMeansInitialModelParams extends HasInitialModel[KMeansModel] { --- End diff -- If we follow the convention in ALS, then we should have `KMeansModelParams` and `KMeansParams extends KMeansModelParams with ...`. I think it would be good to do the same here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r88713635 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -284,11 +309,26 @@ class KMeans @Since("1.5.0") ( /** @group setParam */ @Since("1.5.0") - def setK(value: Int): this.type = set(k, value) + def setK(value: Int): this.type = { +if (isSet(initialModel)) { + logWarning("initialModel is set, so k will be ignored. Clear initialModel first.") + this +} else { + set(k, value) +} + } /** @group expertSetParam */ @Since("1.5.0") - def setInitMode(value: String): this.type = set(initMode, value) + def setInitMode(value: String): this.type = { +if (isSet(initialModel)) { + logWarning(s"initialModel is set, so initMode will be ignored. Clear initialModel first.") --- End diff -- We say it will be ignored, but then still set it below. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r84731896 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -320,6 +368,23 @@ class KMeans @Since("1.5.0") ( .setMaxIterations($(maxIter)) .setSeed($(seed)) .setEpsilon($(tol)) + +if (isDefined(initialModel)) { + // Check that the feature dimensions are equal + val dimOfData = rdd.first().size + val dimOfInitialModel = $(initialModel).clusterCenters.head.size + require(dimOfData == dimOfInitialModel, +s"mismatched dimension, $dimOfData in data while $dimOfInitialModel in the initial model.") + + // Check that the number of clusters are equal + val kOfInitialModel = $(initialModel).parentModel.clusterCenters.length + require(kOfInitialModel == $(k), --- End diff -- I'd recommend that this log a warning instead of causing a failure. If we use CrossValidator to select amongst initial models, then --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83764721 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -252,8 +254,10 @@ object KMeansModel extends MLReadable[KMeansModel] { } val model = new KMeansModel(metadata.uid, new MLlibKMeansModel(clusterCenters)) DefaultParamsReader.getAndSetParams(model, metadata) - DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) -.foreach(v => model.set(model.initialModel, v)) + DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match { +case Success(v) => model.set(model.initialModel, v) +case Failure(e) => if (!e.isInstanceOf[InvalidInputException]) throw e --- End diff -- @MLnick Is it good to ignore only `InvalidInputException` while keep other exceptions? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83719858 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -81,6 +81,13 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe def getInitSteps: Int = $(initSteps) /** + * Param for KMeansModel to use for warm start. --- End diff -- Actually, we also need to be clear that the `initMode` is entirely ignored when setting an initial model. Let's add it to the doc of the `initMode` param --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83703876 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -446,6 +459,11 @@ private[ml] object DefaultParamsReader { val cls = Utils.classForName(metadata.className) cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) } + + def loadInitialModel[M <: Model[M]](path: String, sc: SparkContext): Try[M] = { +val initialModelPath = new Path(path, "initialModel").toString +Try(loadParamsInstance[M](initialModelPath, sc)) --- End diff -- Also, this is not a critical path, so I think the overhead is okay. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83693443 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -81,6 +83,14 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe def getInitSteps: Int = $(initSteps) /** + * Param for KMeansModel to use for warm start. + * Whenever initialModel is set, the initialModel k will override the param k. --- End diff -- I think we should also be clear that other params won't be overridden. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83694799 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -446,6 +459,11 @@ private[ml] object DefaultParamsReader { val cls = Utils.classForName(metadata.className) cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) } + + def loadInitialModel[M <: Model[M]](path: String, sc: SparkContext): Try[M] = { +val initialModelPath = new Path(path, "initialModel").toString +Try(loadParamsInstance[M](initialModelPath, sc)) --- End diff -- I don't think the overhead is that large, and if there is no model then it should fail fast as the directory won't exist. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83694228 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ --- End diff -- I don't feel really strongly about it. But I think it is not a critical failure, so does not really require a hard stop of an exception. Since we make it clear in the doc that initial model `k` takes precedence, if a user sets a different `k` anyway we can still proceed, just let them know that we are ignoring their setting. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83694961 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -333,13 +389,45 @@ class KMeans @Since("1.5.0") ( override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema) } + + @Since("2.1.0") + override def write: MLWriter = new KMeans.KMeansWriter(this) } @Since("1.6.0") object KMeans extends DefaultParamsReadable[KMeans] { + // TODO: [SPARK-17784]: Add a fromCenters method + @Since("1.6.0") override def load(path: String): KMeans = super.load(path) + + @Since("1.6.0") + override def read: MLReader[KMeans] = new KMeansReader + + /** [[MLWriter]] instance for [[KMeans]] */ + private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter { +override protected def saveImpl(path: String): Unit = { + DefaultParamsWriter.saveInitialModel(instance, path) + DefaultParamsWriter.saveMetadata(instance, path, sc) +} + } + + private class KMeansReader extends MLReader[KMeans] { + +/** Checked against metadata when loading estimator */ +private val className = classOf[KMeans].getName + +override def load(path: String): KMeans = { + val metadata = DefaultParamsReader.loadMetadata(path, sc, className) + val instance = new KMeans(metadata.uid) + + DefaultParamsReader.getAndSetParams(instance, metadata) + DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) +.foreach(v => instance.setInitialModel(v)) --- End diff -- Here I was thinking more along the lines that we could explicitly ignore the error we expect. So if the model load fails for the reason that the "initialModel" directory is empty, then we can ignore (that's basically the only case). Otherwise, it may be a legitimate failure to load the model (say some future load incompatibility or bug, or some other issue with loading an initial model that is there) and we will just swallow it without the user knowing there is a problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83695311 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] --- End diff -- Fair enough --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83662587 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -17,6 +17,8 @@ package org.apache.spark.ml.clustering +import scala.util.{Failure, Success} --- End diff -- please revert these --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83600176 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ --- End diff -- My personally preference is throwing an exception to make it clear for users; but I don't have strong opinion about this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83599465 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ --- End diff -- Agree with @sethah on this - initial model should take precedence - essentially ignoring any `setK` call whether before or after setting initial model. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83579367 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,67 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +assert(oneMoreIterModel.getK === k) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model") { +val kmeans = new KMeans().setK(5) +val testNewK = 10 +val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK) +assert(kmeans.setInitialModel(randomModel).getK === testNewK) + } + + test("Ignore k if initialModel is set") { +val kmeans = new KMeans() + +val m1 = KMeansSuite.generateRandomKMeansModel(dim, k) --- End diff -- minor: call it `initialModel` instead? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83579289 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -320,6 +362,23 @@ class KMeans @Since("1.5.0") ( .setMaxIterations($(maxIter)) .setSeed($(seed)) .setEpsilon($(tol)) + +if (isDefined(initialModel)) { + // Check the equal of dimension --- End diff -- very minor, but this doesn't make sense. Maybe `// Check that the feature dimensions are equal` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83579235 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -241,6 +252,12 @@ object KMeansModel extends MLReadable[KMeansModel] { } val model = new KMeansModel(metadata.uid, new MLlibKMeansModel(clusterCenters)) DefaultParamsReader.getAndSetParams(model, metadata) + DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match { --- End diff -- minor: If we are going to use a pattern match here, let's use it in the `KMeansReader` too. I think either is fine, but let's use a single convention. Also, if you eliminate the pattern match entirely, then the imports for Success and Failure can be removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83579306 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -320,6 +362,23 @@ class KMeans @Since("1.5.0") ( .setMaxIterations($(maxIter)) .setSeed($(seed)) .setEpsilon($(tol)) + +if (isDefined(initialModel)) { + // Check the equal of dimension + val dimOfData = rdd.first().size + val dimOfInitialModel = $(initialModel).clusterCenters.head.size + require(dimOfData == dimOfInitialModel, +s"mismatched dimension, $dimOfData in data while $dimOfInitialModel in the initial model.") + + // Check the equal of number of clusters --- End diff -- very minor, but this doesn't make sense. Maybe `// Check that the number of clusters are equal` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83521035 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -446,6 +459,11 @@ private[ml] object DefaultParamsReader { val cls = Utils.classForName(metadata.className) cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) } + + def loadInitialModel[M <: Model[M]](path: String, sc: SparkContext): Try[M] = { +val initialModelPath = new Path(path, "initialModel").toString +Try(loadParamsInstance[M](initialModelPath, sc)) --- End diff -- Note, now we always try to load the initial model, which I would guess adds some overhead in the cases when there is none, but we have to fail and catch the exception. Let's leave this for now since it's cleaner, and only change it if it becomes a problem. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83514937 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -81,6 +81,14 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe def getInitSteps: Int = $(initSteps) /** + * Param for KMeansModel to use for warm start. + * The setting of initialModel takes precedence of param K. --- End diff -- Let's be more clear. "Whenever initialModel is set, the initialModel `k` will override the param `k`" or something similar. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83514617 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +313,23 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = { +val kOfInitialModel = value.parentModel.clusterCenters.length +if (isSet(k)) { + if ($(k) != kOfInitialModel) { +set(k, kOfInitialModel) +logWarning(s"Param K is set to $kOfInitialModel by the initialModel." + --- End diff -- The previous value is lost because you set it before logging this warning. Need to reorder the code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83521481 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,69 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +assert(oneMoreIterModel.getK === k) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model") { +val kmeans = new KMeans().setK(5) +val testNewK = 10 +val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK) +assert(kmeans.setInitialModel(randomModel).getK === testNewK) + +val differentKRandomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK + 1) +assert(kmeans.setInitialModel(differentKRandomModel).getK === testNewK + 1) --- End diff -- I don't see what this does. Seems to test exactly the same thing as above. I think we should add a test case to first set `k` then set `initialModel` with a different `k`, then check if `getK === initialModelK`. To make sure we override the first one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83520955 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -333,13 +384,48 @@ class KMeans @Since("1.5.0") ( override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema) } + + @Since("2.1.0") + override def write: MLWriter = new KMeans.KMeansWriter(this) } @Since("1.6.0") object KMeans extends DefaultParamsReadable[KMeans] { + // TODO: [SPARK-17784]: Add a fromCenters method + @Since("1.6.0") override def load(path: String): KMeans = super.load(path) + + @Since("1.6.0") + override def read: MLReader[KMeans] = new KMeansReader + + /** [[MLWriter]] instance for [[KMeans]] */ + private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter { +override protected def saveImpl(path: String): Unit = { + DefaultParamsWriter.saveInitialModel(instance, path) + DefaultParamsWriter.saveMetadata(instance, path, sc) +} + } + + private class KMeansReader extends MLReader[KMeans] { + +/** Checked against metadata when loading estimator */ +private val className = classOf[KMeans].getName + +override def load(path: String): KMeans = { + val metadata = DefaultParamsReader.loadMetadata(path, sc, className) + val instance = new KMeans(metadata.uid) + + DefaultParamsReader.getAndSetParams(instance, metadata) + DefaultParamsReader.loadInitialModel[KMeansModel](path, sc) match { +case Success(v) => + instance.setInitialModel(v) +case Failure(e) => // Fail to load initial model --- End diff -- nit/minor: `case Failure(_) =>` `foreach` on the `Try` is a bit more concise, but I guess it's less clear to non-scala developers --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83514641 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ --- End diff -- Yeah, so I think myself and Nick were under the impression that we would simply ignore any calls to `setK` if initialModel is set. So, scala def setK(value: Int): this.type = { if (isSet(initialModel)) { logWarning("initialModel is set, so k will be ignored. Clear initialModel first.") this } else { set(k, value) } } @MLnick @dbtsai does that seem correct? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83520888 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -320,6 +354,23 @@ class KMeans @Since("1.5.0") ( .setMaxIterations($(maxIter)) .setSeed($(seed)) .setEpsilon($(tol)) + +if (isDefined(initialModel)) { + // Check the equal of number of dimension --- End diff -- nit/minor: "Check the dimension" and "Check the number of clusters" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83515271 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] --- End diff -- It's also possible that we introduce other params that can be of type model in the future. Thus, being an instance of `Param[Mode[_]]` may not be sufficient to determine if it's the initialModel. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83521550 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,69 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +assert(oneMoreIterModel.getK === k) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model") { +val kmeans = new KMeans().setK(5) +val testNewK = 10 +val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK) +assert(kmeans.setInitialModel(randomModel).getK === testNewK) + +val differentKRandomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK + 1) +assert(kmeans.setInitialModel(differentKRandomModel).getK === testNewK + 1) + } + + test("Reset K after setting initial model") { --- End diff -- If we do change the behavior of just ignoring `k` as noted above, then this test will be invalid. In that case, we should just check that `k` is properly ignored, like: scala test("Ignore k if initialModel is set") { val kmeans = new KMeans() val m1 = KMeansSuite.generateRandomKMeansModel(dim, k) // ignore k if initialModel is set assert(kmeans.setInitialModel(m1).setK(k - 1).getK === k) kmeans.clear(kmeans.initialModel) // k is not ignored after initialModel is cleared assert(kmeans.setK(k - 1).getK === k - 1) } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83510194 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] --- End diff -- I tried `_.isInsance[Param[Model[_]]]` and it failed because of the erasing of `Model[_]` in Java runtime. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r83509799 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model if K is unset") { +val kmeans = new KMeans() +val testNewK = 10 +val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK) +assert(kmeans.setInitialModel(randomModel).getK === testNewK) + } + + test("Initialize using a model with wrong K if K is set") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1) +val wrongKModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongKModel).fit(dataset) --- End diff -- According to discussions below, `setInitialModel` only overwrites param k. 1. If we use `km.setInitialModel(m1)` or `km.setK(k).setInitialModel(m1)`, then km with `k=m1.getK`; 2. If we use `km.setInitialModel(m1).setInitialModel(m2)`, then km with `k = m2.getK`; 3. If we use `km.setInitialModel(m1).setK(differentK)`, then an exception will be thrown out when `km.fit(data)`. I think the case 3 is a wrong use case, so it's reasonable to throw an exception. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82751668 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -446,6 +463,20 @@ private[ml] object DefaultParamsReader { val cls = Utils.classForName(metadata.className) cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) } + + def loadAndSetInitialModel[M <: Model[M]]( + instance: HasInitialModel[M], metadata: Metadata, path: String, sc: SparkContext): Unit = { +implicit val format = DefaultFormats +// Try to load the initial model --- End diff -- What about just wrapping the load initial model in a `Try`? That way there is no need to rely on metadata. It will fail if the "initialModel" path doesn't exist. We can ignore that exception but throw (or log) if we encounter an error in the actual loading. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82753706 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] --- End diff -- Is it not possible to check if the param is an instance of `Param[Model[_]]` rather than the name? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82752375 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model if K is unset") { +val kmeans = new KMeans() +val testNewK = 10 +val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK) +assert(kmeans.setInitialModel(randomModel).getK === testNewK) + } + + test("Initialize using a model with wrong K if K is set") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1) --- End diff -- `3` can be `dim` here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82750575 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -333,13 +372,44 @@ class KMeans @Since("1.5.0") ( override def transformSchema(schema: StructType): StructType = { validateAndTransformSchema(schema) } + + @Since("2.1.0") + override def write: MLWriter = new KMeans.KMeansWriter(this) } @Since("1.6.0") object KMeans extends DefaultParamsReadable[KMeans] { + // TODO: [SPARK-17784]: Add a fromCenters method + @Since("1.6.0") override def load(path: String): KMeans = super.load(path) + + @Since("1.6.0") + override def read: MLReader[KMeans] = new KMeansReader + + /** [[MLWriter]] instance for [[KMeans]] */ + private[KMeans] class KMeansWriter(instance: KMeans) extends MLWriter { +override protected def saveImpl(path: String): Unit = { + DefaultParamsWriter.saveInitialModel(instance, path) + DefaultParamsWriter.saveMetadata(instance, path, sc) +} + } + + private class KMeansReader extends MLReader[KMeans] { + +/** Checked against metadata when loading model */ --- End diff -- nit - should this be "... when loading estimator"? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82749390 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ --- End diff -- Above, we may want to also log warning in `setK` if `initialModel` has already been set? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82749322 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,20 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = { +val kOfInitialModel = value.parentModel.clusterCenters.length +if (isSet(k)) { + require(kOfInitialModel == $(k), --- End diff -- As discussed elsewhere, at set time I think we can log a warning and let `initialModel` take precedence. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82749661 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -81,6 +81,13 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe def getInitSteps: Int = $(initSteps) /** + * Param for KMeansModel to use for warm start. --- End diff -- We need more detail here to document the behavior that `initialModel` param settings take precedence. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82497437 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model if K is unset") { +val kmeans = new KMeans() +val testNewK = 10 +val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK) +assert(kmeans.setInitialModel(randomModel).getK === testNewK) + } + + test("Initialize using a model with wrong K if K is set") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1) +val wrongKModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongKModel).fit(dataset) --- End diff -- I think the behavior could be that `setInitialModel` should overwrite the configuration including # of iterations and k, etc. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82489783 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model if K is unset") { +val kmeans = new KMeans() +val testNewK = 10 +val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK) +assert(kmeans.setInitialModel(randomModel).getK === testNewK) + } + + test("Initialize using a model with wrong K if K is set") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1) +val wrongKModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongKModel).fit(dataset) --- End diff -- How about a user trains a model with inititalModel1, and then want to train another one with initalModel2 with different k? For the first case, do we really need to throw an error instead of warning? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82474666 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using a model with wrong dimension of cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } + + test("Infer K from an initial model if K is unset") { +val kmeans = new KMeans() +val testNewK = 10 +val randomModel = KMeansSuite.generateRandomKMeansModel(dim, testNewK) +assert(kmeans.setInitialModel(randomModel).getK === testNewK) + } + + test("Initialize using a model with wrong K if K is set") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) + +val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1) +val wrongKModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongKModel).fit(dataset) --- End diff -- This fails on set, so fit should do nothing here. Let's test the following cases: 1. `new KMeans().setK(5).setInitialModel(wrongKModel)` - throws an error in the setter 2. `new KMeans().setInitialModel(kEquals6Model).setK(5).fit(df)` - throws an error in the fit method --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82474849 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +143,64 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using a trained model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) --- End diff -- `assert(oneMoreIterModel.getK === k)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82230001 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82229991 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82215606 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) --- End diff -- There was some discussion on this in this PR (it was in March :). IF the above is the desired behavior, we still need to check that `k` and the initial model line up since you can set the initial model, and then set `k`. I tested it and an error still gets thrown, but it's thrown by the mllib KMeans instead. We should check it in ML explicitly. I prefer the following behavior: * If `isSet(initialModel && isSet(k)` then check that they are equal at train time and throw an error if not * if `isSet(initialModel) && !isSet(k)` then set k to the initial model k at train time (can log a warning maybe) Actually, the current behavior is essentially equivalent. But, we still need a test to check that an error is thrown when the two mismatch, and we need to check that case inside of the train method still. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82105944 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +142,53 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using given cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using wrong model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10) + +val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1) +val wrongKModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongKModel).fit(dataset) +} +assert(wrongKModelThrown.getMessage.contains("mismatched cluster count")) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) } } --- End diff -- Let's have a test case that without explicitly setting k, k can be inferred from initial model. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82105608 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) --- End diff -- Do we need to ```scala def setInitialModel(value: KMeansModel): this.type = set(k, initialModel.parentModel.clusterCenters.length).set(initialModel, value) ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r82050056 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -137,18 +142,53 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR assert(model.clusterCenters === model2.clusterCenters) } val kmeans = new KMeans() -testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) +testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData, + Map("initialModel" -> (checkModelData _).asInstanceOf[(Any, Any) => Unit])) + } + + test("Initialize using given cluster centers") { --- End diff -- rename this to "Initialize using trained model" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r81264106 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => } val model = estimator.fit(dataset) -// Test Estimator save/load -val estimator2 = testDefaultReadWrite(estimator) -testParams.foreach { case (p, v) => - val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) +val testFunctions = if (testParams.contains("initialModel")) { + Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => Unit])) --- End diff -- Won't model equality depend on algorithm specific details? Each model has different parameters and we really need to use those to check equality... --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r81263907 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] val jsonParams = paramMap.getOrElse(render(params.map { case ParamPair(p, v) => p.name -> parse(p.jsonEncode(v)) }.toList)) +// If the instance has an "initialModel" param and the param is defined, then the initial model +// will be saved along with the instance. +val initialModelFlag = + instance.hasParam("initialModel") && instance.isDefined(instance.getParam("initialModel")) val basicMetadata = ("class" -> cls) ~ ("timestamp" -> System.currentTimeMillis()) ~ ("sparkVersion" -> sc.version) ~ ("uid" -> uid) ~ - ("paramMap" -> jsonParams) + ("paramMap" -> jsonParams) ~ + // TODO: Figure out more robust way to detect the existing of the initialModel. --- End diff -- Hmm, the comment seems to have been placed in the wrong spot. I meant that in the test suite, there is a test called "Initialize using given cluster centers" when we are really initializing it by training a model and using it as the initial model. Sorry for the confusion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r81262768 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] val jsonParams = paramMap.getOrElse(render(params.map { case ParamPair(p, v) => p.name -> parse(p.jsonEncode(v)) }.toList)) +// If the instance has an "initialModel" param and the param is defined, then the initial model +// will be saved along with the instance. +val initialModelFlag = + instance.hasParam("initialModel") && instance.isDefined(instance.getParam("initialModel")) val basicMetadata = ("class" -> cls) ~ ("timestamp" -> System.currentTimeMillis()) ~ ("sparkVersion" -> sc.version) ~ ("uid" -> uid) ~ - ("paramMap" -> jsonParams) + ("paramMap" -> jsonParams) ~ + // TODO: Figure out more robust way to detect the existing of the initialModel. --- End diff -- @sethah I don't quite get this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r81262198 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => } val model = estimator.fit(dataset) -// Test Estimator save/load -val estimator2 = testDefaultReadWrite(estimator) -testParams.foreach { case (p, v) => - val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) +val testFunctions = if (testParams.contains("initialModel")) { + Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => Unit])) --- End diff -- One possible solution on this is overriding the equal method for the base Model class so that we needn't override the equal method to all Models. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80834369 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] val jsonParams = paramMap.getOrElse(render(params.map { case ParamPair(p, v) => p.name -> parse(p.jsonEncode(v)) }.toList)) +// If the instance has an "initialModel" param and the param is defined, then the initial model +// will be saved along with the instance. +val initialModelFlag = + instance.hasParam("initialModel") && instance.isDefined(instance.getParam("initialModel")) val basicMetadata = ("class" -> cls) ~ ("timestamp" -> System.currentTimeMillis()) ~ ("sparkVersion" -> sc.version) ~ ("uid" -> uid) ~ - ("paramMap" -> jsonParams) + ("paramMap" -> jsonParams) ~ + // TODO: Figure out more robust way to detect the existing of the initialModel. --- End diff -- This test should probably be called "Initialize using a trained model" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80835092 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -17,12 +17,19 @@ package org.apache.spark.ml.clustering +import scala.util.Random + import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.Model --- End diff -- unused. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80835086 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -17,12 +17,19 @@ package org.apache.spark.ml.clustering +import scala.util.Random + import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.Model import org.apache.spark.ml.linalg.{Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest -import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans} +import org.apache.spark.ml.param.{ParamMap, ParamPair} +import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable} +import org.apache.spark.ml.util.TestingUtils._ +import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel => MLlibKMeansModel} +import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors} import org.apache.spark.mllib.util.MLlibTestSparkContext import org.apache.spark.sql.{DataFrame, Dataset, SparkSession} +import org.apache.spark.sql.types.StructType --- End diff -- unused. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80835233 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -17,12 +17,19 @@ package org.apache.spark.ml.clustering +import scala.util.Random + import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.Model import org.apache.spark.ml.linalg.{Vector, Vectors} -import org.apache.spark.ml.util.DefaultReadWriteTest -import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans} +import org.apache.spark.ml.param.{ParamMap, ParamPair} +import org.apache.spark.ml.util.{DefaultReadWriteTest, Identifiable} --- End diff -- unused. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80834115 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +312,10 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) --- End diff -- It looks like the `fromCenters` method never got implemented. I think this is important since otherwise users can only set an initial model if they already have a trained KMeans model. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80833730 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -300,15 +301,23 @@ private[ml] object DefaultParamsWriter { paramMap: Option[JValue] = None): String = { val uid = instance.uid val cls = instance.getClass.getName -val params = instance.extractParamMap().toSeq.asInstanceOf[Seq[ParamPair[Any]]] +val params = instance.extractParamMap().toSeq + .filter(_.param.name != "initialModel").asInstanceOf[Seq[ParamPair[Any]]] val jsonParams = paramMap.getOrElse(render(params.map { case ParamPair(p, v) => p.name -> parse(p.jsonEncode(v)) }.toList)) +// If the instance has an "initialModel" param and the param is defined, then the initial model +// will be saved along with the instance. +val initialModelFlag = + instance.hasParam("initialModel") && instance.isDefined(instance.getParam("initialModel")) val basicMetadata = ("class" -> cls) ~ ("timestamp" -> System.currentTimeMillis()) ~ ("sparkVersion" -> sc.version) ~ ("uid" -> uid) ~ - ("paramMap" -> jsonParams) + ("paramMap" -> jsonParams) ~ + // TODO: Figure out more robust way to detect the existing of the initialModel. --- End diff -- If we're going to leave it as a TODO, let's have a Jira for it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80833366 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => } val model = estimator.fit(dataset) -// Test Estimator save/load -val estimator2 = testDefaultReadWrite(estimator) -testParams.foreach { case (p, v) => - val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) +val testFunctions = if (testParams.contains("initialModel")) { + Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => Unit])) --- End diff -- I don't think it's a good idea. The code is technically incorrect. At the very least, we could pass in a map of functions for each param that has a special check. Like adding an extra method argument `checkParamsFunctions: Map[String, (Any, Any) => Unit] = Map.empty`. I don't think it's an elegant solution but it should at least apply generically. I think it might be reasonable to override the equals method, but we should check with @jkbradley. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80630417 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -81,11 +81,23 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe def getInitSteps: Int = $(initSteps) /** + * Param for KMeansModel to use for warm start. + * @group param + */ + final val initialModel: Param[KMeansModel] = +new Param[KMeansModel](this, "initialModel", "A KMeansModel for warm start.") + + /** * Validates and transforms the input schema. * @param schema input schema * @return output schema */ protected def validateAndTransformSchema(schema: StructType): StructType = { +if (isDefined(initialModel)) { + val kOfInitialModel = $(initialModel).parentModel.clusterCenters.length + require(kOfInitialModel == $(k), +s"mismatched cluster count, ${$(k)} cluster centers required but $kOfInitialModel found.") +} --- End diff -- Good catch, I move the check to fit() --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80630435 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -39,7 +40,7 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => * @tparam T ML instance type * @return Instance loaded from file */ - def testDefaultReadWrite[T <: Params with MLWritable]( + def testDefaultReadWrite[T <: Params with MLWritable, M <: Model[M]]( --- End diff -- deleted --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80630450 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => } val model = estimator.fit(dataset) -// Test Estimator save/load -val estimator2 = testDefaultReadWrite(estimator) -testParams.foreach { case (p, v) => - val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) +val testFunctions = if (testParams.contains("initialModel")) { + Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => Unit])) +} else { + Map.empty[String, (Any, Any) => Unit] } +// Test Estimator save/load +val estimator2 = testDefaultReadWrite(estimator, testParams = false) +compareParamsWithComplexTypes(estimator, estimator2, testParams, testFunctions) + // Test Model save/load -val model2 = testDefaultReadWrite(model) -testParams.foreach { case (p, v) => - val param = model.getParam(p) - assert(model.get(param).get === model2.get(param).get) -} +val model2 = testDefaultReadWrite(model, testParams = false) +compareParamsWithComplexTypes(model, model2, testParams, testFunctions) checkModelData(model, model2) } } +class MyModel extends Model[MyModel] { --- End diff -- deleted --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80630372 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => } val model = estimator.fit(dataset) -// Test Estimator save/load -val estimator2 = testDefaultReadWrite(estimator) -testParams.foreach { case (p, v) => - val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) +val testFunctions = if (testParams.contains("initialModel")) { + Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => Unit])) --- End diff -- I put a TODO here. For the current situation, I think it's OK. Let's figure it out later? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r80565861 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => } val model = estimator.fit(dataset) -// Test Estimator save/load -val estimator2 = testDefaultReadWrite(estimator) -testParams.foreach { case (p, v) => - val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) +val testFunctions = if (testParams.contains("initialModel")) { + Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => Unit])) --- End diff -- On second thought, it really might actually be easiest to override the equals method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r79957341 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -81,11 +81,23 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe def getInitSteps: Int = $(initSteps) /** + * Param for KMeansModel to use for warm start. + * @group param + */ + final val initialModel: Param[KMeansModel] = +new Param[KMeansModel](this, "initialModel", "A KMeansModel for warm start.") + + /** * Validates and transforms the input schema. * @param schema input schema * @return output schema */ protected def validateAndTransformSchema(schema: StructType): StructType = { +if (isDefined(initialModel)) { + val kOfInitialModel = $(initialModel).parentModel.clusterCenters.length + require(kOfInitialModel == $(k), +s"mismatched cluster count, ${$(k)} cluster centers required but $kOfInitialModel found.") +} --- End diff -- This method is called at the start of training, and right now this is the only place we check `k`. TBH, I think it would be cleaner to implement a isInitialModelValid method in the companion object which checks the k and the feature dimensions all in one place. Checking the k parameter here is a bit odd since it's not exactly schema related. Seems cleaner to me to check everything initialModel related in one place. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r79958291 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -139,16 +145,32 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR val kmeans = new KMeans() testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) } + --- End diff -- Just want to note here, that we don't need to verify that setting the wrong type of model throws an error anymore, since it won't compile :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r79955766 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -139,16 +146,61 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR val kmeans = new KMeans() testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) } + + test("Initialize using given cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using wrong model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10) + +val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1) +val wrongKModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongKModel).fit(dataset) +} +assert(wrongKModelThrown.getMessage.contains("mismatched cluster count")) + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +val wrongDimModelThrown = intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} +assert(wrongDimModelThrown.getMessage.contains("mismatched dimension")) + } } object KMeansSuite { + + class MockModel(override val uid: String) extends Model[MockModel] { --- End diff -- never used? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r79955656 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => } val model = estimator.fit(dataset) -// Test Estimator save/load -val estimator2 = testDefaultReadWrite(estimator) -testParams.foreach { case (p, v) => - val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) +val testFunctions = if (testParams.contains("initialModel")) { + Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => Unit])) --- End diff -- There is an assumption here that checkModelData will work on the initial models, which is not always true. In the `HasInitialModel` trait we do not require the initial model to be of the same type of the model which inherits it. So, if we ever use a model that has an initial model other than itself, we will get runtime errors here. We either need to update this to handle arbitrary initial models, or specify somehow that the initial model must be the same type of the model which inherits it. I prefer updating the tests, but I spent some time tinkering with it and didn't come up with anything that wasn't really clunky. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r79958476 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -22,10 +22,11 @@ import java.io.{File, IOException} import org.scalatest.Suite import org.apache.spark.SparkFunSuite -import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.{Estimator, Model, PipelineStage} import org.apache.spark.ml.param._ import org.apache.spark.mllib.util.MLlibTestSparkContext -import org.apache.spark.sql.Dataset +import org.apache.spark.sql.{DataFrame, Dataset} +import org.apache.spark.sql.types.StructType --- End diff -- unused imports after `MyModel` is removed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r79958909 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -39,7 +40,7 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => * @tparam T ML instance type * @return Instance loaded from file */ - def testDefaultReadWrite[T <: Params with MLWritable]( + def testDefaultReadWrite[T <: Params with MLWritable, M <: Model[M]]( --- End diff -- what is this for? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r79947128 --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/DefaultReadWriteTest.scala --- @@ -107,24 +133,34 @@ trait DefaultReadWriteTest extends TempDirectory { self: Suite => } val model = estimator.fit(dataset) -// Test Estimator save/load -val estimator2 = testDefaultReadWrite(estimator) -testParams.foreach { case (p, v) => - val param = estimator.getParam(p) - assert(estimator.get(param).get === estimator2.get(param).get) +val testFunctions = if (testParams.contains("initialModel")) { + Map(("initialModel", checkModelData.asInstanceOf[(Any, Any) => Unit])) +} else { + Map.empty[String, (Any, Any) => Unit] } +// Test Estimator save/load +val estimator2 = testDefaultReadWrite(estimator, testParams = false) +compareParamsWithComplexTypes(estimator, estimator2, testParams, testFunctions) + // Test Model save/load -val model2 = testDefaultReadWrite(model) -testParams.foreach { case (p, v) => - val param = model.getParam(p) - assert(model.get(param).get === model2.get(param).get) -} +val model2 = testDefaultReadWrite(model, testParams = false) +compareParamsWithComplexTypes(model, model2, testParams, testFunctions) checkModelData(model, model2) } } +class MyModel extends Model[MyModel] { --- End diff -- This is never used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r79936136 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -81,11 +81,23 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe def getInitSteps: Int = $(initSteps) /** + * Param for KMeansModel to use for warm start. + * @group param + */ + final val initialModel: Param[KMeansModel] = +new Param[KMeansModel](this, "initialModel", "A KMeansModel for warm start.") + + /** * Validates and transforms the input schema. * @param schema input schema * @return output schema */ protected def validateAndTransformSchema(schema: StructType): StructType = { +if (isDefined(initialModel)) { + val kOfInitialModel = $(initialModel).parentModel.clusterCenters.length + require(kOfInitialModel == $(k), +s"mismatched cluster count, ${$(k)} cluster centers required but $kOfInitialModel found.") +} --- End diff -- Why do you need to check this? This should already be checked in training. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r78690235 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -139,16 +146,72 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR val kmeans = new KMeans() testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) } + + test("Initialize using given cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using wrong model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10) +val wrongTypeModel = new KMeansSuite.MockModel() + +withClue("The type of an initial model should only be a KMeansModel.") { + intercept[IllegalArgumentException] { +kmeans.setInitialModel(wrongTypeModel).fit(dataset) + } +} + +val wrongKModel = KMeansSuite.generateRandomKMeansModel(3, k + 1) +withClue("The number of clusters set in the given model should be the same with the one set" + + " in the KMeans estimator.") { + intercept[IllegalArgumentException] { +kmeans.setInitialModel(wrongKModel).fit(dataset) + } +} + +val wrongDimModel = KMeansSuite.generateRandomKMeansModel(4, k) +withClue("The dimension of points in the model should be the same with the dimension of the" + --- End diff -- `withClue` just prepends the provided message to the error message when the exception is not thrown as expected. I think we should check that the correct exception is thrown: scala val thrown = intercept[IllegalArgumentException] { ... } assert(thrown.getMessage.contains("mismatched dimension")) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r78689714 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) + + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: Model[_]): this.type = { --- End diff -- As a follow on, we could eliminate the setter `def setInitialModel(value: Model[_])`. To have better documentation, we could leave the param as abstract in the `HasInitialModel` trait: scala def hasInitialModel: Param[T] Then, when we add this to new models, we implement the param there. So, in KMeansParams: scala /** * Param for KMeansModel to use for warm start". * @group param */ final val hasInitialModel: Param[KMeansModel] = new Param[KMeansModel](this, "initialModel", "A KMeansModel to use for warm start") That way the params are explicit in what type of model is used for initial model and the documentation is more clear. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r78682701 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) + + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: Model[_]): this.type = { --- End diff -- +1 on using `KMeansModel.fromCenters(centers)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r78669003 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) + + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: Model[_]): this.type = { --- End diff -- Is the reason we provide this method just so we can throw a better error message? I am concerned about providing _three different_ setter methods for this param, particularly, are we going to have to do this every time? There are ways to provide smarter error messages and more specific param docs, which may be better than adding extra setters. I see why we need to provide a way to set the initial model with just cluster centers, but I think we should limit the "convenience" methods we add, since we plan to extend this design to many other models. For instance, in logistic regression do we add: scala def setInitialModel(coefficients: Vector, intercept: Double) = ... def setInitialModel(coefficients: Vector) = setInitialModel(coefficients, 0.0) def setInitialModel(coefficients: Matrix, intercept: Vector) = ... def setInitialModel(coefficients: Matrix) = ... I think we can get away with only specifying one setter method, as we do in the other params. To allow users to specify the model from centers we could add a method like `KMeansModel.fromCenters(centers)` and users can use that. I appreciate others' thoughts on this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r78610830 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -139,16 +145,32 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR val kmeans = new KMeans() testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) } + --- End diff -- +1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r78604875 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -203,10 +219,13 @@ object KMeansModel extends MLReadable[KMeansModel] { /** [[MLWriter]] instance for [[KMeansModel]] */ private[KMeansModel] class KMeansModelWriter(instance: KMeansModel) extends MLWriter { +import org.json4s.JsonDSL._ --- End diff -- Not used. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77834310 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -139,16 +146,61 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR val kmeans = new KMeans() testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) } + + test("Initialize using given cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using wrong model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10) +val wrongTypeModel = new KMeansSuite.MockModel() + assert(!kmeans.setInitialModel(wrongTypeModel).isSet(kmeans.initialModel)) + +val wrongKModel = KMeansSuite.generateKMeansModel(3, k + 1) +intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongKModel).fit(dataset) +} + +val wrongDimModel = KMeansSuite.generateKMeansModel(4, k) +intercept[IllegalArgumentException] { + kmeans.setInitialModel(wrongDimModel).fit(dataset) +} + } } object KMeansSuite { + + class MockModel(override val uid: String) extends Model[MockModel] { + +def this() = this(Identifiable.randomUID("mockModel")) + +override def copy(extra: ParamMap): MockModel = throw new NotImplementedError() + +override def transform(dataset: Dataset[_]): DataFrame = throw new NotImplementedError() + +override def transformSchema(schema: StructType): StructType = throw new NotImplementedError() + } + def generateKMeansData(spark: SparkSession, rows: Int, dim: Int, k: Int): DataFrame = { val sc = spark.sparkContext val rdd = sc.parallelize(1 to rows).map(i => Vectors.dense(Array.fill(dim)((i % k).toDouble))) - .map(v => new TestRow(v)) + .map(v => TestRow(v)) spark.createDataFrame(rdd) } + def generateKMeansModel(dim: Int, k: Int, seed: Int = 42): KMeansModel = { --- End diff -- can we call it `generateRandomKMeansModel`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77758918 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -137,6 +138,17 @@ class KMeansModel private[ml] ( @Since("1.6.0") override def write: MLWriter = new KMeansModel.KMeansModelWriter(this) + override def hashCode(): Int = { +(Array(this.getClass, uid) ++ clusterCenters) --- End diff -- @yinxusen Correct me if I'm wrong, but I believe you override the equals method is because the params are checked for equality in the read/write tests. Just thinking ahead, we will have to do this for every model we use as an initial model. We can avoid this by adding some handling inside the read/write params test, and then checking the initial model equality for read/write inside the `checkModelData` method. I guess I'd prefer not to randomly overwrite some models equals methods, and not others, especially since the reasoning behind won't be clear. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77737542 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -446,6 +463,20 @@ private[ml] object DefaultParamsReader { val cls = Utils.classForName(metadata.className) cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) } + + def loadAndSetInitialModel[M <: Model[M]]( + instance: HasInitialModel[M], metadata: Metadata, path: String, sc: SparkContext): Unit = { +implicit val format = DefaultFormats +// Try to load the initial model +if (metadata.metadata \ "initialModel" != JNothing) { + val hasInitialModel = (metadata.metadata \ "initialModel").extract[Boolean] + if (hasInitialModel) { +val initialModelPath = new Path(path, "initialModel").toString +val initialModel = loadParamsInstance[Model[M]](initialModelPath, sc) --- End diff -- why not `loadParamsInstance[M]`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77737428 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -446,6 +463,20 @@ private[ml] object DefaultParamsReader { val cls = Utils.classForName(metadata.className) cls.getMethod("read").invoke(null).asInstanceOf[MLReader[T]].load(path) } + + def loadAndSetInitialModel[M <: Model[M]]( + instance: HasInitialModel[M], metadata: Metadata, path: String, sc: SparkContext): Unit = { +implicit val format = DefaultFormats +// Try to load the initial model +if (metadata.metadata \ "initialModel" != JNothing) { + val hasInitialModel = (metadata.metadata \ "initialModel").extract[Boolean] + if (hasInitialModel) { +val initialModelPath = new Path(path, "initialModel").toString +val initialModel = loadParamsInstance[Model[M]](initialModelPath, sc) +instance.set(instance.getParam("initialModel"), initialModel) --- End diff -- can access the param directly here instead of `getParam` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77737335 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -139,16 +146,61 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR val kmeans = new KMeans() testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) } + + test("Initialize using given cluster centers") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(1) +val oneIterModel = kmeans.fit(dataset) +val twoIterModel = kmeans.copy(ParamMap(ParamPair(kmeans.maxIter, 2))).fit(dataset) +val oneMoreIterModel = kmeans.setInitialModel(oneIterModel).fit(dataset) + +twoIterModel.clusterCenters.zip(oneMoreIterModel.clusterCenters) + .foreach { case (center1, center2) => assert(center1 ~== center2 absTol 1E-8) } + } + + test("Initialize using wrong model") { +val kmeans = new KMeans().setK(k).setSeed(1).setMaxIter(10) +val wrongTypeModel = new KMeansSuite.MockModel() + assert(!kmeans.setInitialModel(wrongTypeModel).isSet(kmeans.initialModel)) + +val wrongKModel = KMeansSuite.generateKMeansModel(3, k + 1) +intercept[IllegalArgumentException] { --- End diff -- In other places with similar checks in ML, we typically check that the error message contains the expected message. I'd prefer to keep with that pattern here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77737026 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala --- @@ -303,6 +322,29 @@ class KMeans @Since("1.5.0") ( @Since("1.5.0") def setSeed(value: Long): this.type = set(seed, value) + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: KMeansModel): this.type = set(initialModel, value) + + /** @group setParam */ + @Since("2.1.0") + def setInitialModel(value: Model[_]): this.type = { +value match { + case m: KMeansModel => setInitialModel(m) + case other => +logWarning(s"KMeansModel required but ${other.getClass.getSimpleName} found.") --- End diff -- I prefer an error here. Other parameters throw errors when invalid values are set, so I don't see any reason to deviate from that behavior here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77734912 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -318,6 +327,14 @@ private[ml] object DefaultParamsWriter { val metadataJson: String = compact(render(metadata)) metadataJson } + + def saveInitialModel[T <: HasInitialModel[_ <: MLWritable]](instance: T, path: String): Unit = { +if (instance.isDefined(instance.getParam("initialModel"))) { + val initialModelPath = new Path(path, "initialModel").toString + val initialModel = instance.getOrDefault(instance.getParam("initialModel")) + initialModel.asInstanceOf[MLWritable].save(initialModelPath) --- End diff -- No need for the `asInstanceOf` anymore. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77734849 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -318,6 +327,14 @@ private[ml] object DefaultParamsWriter { val metadataJson: String = compact(render(metadata)) metadataJson } + + def saveInitialModel[T <: HasInitialModel[_ <: MLWritable]](instance: T, path: String): Unit = { +if (instance.isDefined(instance.getParam("initialModel"))) { --- End diff -- Since we specify the type parameters, you can access the param directly `instance.isDefined(instance.initialModel)` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #11119: [SPARK-10780][ML] Add an initial model to kmeans
Github user yinxusen commented on a diff in the pull request: https://github.com/apache/spark/pull/9#discussion_r77414688 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala --- @@ -139,16 +145,32 @@ class KMeansSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultR val kmeans = new KMeans() testEstimatorAndModelReadWrite(kmeans, dataset, KMeansSuite.allParamSettings, checkModelData) } + + test("Initialize using given cluster centers") { --- End diff -- I think the current test is OK to assert the right behavior of initialModel. And it's more economic to test with only one or two iterations. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org