[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-12 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-156221541
  
Hm, good point.  OK I'll try that & ping you on the PR.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155610446
  
Got an OK from @mengxr offline, so merging with master and branch-1.6.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/9513





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155610833
  
I'll see about sending a follow-up with the subclassing. Let me know if there's anything else I'm forgetting.

Thanks @feynmanliang and @hhbyyh for reviewing!





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155376552
  
+1 on the renames

On Tue, Nov 10, 2015, 02:48 Apache Spark QA wrote:

> Test build #2025 has finished for PR 9513 at commit 16a061c.
>
> - This patch passes all tests.
> - This patch merges cleanly.
> - This patch adds the following public classes (experimental):
>   class LDA @Since("1.6.0") (
>
> —
> Reply to this email directly or view it on GitHub.






[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155630055
  
I still think it's wrong for a `LocalLDAModel` to *optionally* have an `OldLocalLDAModel` when it's basically a wrapper for `OldLocalLDAModel`. Forking the inheritance structure could avoid that by localizing the `Option[OldLocalLDAModel]` to `DistributedLDAModel` (we could still keep the "`copy` iff already `collect`ed" semantics) while also removing the `case Some(...) => ... case None => /* should never happen */` matches. A sketch of the proposed structure follows.
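
(A minimal sketch of the structure being proposed, with hypothetical class names and constructors rather than the PR's actual code: the local wrapper owns its old model unconditionally, and only the distributed wrapper carries the Option.)

import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
  LocalLDAModel => OldLocalLDAModel}
import org.apache.spark.mllib.linalg.Matrix

// Local wrapper: always has an OldLocalLDAModel, so no Option and no
// `case None => /* should never happen */` matches.
class LocalLDAModelSketch(val oldLocalModel: OldLocalLDAModel) {
  def topicsMatrix: Matrix = oldLocalModel.topicsMatrix
}

// Distributed wrapper: the Option lives here, and is Some(...) iff
// topicsMatrix has already been collected to the driver.
class DistributedLDAModelSketch(
    oldDistributedModel: OldDistributedLDAModel,
    private var oldLocalModelOption: Option[OldLocalLDAModel]) {

  def toLocal: LocalLDAModelSketch = {
    if (oldLocalModelOption.isEmpty) {
      oldLocalModelOption = Some(oldDistributedModel.toLocal)  // collect at most once
    }
    new LocalLDAModelSketch(oldLocalModelOption.get)
  }
}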





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155625846
  
Actually, @feynmanliang I realized as I was trying to rewrite this that using a lazy val for DistributedLDAModel.oldLocalModel rules out an important optimization in DistributedLDAModel.copy, which is called every time we call model.transform:
* Currently: We copy the local model only if it has already been instantiated (instantiation involves collecting the topicsMatrix to the driver).
* With a lazy val: I don't see a good way to ensure the collect only happens once.

Given that this could mean considerable overhead, including several copies of topicsMatrix on the driver, I'd prefer to keep the current class structure.
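
(A hedged illustration of the tradeoff, with invented class names: Scala exposes no public way to ask whether a lazy val has been initialized, so a copy cannot reuse an already-collected model without forcing the collect itself.)

import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
  LocalLDAModel => OldLocalLDAModel}

class WithLazyVal(oldDistModel: OldDistributedLDAModel) {
  // First access collects topicsMatrix to the driver.
  lazy val oldLocalModel: OldLocalLDAModel = oldDistModel.toLocal

  def copy(): WithLazyVal = {
    // No public "is it initialized yet?" check exists for a lazy val, so the
    // copy starts with its own uninitialized lazy val and will re-collect.
    new WithLazyVal(oldDistModel)
  }
}

class WithOption(
    oldDistModel: OldDistributedLDAModel,
    oldLocalModelOption: Option[OldLocalLDAModel]) {

  def copy(): WithOption = {
    // The Option makes "already collected?" observable, so the copy can
    // share the collected model instead of collecting topicsMatrix again.
    new WithOption(oldDistModel, oldLocalModelOption)
  }
}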





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155627493
  
Oh wait I see what you're saying





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155627361
  
@jkbradley Not sure I understand: if `lazy val oldModel = something.collect()`, then `collect()` will only be called once, on the first reference to `oldModel`, and every subsequent reference to `oldModel` will use the `Array[...]` materialized by `collect()`.
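
(A plain-Scala demonstration of the lazy val semantics described above; no Spark required. Note the caching is per instance, which is where the copy() concern raised earlier in the thread comes in.)

class Demo {
  lazy val expensive: Array[Int] = {
    println("computing...")  // runs once per Demo instance, on first access
    Array(1, 2, 3)
  }
}

val d = new Demo
d.expensive           // prints "computing..."
d.expensive           // cached: nothing printed
new Demo().expensive  // a fresh instance computes (and would collect) again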





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44332122
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, HasFeaturesCol,
+  HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
+  EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
+  LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+  OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   *     - Currently only supports symmetric distributions, so all values in the vector should be
+   *       the same.
+   *     - Values should be > 1.0
+   *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
+   *       from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
+   *  - Online
+   *     - Values should be >= 0
+   *     - default = uniformly (1.0 / k), following the implementation from
+   *       [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, "docConcentration",
+    "Concentration parameter (commonly named \"alpha\") for the prior placed on documents'" +
+      " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+    if (alpha.length == 1) {
+      alpha(0) == -1 || alpha(0) >= 1.0
+    } else if (alpha.length > 1) {
+      alpha.forall(_ >= 1.0)
+    } else {
+      false
+    }
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentration]]
+   * @group getParam
+   */
+  @Since("1.6.0")
+  def getAlpha: Array[Double] = getDocConcentration
+
+  /**
+   * Concentration parameter (commonly named "beta"
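
(For context, a usage sketch of the spark.ml estimator this diff introduces. This is a hedged illustration, not code from the PR; it assumes the standard spark.ml fit/transform pattern and an existing `sqlContext`.)

import org.apache.spark.ml.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// Toy corpus: (id, term-count vector) rows in a DataFrame.
val corpus = sqlContext.createDataFrame(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 3.0, 1.0))
)).toDF("id", "features")

val lda = new LDA()
  .setK(3)        // number of topics; validated by the k Param above
  .setMaxIter(10)

val model = lda.fit(corpus)     // Estimator -> Model, as elsewhere in spark.ml
model.transform(corpus).show()  // appends a topic-distribution column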

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44341246
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[Quoted diff context omitted: it repeats the LDA.scala license header, imports, and LDAParams excerpt quoted earlier in this thread.]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44331528
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[Quoted diff context omitted: it repeats the LDA.scala license header, imports, and LDAParams excerpt quoted earlier in this thread.]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44341643
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[Quoted diff context omitted: it repeats the LDA.scala license header, imports, and LDAParams excerpt quoted earlier in this thread.]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44332592
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[Quoted diff context omitted: it repeats the LDA.scala license header, imports, and LDAParams excerpt quoted earlier in this thread.]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155257021
  
Merged build started.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155256986
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155266291
  
**[Test build #2025 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2025/consoleFull)** for PR 9513 at commit [`16a061c`](https://github.com/apache/spark/commit/16a061ca4df6abb59b1cff6695debac7492260ab).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class LDA @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155255905
  
I'm going to go ahead and change the tau0 and kappa names.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155259224
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45452/
Test PASSed.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155265706
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45463/
Test PASSed.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155265703
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155265584
  
**[Test build #45463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45463/consoleFull)** for PR 9513 at commit [`8eaa596`](https://github.com/apache/spark/commit/8eaa596346451ee8a9f8685d2be82ef9a81f2e4e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class LDA @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155257687
  
**[Test build #45463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45463/consoleFull)** for PR 9513 at commit [`8eaa596`](https://github.com/apache/spark/commit/8eaa596346451ee8a9f8685d2be82ef9a81f2e4e).





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155152154
  
**[Test build #45388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45388/consoleFull)** for PR 9513 at commit [`3acc9e0`](https://github.com/apache/spark/commit/3acc9e0643d1c023113d763a7132a5fccc3b9d2c).





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44311160
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[Quoted diff context omitted: it repeats the LDA.scala license header, imports, and LDAParams excerpt quoted earlier in this thread.]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155149943
  
Merged build started.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155149909
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44328340
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[Quoted diff context omitted: it repeats the LDA.scala license header, imports, and LDAParams excerpt quoted earlier in this thread.]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44322164
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[Quoted diff context omitted: it repeats the LDA.scala license header, imports, and LDAParams excerpt quoted earlier in this thread.]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44322120
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[Quoted diff context omitted: it repeats the LDA.scala license header, imports, and LDAParams excerpt quoted earlier in this thread.]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44323247
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[quoted diff omitted: verbatim repeat of the LDA.scala excerpt quoted above; the archived message is truncated before the review comment]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44324308
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[quoted diff omitted: verbatim repeat of the LDA.scala excerpt quoted above; the archived message is truncated before the review comment]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155166627
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155166554
  
**[Test build #45388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45388/consoleFull)** for PR 9513 at commit [`3acc9e0`](https://github.com/apache/spark/commit/3acc9e0643d1c023113d763a7132a5fccc3b9d2c).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class LDA @Since("1.6.0") (`
   * `sealed abstract class LDAOptimizer extends Params `
   * `class EMLDAOptimizer @Since("1.6.0") (`
   * `class OnlineLDAOptimizer @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44323843
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[quoted diff omitted: verbatim repeat of the LDA.scala excerpt quoted above; the archived message is truncated before the review comment]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44323496
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[quoted diff omitted: verbatim repeat of the LDA.scala excerpt quoted above; the archived message is truncated before the review comment]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155185664
  
Second pass. The most significant comments are about removing `Vector` from the public API completely, and about the choice between subclassing (`DistributedLDAModel <: LDAModel`) and an `abstract class LDAModel`.
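For readers skimming the thread, a minimal sketch of the first point, keeping `mllib.linalg.Vector` out of public signatures. This is a hypothetical class, not code from the patch:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // The public surface exposes only Array[Double]; the mllib Vector type
    // appears solely at the internal boundary where the old API is called.
    class AlphaParamSketch {
      private var alpha: Array[Double] = Array(-1.0)
      def setDocConcentration(value: Array[Double]): this.type = { alpha = value; this }
      def getDocConcentration: Array[Double] = alpha
      private def asOldVector: Vector = Vectors.dense(alpha) // handed to mllib internally
    }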





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155170628
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44323084
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[quoted diff omitted: verbatim repeat of the LDA.scala excerpt quoted above; the archived message is truncated before the review comment]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44324500
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[quoted diff omitted: verbatim repeat of the LDA.scala excerpt quoted above; the archived message is truncated before the review comment]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155165981
  
I had not written the tests very carefully, so they had some bugs. Updated now.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155166751
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155166791
  
Merged build started.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44321879
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[quoted diff omitted: verbatim repeat of the LDA.scala excerpt quoted above, down to the line under review:]
+   * - Values should be > 1.0
--- End diff --

Ideally `Either[Double, Vector]` would be best, but I'm not sure if Params can be `Either`s. If not, what you proposed sounds good.
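A sketch of how the singleton-array convention can stand in for an `Either[Double, Vector]` param (an illustrative helper, not code from the patch): during fitting, a length-1 array is expanded to a symmetric length-k prior.

    // Hypothetical helper mirroring the scaladoc above: a singleton alpha is
    // replicated to length k, while a longer array is used as-is.
    def expandAlpha(alpha: Array[Double], k: Int): Array[Double] =
      if (alpha.length == 1) Array.fill(k)(alpha(0)) else alpha

    // expandAlpha(Array(1.1), 4)           == Array(1.1, 1.1, 1.1, 1.1)
    // expandAlpha(Array(1.2, 1.5, 2.0), 3) == Array(1.2, 1.5, 2.0)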





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155258957
  
**[Test build #45452 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45452/consoleFull)** for PR 9513 at commit [`a55de6d`](https://github.com/apache/spark/commit/a55de6dca8a900d91861fad526477973589d9024).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class LDA @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155259222
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44342591
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
[quoted diff omitted: verbatim repeat of the LDA.scala excerpt quoted above; the archived message is truncated before the review comment]

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155227746
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155248167
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155248084
  
@feynmanliang I think that's the last fix.

Thinking more about it, I'm on board with changing LDAModel to be abstract, 
as long as it's a minor change.  I'll see about making it in a follow-up PR.
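For concreteness, a stripped-down sketch of the subclassing design being agreed to here. The members are hypothetical; the real classes wrap the mllib models and carry Params:

    // Hypothetical shape of the abstract-base-class design under discussion.
    abstract class LDAModel(val k: Int) {
      def isDistributed: Boolean
    }

    class LocalLDAModel(k: Int) extends LDAModel(k) {
      override def isDistributed: Boolean = false
    }

    class DistributedLDAModel(k: Int) extends LDAModel(k) {
      override def isDistributed: Boolean = true
      // Distributed-only members (e.g. a toLocal conversion) would live here.
      def toLocal: LocalLDAModel = new LocalLDAModel(k)
    }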





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155248180
  
Merged build started.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44354690
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,668 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
+  EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
+  LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+  OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters) to infer. Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of topics (clusters) to infer",
+    ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger values mean more
+   * smoothing (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   *     - Currently only supports symmetric distributions, so all values in the vector
+   *       should be the same.
+   *     - Values should be > 1.0
+   *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and
+   *       +1 follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
+   *  - Online
+   *     - Values should be >= 0
+   *     - default = uniformly (1.0 / k), following the implementation from
+   *       [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, "docConcentration",
+    "Concentration parameter (commonly named \"alpha\") for the prior placed on documents'" +
+      " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+    if (alpha.length == 1) {
+      alpha(0) == -1 || alpha(0) >= 1.0
+    } else if (alpha.length > 1) {
+      alpha.forall(_ >= 1.0)
+    } else {
+      false
+    }
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics'
+   * distributions over terms.
+   *
+   * This is the parameter to a symmetric Dirichlet
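Stepping back from the (truncated) diff above, a rough end-to-end sketch of how the estimator in this patch is meant to be used. The column name and the fit/transform flow follow the standard ml Estimator pattern; exact method availability in this version is an assumption:

    import org.apache.spark.ml.clustering.LDA
    import org.apache.spark.sql.DataFrame

    // dataset: a DataFrame with a "features" column of term-count vectors.
    def runLDA(dataset: DataFrame): DataFrame = {
      val lda = new LDA().setK(10).setMaxIter(20).setFeaturesCol("features")
      val model = lda.fit(dataset)   // returns the fitted LDAModel
      model.transform(dataset)       // appends a topic-distribution column
    }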

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44354638
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,668 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, 
HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => 
OldDistributedLDAModel,
+EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => 
OldLDAModel,
+LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol 
with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters) to infer. Must be > 1. 
Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of topics (clusters) to 
infer",
+ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
+   * - default = uniformly (50 / k) + 1, where 50/k is common in LDA 
libraries and +1 follows
+   *   from Asuncion et al. (2009), who recommend a +1 adjustment for 
EM.
+   *  - Online
+   * - Values should be >= 0
+   * - default = uniformly (1.0 / k), following the implementation from
+   *   [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, 
"docConcentration",
+"Concentration parameter (commonly named \"alpha\") for the prior 
placed on documents'" +
+  " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+    if (alpha.length == 1) {
+      alpha(0) == -1 || alpha(0) >= 1.0
+    } else if (alpha.length > 1) {
+      alpha.forall(_ >= 1.0)
+    } else {
+      false
+    }
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Concentration parameter (commonly named "beta" or "eta") for the 
prior placed on topics'
+   * distributions over terms.
+   *
+   * This is the parameter to a symmetric Dirichlet 
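
For readers following the docConcentration contract quoted above, here is a
minimal Scala sketch of the three accepted shapes. The setter names (setK,
setDocConcentration) are assumed from the usual spark.ml convention; only the
getters appear in the quoted hunk. Note that with the EM optimizer and k = 10,
the automatic default works out to 50/10 + 1 = 6.0 per topic.

    import org.apache.spark.ml.clustering.LDA

    object DocConcentrationShapes {
      val lda = new LDA().setK(10)

      lda.setDocConcentration(Array(-1.0))          // [-1]: set automatically in fitting
      lda.setDocConcentration(Array(3.0))           // singleton [alpha]: replicated to length k
      lda.setDocConcentration(Array.fill(10)(1.1))  // explicit vector: must have length k
      // Array(0.5) would fail validDocConcentration: a singleton must be -1 or >= 1.0.
    }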

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44354764
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,668 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155249388
  
**[Test build #45452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45452/consoleFull)** for PR 9513 at commit [`a55de6d`](https://github.com/apache/spark/commit/a55de6dca8a900d91861fad526477973589d9024).





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155240086
  
**[Test build #2025 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2025/consoleFull)** for PR 9513 at commit [`16a061c`](https://github.com/apache/spark/commit/16a061ca4df6abb59b1cff6695debac7492260ab).





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155243816
  
LGTM

If we do decide to change the inheritance structure, it should be done 
before the 1.6 release to avoid breaking public APIs.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155224281
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155224308
  
Merged build started.





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155225172
  
Updated per all comments, including changing optimizer to be a String 
Param.  Ready for final review.
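
For context, a String Param constrained to a fixed set of values is typically
declared along the lines of the sketch below; the description text and the two
value names ("online", "em") are assumptions here, not quotes from the patch.

    import org.apache.spark.ml.param.{Param, ParamValidators, Params}

    // Sketch of an optimizer Param as a validated String; names are illustrative.
    trait HasOptimizer extends Params {

      final val optimizer: Param[String] = new Param[String](this, "optimizer",
        "Optimizer or inference algorithm used to estimate the LDA model",
        ParamValidators.inArray(Array("online", "em")))

      def getOptimizer: String = $(optimizer)
    }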

The only remaining issue to debate is whether to make LDAModel abstract.  I 
agree with @feynmanliang that it's cleaner in terms of implementing a lazy val. 
 The main cons I see are:
* Will users be confused about getting an abstract type when they look at 
the docs?  Probably not, but worth mentioning.
* I do not want us to require beginner users to cast to LocalLDAModel or 
DistributedLDAModel (see the sketch below).  That means we would need to make 
sure not to add functionality to LocalLDAModel without also requiring it in the 
abstract LDAModel.

If we decide to make this change, we can do it in a follow-up PR.  Feel 
free to discuss below.
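
As referenced in the second bullet above, the downcast in question would look
roughly like this sketch. The class names follow the quoted diffs, while the
setOptimizer setter and fit returning LDAModel are assumptions about the final
API rather than quotes from it.

    import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel}
    import org.apache.spark.sql.DataFrame

    // Sketch: with an abstract LDAModel, optimizer-specific members force a downcast.
    def fitAndInspect(dataset: DataFrame): Unit = {
      val model: LDAModel = new LDA().setK(10).setOptimizer("em").fit(dataset)
      model match {
        case distributed: DistributedLDAModel =>
          // Only the EM-trained, distributed model carries the extra members
          // under discussion, so callers needing them must match or cast.
          println(s"distributed model: ${distributed.uid}")
        case local =>
          println(s"local model: ${local.uid}")
      }
    }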





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44238864
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
+  EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
+  LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+  OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
+   * - default = uniformly (50 / k) + 1, where 50/k is common in LDA 
libraries and +1 follows
+   *   from Asuncion et al. (2009), who recommend a +1 adjustment for 
EM.
+   *  - Online
+   * - Values should be >= 0
+   * - default = uniformly (1.0 / k), following the implementation from
+   *   [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, 
"docConcentration",
+"Concentration parameter (commonly named \"alpha\") for the prior 
placed on documents'" +
+  " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+    if (alpha.length == 1) {
+      alpha(0) == -1 || alpha(0) >= 1.0
+    } else if (alpha.length > 1) {
+      alpha.forall(_ >= 1.0)
+    } else {
+      false
+    }
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentration]]
+   * @group getParam
+   */
+  @Since("1.6.0")
+  def getAlpha: Array[Double] = getDocConcentration
--- End diff --

I'm neutral on this.
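
For what the alias amounts to: getAlpha delegates to the single docConcentration
Param rather than defining a second Param, so the two getters can never
disagree. A minimal sketch of the pattern, with the setter counterpart assumed
rather than quoted:

    // Inside the LDAParams trait quoted above: alias accessors delegate to the
    // one docConcentration Param, keeping a single source of truth.
    def getAlpha: Array[Double] = getDocConcentration

    // Assumed setter counterpart, following the same delegation pattern:
    def setAlpha(value: Array[Double]): this.type = set(docConcentration, value)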



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44238520
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
--- End diff --

Number of topics to infer?
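
The wording the updated revision quoted earlier in this thread settles on:

    final val k = new IntParam(this, "k", "number of topics (clusters) to infer",
      ParamValidators.gt(1))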





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239868
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239874
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239862
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239852
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentration]]
+   * @group getParam
+   */
+  @Since("1.6.0")
+  def getAlpha: Array[Double] = getDocConcentration
--- End diff --

Hm, yeah, I neglected to mention this in the 

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239869
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239863
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239846
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
--- End diff --

ok





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239877
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239860
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
+    EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
+    LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+    OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   *     - Currently only supports symmetric distributions, so all values in the vector should
+   *       be the same.
+   *     - Values should be > 1.0
+   *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1
+   *       follows from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
+   *  - Online
+   *     - Values should be >= 0
+   *     - default = uniformly (1.0 / k), following the implementation from
+   *       [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, "docConcentration",
+    "Concentration parameter (commonly named \"alpha\") for the prior placed on documents'" +
+      " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+    if (alpha.length == 1) {
+      alpha(0) == -1 || alpha(0) >= 1.0
+    } else if (alpha.length > 1) {
+      alpha.forall(_ >= 1.0)
+    } else {
+      false
+    }
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentration]]
+   * @group getParam
+   */
+  @Since("1.6.0")
+  def getAlpha: Array[Double] = getDocConcentration
+
+  /**
+   * Concentration parameter (commonly named "beta"

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239866
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239881
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239848
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   *     - Currently only supports symmetric distributions, so all values in the vector should
+   *       be the same.
+   *     - Values should be > 1.0
--- End diff ---

How about:
* Default is specified by not having the Param set.
* A single vector gets replicated.
* A Vector of length k is OK too.
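
Concretely, those three cases would read roughly as follows (a sketch; setDocConcentration is assumed to mirror the getter in the diff):

    val lda = new LDA().setK(3)
    // 1. Default: leave docConcentration unset and let the optimizer choose.
    // 2. A single value, replicated to length k during fitting:
    lda.setDocConcentration(Array(1.1))
    // 3. A full length-k vector, taken as-is:
    lda.setDocConcentration(Array(1.1, 1.5, 2.0))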






[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-154913375
  
@feynmanliang @hhbyyh Thank you for the comments!  I've updated the code 
but haven't pushed it yet in case @hhbyyh is still reviewing.

@hhbyyh Can you please say when you're done reviewing (so I don't clobber 
your Github comments)?





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239878
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239875
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239857
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44239865
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44240090
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-08 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-154933249
  
Yes, please go ahead. @jkbradley 





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202059
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202038
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
+  /**
+   * Alias for [[getDocConcentration]]
+   * @group getParam
+   */
+  @Since("1.6.0")
+  def getAlpha: Array[Double] = getDocConcentration
--- End diff ---

-1 on having both `alpha` and 
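
For reference, the two getters in the excerpt are exact aliases over a single underlying Param, which appears to be the duplication at issue (a sketch; the setter is assumed to mirror the getters):

    val lda = new LDA().setDocConcentration(Array(1.1))
    lda.getDocConcentration   // Array(1.1)
    lda.getAlpha              // Array(1.1): getAlpha simply forwards to getDocConcentration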

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203476
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203542
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
+  EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
+  LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+  OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   *     - Currently only supports symmetric distributions, so all values in the vector should be
+   *       the same.
+   *     - Values should be > 1.0
+   *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
+   *       from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
+   *  - Online
+   *     - Values should be >= 0
+   *     - default = uniformly (1.0 / k), following the implementation from
+   *       [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, "docConcentration",
+    "Concentration parameter (commonly named \"alpha\") for the prior placed on documents'" +
+      " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+    if (alpha.length == 1) {
+      alpha(0) == -1 || alpha(0) >= 1.0
+    } else if (alpha.length > 1) {
+      alpha.forall(_ >= 1.0)
+    } else {
+      false
+    }
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentration]]
+   * @group getParam
+   */
+  @Since("1.6.0")
+  def getAlpha: Array[Double] = getDocConcentration
+
+  /**
+   * Concentration parameter (commonly named
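
To make the quoted validation concrete, here is a minimal standalone sketch (illustrative code, not part of the PR) that mirrors the validDocConcentration logic above; the sample arrays are made up:

object DocConcentrationCheck {
  // Mirrors the validator quoted above: a singleton [-1] means "set
  // automatically"; any other singleton must be >= 1.0 (it is replicated to
  // a length-k vector during fitting); a longer vector must be elementwise
  // >= 1.0; an empty array is rejected.
  def validDocConcentration(alpha: Array[Double]): Boolean = {
    if (alpha.length == 1) {
      alpha(0) == -1 || alpha(0) >= 1.0
    } else if (alpha.length > 1) {
      alpha.forall(_ >= 1.0)
    } else {
      false
    }
  }

  def main(args: Array[String]): Unit = {
    println(validDocConcentration(Array(-1.0)))          // true: automatic default
    println(validDocConcentration(Array(2.5)))           // true: replicated to length k in fit
    println(validDocConcentration(Array(1.5, 2.0, 1.5))) // true: explicit length-k vector
    println(validDocConcentration(Array(0.5, 0.5, 0.5))) // false: values below 1.0
    println(validDocConcentration(Array.empty[Double]))  // false: empty array
  }
}

Note that the multi-value branch rejects anything in [0, 1), even though the Online optimizer section above says values >= 0 are acceptable; a review comment later in this thread flags exactly that mismatch.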

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202262
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203245
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203695
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-154590891
  
Made a pass





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202348
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202796
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203600
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202645
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202165
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202952
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202889
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202541
  

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202484
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
+   * Optimizer-specific parameter settings:
+   *  - EM
+   *     - Currently only supports symmetric distributions, so all values in the vector should be
+   *       the same.
+   *     - Values should be > 1.0
--- End diff --

IMO this validation logic is quite confusing and was there for backwards compatibility. Since we have this opportunity to implement a new API, I suggest:
 * Ditching the singleton vector option, requiring the user to specify a length `k` vector
 * Keeping the automatic init as the default, making the API easy for novice users

The only feature that is lost is replication of `docConcentration` > 0 to a symmetric prior.
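
One reading of that suggestion, as a minimal sketch (illustrative code, not from the PR): accept either the automatic marker [-1] or an explicit length-k vector, and leave optimizer-specific range checks elsewhere. Since the quoted validator must work "independently of other Params", a real length-k check could not live in the Param validator itself and would have to happen in fit(); the k argument below is purely for illustration.

// Hypothetical simplified validator per the suggestion above (name made up):
// accept the automatic marker [-1], or an explicit vector of length k with
// nonnegative entries, dropping the singleton-replication case.
def proposedDocConcentrationCheck(alpha: Array[Double], k: Int): Boolean = {
  val automatic = alpha.length == 1 && alpha(0) == -1
  val explicit  = alpha.length == k && alpha.forall(_ >= 0.0)
  automatic || explicit
}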





[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203006
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
+  EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
+  LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+  OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   *     - Currently only supports symmetric distributions, so all values in the vector should be
+   *       the same.
+   *     - Values should be > 1.0
+   *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
+   *       from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
+   *  - Online
+   *     - Values should be >= 0
+   *     - default = uniformly (1.0 / k), following the implementation from
+   *       [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, "docConcentration",
+    "Concentration parameter (commonly named \"alpha\") for the prior placed on documents'" +
+      " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+    if (alpha.length == 1) {
+      alpha(0) == -1 || alpha(0) >= 1.0
+    } else if (alpha.length > 1) {
+      alpha.forall(_ >= 1.0)
+    } else {
+      false
+    }
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentration]]
+   * @group getParam
+   */
+  @Since("1.6.0")
+  def getAlpha: Array[Double] = getDocConcentration
+
+  /**
+   * Concentration parameter (commonly named 
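
For reference, the singleton-replication behavior documented in the diff above,
which the suggestion in the earlier comment would drop, amounts to roughly the
following standalone sketch (hypothetical helper, not the PR's code):

```scala
// Sketch of the documented docConcentration handling (hypothetical helper):
// a singleton [-1] means "automatic"; any other singleton is replicated into
// a symmetric length-k prior; otherwise the vector must already have length k.
def expandDocConcentration(alpha: Array[Double], k: Int): Array[Double] = alpha match {
  case Array(-1.0)            => Array.fill(k)(-1.0)  // automatic: resolved by the optimizer
  case Array(a)               => Array.fill(k)(a)     // singleton replicated to length k
  case arr if arr.length == k => arr                  // explicit length-k vector
  case arr => throw new IllegalArgumentException(
    s"docConcentration must have length 1 or k = $k, but got length ${arr.length}")
}
```

For example, expandDocConcentration(Array(2.0), 5) yields the symmetric prior
[2.0, 2.0, 2.0, 2.0, 2.0], matching the EM case described in the doc comment.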

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202982
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
