[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647470
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -1358,6 +1363,41 @@ private[spark] abstract class SerDeBase {
   }
 }.toJavaRDD()
   }
+
+  /**
+   * Convert a DStream of Java objects to a DStream of serialized Python objects, that is
+   * usable by PySpark.
+   */
+  def javaToPython(jDStream: JavaDStream[Any]): JavaDStream[Array[Byte]] = {
+    val dStream = jDStream.dstream.mapPartitions { iter =>
+      initialize()  // let it be called in the executor
+      new SerDeUtil.AutoBatchedPickler(iter)
+    }
+    new JavaDStream[Array[Byte]](dStream)
+  }
+
+  /**
+   * Convert a DStream of serialized Python objects to a DStream of objects, that is
+   * usable by PySpark.
+   */
+  def pythonToJava(pyDStream: JavaDStream[Array[Byte]], batched: Boolean): JavaDStream[Any] = {
+    val dStream = pyDStream.dstream.mapPartitions { iter =>
+      initialize()  // let it be called in the executor
+      val unpickle = new Unpickler
+      iter.flatMap { row =>
+        val obj = unpickle.loads(row)
+        if (batched) {
+          obj match {
+            case list: JArrayList[_] => list.asScala
+            case arr: Array[_] => arr
+          }
+        } else {
+          Seq(obj)
+        }
+      }
+    }
+    new JavaDStream[Any](dStream)
--- End diff --

No need for this constructor, just return `dStream` (in fact, don't even 
need the `dStream` variable)





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647311
  
--- Diff: examples/src/main/python/mllib/streaming_test_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Create a DStream that contains several RDDs to show the StreamingTest of 
PySpark.
+"""
+import time
+import tempfile
+from shutil import rmtree
+
+from pyspark import SparkContext
+from pyspark.streaming import StreamingContext
+from pyspark.mllib.stat.test import BinarySample, StreamingTest
+
+if __name__ == "__main__":
+
+sc = SparkContext(appName="PythonStreamingTestExample")
+ssc = StreamingContext(sc, 1)
+
+checkpoint_path = tempfile.mkdtemp()
+ssc.checkpoint(checkpoint_path)
+
+# Create the queue through which RDDs can be pushed to a 
QueueInputDStream.
+rdd_queue = []
+for i in range(5):
+rdd_queue += [ssc.sparkContext.parallelize(
+[BinarySample(True, j) for j in range(1, 1001)], 10)]
+
+# Create the QueueInputDStream and use it do some processing.
+input_stream = ssc.queueStream(rdd_queue)
+
+model = StreamingTest()
+test_result = model.registerStream(input_stream)
+
+test_result.pprint()
+
+ssc.start()
+time.sleep(12)
--- End diff --

Is this necessary? Doesn't seem to be required for `streaming_kmeans` 
example
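For comparison, the example could end without the fixed sleep, along these lines (a sketch only; the timeout value and the stop/cleanup calls are illustrative, not taken from the PR):

    ssc.start()
    # Wait for the queued batches to be processed rather than sleeping
    # for a fixed 12 seconds.
    ssc.awaitTerminationOrTimeout(30)
    ssc.stop(stopSparkContext=True, stopGraceFully=True)
    rmtree(checkpoint_path)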





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647567
  
--- Diff: python/pyspark/mllib/stat/test.py ---
@@ -80,3 +85,118 @@ class KolmogorovSmirnovTestResult(TestResult):
 """
 Contains test results for the Kolmogorov-Smirnov test.
 """
+
+
+class BinarySample(namedtuple("BinarySample", ["isExperiment", "value"])):
+    """
+    Represents a (isExperiment, value) tuple.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __reduce__(self):
+        return BinarySample, (bool(self.isExperiment), float(self.value))
+
+
+class StreamingTestResult(namedtuple("StreamingTestResult",
+                                     ["pValue", "degreesOfFreedom", "statistic", "method",
+                                      "nullHypothesis"])):
+    """
+    Contains test results for StreamingTest.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __reduce__(self):
+        return StreamingTestResult, (float(self.pValue),
+                                     float(self.degreesOfFreedom), float(self.statistic),
+                                     str(self.method), str(self.nullHypothesis))
+
+
+class StreamingTest(object):
+    """
+    .. note:: Experimental
+
+    Online 2-sample significance testing for a stream of (Boolean, Double) pairs. The Boolean
+    identifies which sample each observation comes from, and the Double is the numeric value
+    of the observation.
+
+    To address novelty effects, the `peacePeriod` specifies a set number of initial RDD
+    batches of the DStream to be dropped from significance testing.
+
+    The `windowSize` sets the number of batches each significance test is to be performed
+    over. The window is sliding with a stride length of 1 batch. Setting windowSize to 0 will
+    perform cumulative processing, using all batches seen so far.
+
+    Different tests may be used for assessing statistical significance depending on
+    assumptions satisfied by the data. For more details, see StreamingTestMethod. The
+    `testMethod` specifies which test will be used.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __init__(self):
+        self._peacePeriod = 0
+        self._windowSize = 0
+        self._testMethod = "welch"
+
+    @since('2.0.0')
+    def setPeacePeriod(self, peacePeriod):
+        """
+        Update peacePeriod
+        :param peacePeriod:
+          Set number of initial RDD batches of the DStream to be dropped from significance
+          testing.
+        """
+        self._peacePeriod = peacePeriod
+
+    @since('2.0.0')
+    def setWindowSize(self, windowSize):
+        """
+        Update windowSize
+        :param windowSize:
+          Set the number of batches each significance test is to be performed over.
+        """
+        self._windowSize = windowSize
+
+    @since('2.0.0')
+    def setTestMethod(self, testMethod):
+        """
+        Update test method
+        :param testMethod:
+          Currently supported tests: `welch`, `student`.
+        """
+        assert testMethod in ("welch", "student"), \
+            "Currently supported tests: \"welch\", \"student\""
+        self._testMethod = testMethod
+
+    @since('2.0.0')
+    def registerStream(self, data):
--- End diff --

nit: data -> dstream





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647915
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/stat/test/TestResult.scala ---
@@ -114,7 +114,7 @@ class KolmogorovSmirnovTestResult private[stat] (
  * Object containing the test results for streaming testing.
  */
 @Since("1.6.0")
-private[stat] class StreamingTestResult @Since("1.6.0") (
+class StreamingTestResult @Since("1.6.0") (
--- End diff --

Does this need to be public? Java API doesn't seem to require it





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647656
  
--- Diff: python/pyspark/mllib/tests.py ---
@@ -1688,6 +1689,44 @@ def test_binary_term_freqs(self):
": expected " + str(expected[i]) + ", 
got " + str(output[i]))
 
 
+class StreamingTestTest(PySparkStreamingTestCase):
+    def test_streaming_test_result_and_model(self):
+        """
+        Assert that StreamingTest returns a valid result, and exercise its set methods.
+        """
+
+        checkpoint_path = tempfile.mkdtemp()
+        self.ssc.checkpoint(checkpoint_path)
--- End diff --

Is this necessary?





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647664
  
--- Diff: python/pyspark/mllib/tests.py ---
@@ -1688,6 +1689,44 @@ def test_binary_term_freqs(self):
": expected " + str(expected[i]) + ", 
got " + str(output[i]))
 
 
+class StreamingTestTest(PySparkStreamingTestCase):
+    def test_streaming_test_result_and_model(self):
+        """
+        Assert that StreamingTest returns a valid result, and exercise its set methods.
+        """
+
+        checkpoint_path = tempfile.mkdtemp()
+        self.ssc.checkpoint(checkpoint_path)
+
+        # Create the queue through which RDDs can be pushed to a QueueInputDStream.
+        rdd_queue = []
+        for i in range(5):
+            rdd_queue += [self.ssc.sparkContext.parallelize(
+                [BinarySample(True, j) for j in range(1, 1001)], 10)]
+
+        # Create the QueueInputDStream and use it to do some processing.
+        input_stream = self.ssc.queueStream(rdd_queue)
+
+        model = StreamingTest()
+        model.setPeacePeriod(1)
--- End diff --

Can we break this into another test just for model params like 
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/mllib/tests.py#L1165?
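For illustration, a separate params-only test could look roughly like this (a sketch modeled on the linked k-means test; the method name and body below are illustrative, not from the PR):

    def test_streaming_test_model_params(self):
        """Check that the setters update the model's parameters."""
        model = StreamingTest()
        model.setPeacePeriod(1)
        model.setWindowSize(2)
        model.setTestMethod("student")
        self.assertEqual(model._peacePeriod, 1)
        self.assertEqual(model._windowSize, 2)
        self.assertEqual(model._testMethod, "student")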





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647454
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -1358,6 +1363,41 @@ private[spark] abstract class SerDeBase {
   }
 }.toJavaRDD()
   }
+
+  /**
+   * Convert a DStream of Java objects to a DStream of serialized Python objects, that is
+   * usable by PySpark.
+   */
+  def javaToPython(jDStream: JavaDStream[Any]): JavaDStream[Array[Byte]] = {
+    val dStream = jDStream.dstream.mapPartitions { iter =>
+      initialize()  // let it be called in the executor
+      new SerDeUtil.AutoBatchedPickler(iter)
--- End diff --

This should be an `Array[Byte]`, so you don't need to assign to `dStream` 
and then call the constructor on L1376





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647843
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -1576,17 +1616,49 @@ private[spark] object SerDe extends SerDeBase with Serializable {
 }
   }
 
+  private[python] class BinarySamplePickler extends BasePickler[BinarySample] {
+    def saveState(obj: Object, out: OutputStream, pickler: Pickler): Unit = {
+      val binarySample = obj.asInstanceOf[BinarySample]
+      saveObjects(out, pickler, binarySample.isExperiment, binarySample.value)
+    }
+
+    def construct(args: Array[AnyRef]): AnyRef = {
+      if (args.length != 2) {
+        throw new PickleException("should be 2")
+      }
+      BinarySample(args(0).asInstanceOf[Boolean], args(1).asInstanceOf[Double])
+    }
+  }
+
+  private[python] class StreamingTestResultPickler extends BasePickler[StreamingTestResult] {
--- End diff --

Do we need to test these in `PythonMLLibAPISuite`?
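Even if the Scala suite is skipped, a Python-side round trip would exercise both picklers; roughly (a sketch, assuming the new picklers are registered via `SerDe.initialize()` as elsewhere in the PR):

    from pyspark.serializers import PickleSerializer
    from pyspark.mllib.stat.test import BinarySample

    def _check_serde_roundtrip(sc, sample):
        # Python pickle -> JVM object -> Python pickle again.
        ser = PickleSerializer()
        serde = sc._jvm.org.apache.spark.mllib.api.python.SerDe
        jobj = serde.loads(bytearray(ser.dumps(sample)))
        roundtripped = ser.loads(bytes(serde.dumps(jobj)))
        assert roundtripped == sample

    # e.g. _check_serde_roundtrip(sc, BinarySample(True, 1.0))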





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647607
  
--- Diff: python/pyspark/mllib/stat/test.py ---
@@ -80,3 +85,118 @@ class KolmogorovSmirnovTestResult(TestResult):
 """
 Contains test results for the Kolmogorov-Smirnov test.
 """
+
+
+class BinarySample(namedtuple("BinarySample", ["isExperiment", "value"])):
+    """
+    Represents a (isExperiment, value) tuple.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __reduce__(self):
+        return BinarySample, (bool(self.isExperiment), float(self.value))
+
+
+class StreamingTestResult(namedtuple("StreamingTestResult",
+                                     ["pValue", "degreesOfFreedom", "statistic", "method",
+                                      "nullHypothesis"])):
+    """
+    Contains test results for StreamingTest.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __reduce__(self):
+        return StreamingTestResult, (float(self.pValue),
+                                     float(self.degreesOfFreedom), float(self.statistic),
+                                     str(self.method), str(self.nullHypothesis))
+
+
+class StreamingTest(object):
+    """
+    .. note:: Experimental
+
+    Online 2-sample significance testing for a stream of (Boolean, Double) pairs. The Boolean
+    identifies which sample each observation comes from, and the Double is the numeric value
+    of the observation.
+
+    To address novelty effects, the `peacePeriod` specifies a set number of initial RDD
+    batches of the DStream to be dropped from significance testing.
+
+    The `windowSize` sets the number of batches each significance test is to be performed
+    over. The window is sliding with a stride length of 1 batch. Setting windowSize to 0 will
+    perform cumulative processing, using all batches seen so far.
+
+    Different tests may be used for assessing statistical significance depending on
+    assumptions satisfied by the data. For more details, see StreamingTestMethod. The
+    `testMethod` specifies which test will be used.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __init__(self):
+        self._peacePeriod = 0
+        self._windowSize = 0
+        self._testMethod = "welch"
+
+    @since('2.0.0')
+    def setPeacePeriod(self, peacePeriod):
+        """
+        Update peacePeriod
+        :param peacePeriod:
+          Set number of initial RDD batches of the DStream to be dropped from significance
+          testing.
+        """
+        self._peacePeriod = peacePeriod
+
+    @since('2.0.0')
+    def setWindowSize(self, windowSize):
+        """
+        Update windowSize
+        :param windowSize:
+          Set the number of batches each significance test is to be performed over.
+        """
+        self._windowSize = windowSize
+
+    @since('2.0.0')
+    def setTestMethod(self, testMethod):
+        """
+        Update test method
+        :param testMethod:
+          Currently supported tests: `welch`, `student`.
+        """
+        assert testMethod in ("welch", "student"), \
+            "Currently supported tests: \"welch\", \"student\""
+        self._testMethod = testMethod
+
+    @since('2.0.0')
+    def registerStream(self, data):
+        """
+        Register a data stream to get its test result.
+
+        :param data:
+          The input data stream, each element is a BinarySample instance.
+        """
+        self._validate(data)
+        sc = SparkContext._active_spark_context
+
+        streamingTest = sc._jvm.org.apache.spark.mllib.stat.test.StreamingTest()
--- End diff --

Why did we not define a `StreamingTestModel` and assign an instance to 
`_model` (like in streaming K means)?
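For what it's worth, that pattern would look roughly like this here (a sketch only; the class name is illustrative, and the DStream argument would still need the Python/Java conversion discussed later in this thread):

    from pyspark.mllib.common import JavaModelWrapper

    class StreamingTestModel(JavaModelWrapper):
        """Thin wrapper around the JVM StreamingTest object (illustrative)."""

        def registerStream(self, dstream):
            # JavaModelWrapper.call forwards to the wrapped JVM object's
            # method of the same name.
            return self.call("registerStream", dstream)

`StreamingTest` would then build the JVM object once and keep it as `self._model`, the way `StreamingKMeans` keeps a `StreamingKMeansModel`.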






[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647232
  
--- Diff: examples/src/main/python/mllib/streaming_test_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Create a DStream that contains several RDDs to show the StreamingTest of PySpark.
+"""
+import time
+import tempfile
+from shutil import rmtree
+
+from pyspark import SparkContext
+from pyspark.streaming import StreamingContext
+from pyspark.mllib.stat.test import BinarySample, StreamingTest
+
+if __name__ == "__main__":
+
--- End diff --

nit: don't include newline here





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647320
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -44,7 +44,7 @@ import org.apache.spark.mllib.regression._
 import org.apache.spark.mllib.stat.{KernelDensity, MultivariateStatisticalSummary, Statistics}
 import org.apache.spark.mllib.stat.correlation.CorrelationNames
 import org.apache.spark.mllib.stat.distribution.MultivariateGaussian
-import org.apache.spark.mllib.stat.test.{ChiSqTestResult, KolmogorovSmirnovTestResult}
+import org.apache.spark.mllib.stat.test._
--- End diff --

Why are we using a wildcard import here?





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647221
  
--- Diff: examples/src/main/python/mllib/streaming_test_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Create a DStream that contains several RDDs to show the StreamingTest of PySpark.
+"""
+import time
+import tempfile
+from shutil import rmtree
+
+from pyspark import SparkContext
+from pyspark.streaming import StreamingContext
+from pyspark.mllib.stat.test import BinarySample, StreamingTest
+
+if __name__ == "__main__":
+
+    sc = SparkContext(appName="PythonStreamingTestExample")
+    ssc = StreamingContext(sc, 1)
+
+    checkpoint_path = tempfile.mkdtemp()
--- End diff --

Is this necessary?





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647632
  
--- Diff: python/pyspark/mllib/tests.py ---
@@ -1688,6 +1689,44 @@ def test_binary_term_freqs(self):
": expected " + str(expected[i]) + ", 
got " + str(output[i]))
 
 
+class StreamingTestTest(PySparkStreamingTestCase):
--- End diff --

Why not `MLLibStreamingTestCase`?





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647207
  
--- Diff: examples/src/main/python/mllib/streaming_test_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Create a DStream that contains several RDDs to show the StreamingTest of PySpark.
+"""
--- End diff --

Seems like other examples are including a `from __future__ import 
print_function` here





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647534
  
--- Diff: python/pyspark/mllib/stat/test.py ---
@@ -15,10 +15,15 @@
 # limitations under the License.
 #
 
+from collections import namedtuple
+
+from pyspark import SparkContext, since
 from pyspark.mllib.common import inherit_doc, JavaModelWrapper
+from pyspark.streaming.dstream import DStream
 
 
-__all__ = ["ChiSqTestResult", "KolmogorovSmirnovTestResult"]
+__all__ = ["ChiSqTestResult", "KolmogorovSmirnovTestResult", 
"BinarySample", "StreamingTest",
--- End diff --

nit: alphabetize





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647905
  
--- Diff: python/pyspark/mllib/stat/test.py ---
@@ -80,3 +85,118 @@ class KolmogorovSmirnovTestResult(TestResult):
 """
 Contains test results for the Kolmogorov-Smirnov test.
 """
+
+
+class BinarySample(namedtuple("BinarySample", ["isExperiment", "value"])):
+    """
+    Represents a (isExperiment, value) tuple.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __reduce__(self):
+        return BinarySample, (bool(self.isExperiment), float(self.value))
+
+
+class StreamingTestResult(namedtuple("StreamingTestResult",
+                                     ["pValue", "degreesOfFreedom", "statistic", "method",
+                                      "nullHypothesis"])):
+    """
+    Contains test results for StreamingTest.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __reduce__(self):
+        return StreamingTestResult, (float(self.pValue),
+                                     float(self.degreesOfFreedom), float(self.statistic),
+                                     str(self.method), str(self.nullHypothesis))
+
+
+class StreamingTest(object):
+    """
+    .. note:: Experimental
+
+    Online 2-sample significance testing for a stream of (Boolean, Double) pairs. The Boolean
+    identifies which sample each observation comes from, and the Double is the numeric value
+    of the observation.
+
+    To address novelty effects, the `peacePeriod` specifies a set number of initial RDD
+    batches of the DStream to be dropped from significance testing.
+
+    The `windowSize` sets the number of batches each significance test is to be performed
+    over. The window is sliding with a stride length of 1 batch. Setting windowSize to 0 will
+    perform cumulative processing, using all batches seen so far.
+
+    Different tests may be used for assessing statistical significance depending on
+    assumptions satisfied by the data. For more details, see StreamingTestMethod. The
+    `testMethod` specifies which test will be used.
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __init__(self):
+        self._peacePeriod = 0
+        self._windowSize = 0
+        self._testMethod = "welch"
+
+    @since('2.0.0')
+    def setPeacePeriod(self, peacePeriod):
+        """
+        Update peacePeriod
+        :param peacePeriod:
+          Set number of initial RDD batches of the DStream to be dropped from significance
+          testing.
+        """
+        self._peacePeriod = peacePeriod
+
+    @since('2.0.0')
+    def setWindowSize(self, windowSize):
+        """
+        Update windowSize
+        :param windowSize:
+          Set the number of batches each significance test is to be performed over.
+        """
+        self._windowSize = windowSize
+
+    @since('2.0.0')
+    def setTestMethod(self, testMethod):
+        """
+        Update test method
+        :param testMethod:
+          Currently supported tests: `welch`, `student`.
+        """
+        assert testMethod in ("welch", "student"), \
+            "Currently supported tests: \"welch\", \"student\""
+        self._testMethod = testMethod
+
+    @since('2.0.0')
+    def registerStream(self, data):
+        """
+        Register a data stream to get its test result.
+
+        :param data:
+          The input data stream, each element is a BinarySample instance.
+        """
+        self._validate(data)
+        sc = SparkContext._active_spark_context
+
+        streamingTest = sc._jvm.org.apache.spark.mllib.stat.test.StreamingTest()
+        streamingTest.setPeacePeriod(self._peacePeriod)
+        streamingTest.setWindowSize(self._windowSize)
+        streamingTest.setTestMethod(self._testMethod)
+
+        javaDStream = sc._jvm.SerDe.pythonToJava(data._jdstream, True)
+        testResult = streamingTest.registerStream(javaDStream)
--- End diff --

Why do we need `pythonToJava` and `javaToPython`? They're not used for streaming k-means:
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/python/pyspark/mllib/clustering.py#L773
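For reference, the linked streaming k-means code stays on the Python side of the DStream, roughly like this (paraphrased, not copied verbatim from clustering.py):

    def trainOn(self, dstream):
        """Train the model on the incoming dstream."""
        self._validate(dstream)

        def update(rdd):
            # Each batch is handled as a plain RDD, so the existing RDD-level
            # SerDe helpers cover the Python <-> Java conversion.
            self._model.update(rdd, self._decayFactor, self._timeUnit)

        dstream.foreachRDD(update)

That per-RDD route may be why no DStream-level `pythonToJava`/`javaToPython` was needed there.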



[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647272
  
--- Diff: examples/src/main/python/mllib/streaming_test_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Create a DStream that contains several RDDs to show the StreamingTest of PySpark.
+"""
+import time
+import tempfile
+from shutil import rmtree
+
+from pyspark import SparkContext
+from pyspark.streaming import StreamingContext
+from pyspark.mllib.stat.test import BinarySample, StreamingTest
--- End diff --

`$example on$` and `$example off` appear to be used in other examples, 
though I'm not sure why myself
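For context, those markers are what the generated docs use to pull just the marked region of an example into the user guide (via the include_example tag); the usage would be roughly (illustrative only):

    # $example on$
    from pyspark.mllib.stat.test import BinarySample, StreamingTest
    # ... the part of the example that should appear in the docs ...
    # $example off$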





[GitHub] spark pull request #11374: [SPARK-12042] Python API for mllib.stat.test.Stre...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11374#discussion_r85647289
  
--- Diff: examples/src/main/python/mllib/streaming_test_example.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Create a DStream that contains several RDDs to show the StreamingTest of PySpark.
+"""
+import time
+import tempfile
+from shutil import rmtree
+
+from pyspark import SparkContext
+from pyspark.streaming import StreamingContext
+from pyspark.mllib.stat.test import BinarySample, StreamingTest
+
+if __name__ == "__main__":
+
+    sc = SparkContext(appName="PythonStreamingTestExample")
+    ssc = StreamingContext(sc, 1)
+
+    checkpoint_path = tempfile.mkdtemp()
+    ssc.checkpoint(checkpoint_path)
+
+    # Create the queue through which RDDs can be pushed to a QueueInputDStream.
+    rdd_queue = []
--- End diff --

use `camelCase` to be consistent with other examples





[GitHub] spark issue #11374: [SPARK-12042] Python API for mllib.stat.test.StreamingTe...

2016-10-29 Thread feynmanliang
Github user feynmanliang commented on the issue:

https://github.com/apache/spark/pull/11374
  
Apologies for the delay, I am traveling but I'll get this done this 
weekend. 





[GitHub] spark issue #11374: [SPARK-12042] Python API for mllib.stat.test.StreamingTe...

2016-10-28 Thread feynmanliang
Github user feynmanliang commented on the issue:

https://github.com/apache/spark/pull/11374
  
I'll review this tonight 





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172811882
  
@dbtsai validating coefficients with R will be harder than expected: `glmnet` requires a feature dimension >= 2, and `glm` doesn't yield +/- Infinity intercepts when given all-0/1 datasets...





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50006902
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
h

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50006858
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
--- End diff --

OK





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50007441
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the 
same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val allZeroModel = lr
+  .setLabelCol("zeroLabel")
+  .fit(sameLabels)
+assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+val allOneModel = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

OK





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172556919
  
@dbtsai Added `fitIntercept=false` tests and fixed comments/`logWarning` 
messages.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50041066
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
h

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50041068
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are zero and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
h

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49946338
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with all labels the same") {
+    val lr = new LogisticRegression()
+      .setFitIntercept(true)
+      .setMaxIter(3)
+    val sameLabels = dataset
+      .withColumn("zeroLabel", lit(0.0))
+      .withColumn("oneLabel", lit(1.0))
+
+    val model = lr
+      .setLabelCol("oneLabel")
+      .fit(sameLabels)
+
+    assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+    assert(model.intercept === Double.PositiveInfinity)
--- End diff --

Thanks for pointing that out!





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49946694
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -339,9 +339,11 @@ class LogisticRegression @Since("1.2.0") (
  b = \log{P(1) / P(0)} = \log{count_1 / count_0}
  }}}
*/
-  initialCoefficientsWithIntercept.toArray(numFeatures)
-= math.log(histogram(1) / histogram(0))
-}
+   if (histogram.length >= 2) { // check to make sure indexing into histogram(1) is safe
+ initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
+   histogram(1) / histogram(0))
--- End diff --

Thank you!
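
For readers who want to try the initialization formula from the diff above,
b = log(count_1 / count_0), outside of Spark, here is a small self-contained sketch
with assumed names; it also shows why the length guard matters when only one label
class was observed:

    object InterceptInitSketch {
      // Prior-based intercept initialization: b = log(count_1 / count_0).
      // Guarding on histogram.length avoids indexing histogram(1) when only a
      // single label class was seen.
      def initialIntercept(histogram: Array[Double]): Double = {
        if (histogram.length >= 2) math.log(histogram(1) / histogram(0)) else 0.0
      }

      def main(args: Array[String]): Unit = {
        println(initialIntercept(Array(30.0, 70.0)))   // ~0.847, the usual two-class case
        println(initialIntercept(Array(0.0, 100.0)))   // Infinity: every label is 1.0
        println(initialIntercept(Array(100.0)))        // 0.0: guard prevents an out-of-bounds read
      }
    }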


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172328753
  
@dbtsai @jkbradley ready for second review.

The big diff is because I grouped the same label cases with the normal case 
to generate `coefficients`, `intercept`, and `objectiveTrace` all in the same 
block. This is to avoid repeated code when generating the model summary.
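
To make the refactoring concrete, here is a hedged sketch with simplified, assumed
names (the real optimization path is replaced by a stub); the point is that the
degenerate all-same-label cases and the normal path all produce the same
(coefficients, intercept, objectiveHistory) tuple, so the summary-building code is
written once:

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    object SameLabelCasesSketch {
      // All three branches yield the same tuple shape, so downstream code that
      // builds the training summary does not need to be duplicated.
      def fit(labels: Array[Double], numFeatures: Int): (Vector, Double, Array[Double]) = {
        if (labels.forall(_ == 1.0)) {
          (Vectors.zeros(numFeatures), Double.PositiveInfinity, Array.empty[Double])
        } else if (labels.forall(_ == 0.0)) {
          (Vectors.zeros(numFeatures), Double.NegativeInfinity, Array.empty[Double])
        } else {
          // Stand-in for the real L-BFGS/OWL-QN optimization path.
          (Vectors.dense(Array.fill(numFeatures)(0.1)), 0.0, Array(1.0, 0.5, 0.25))
        }
      }

      def main(args: Array[String]): Unit = {
        println(fit(Array(1.0, 1.0, 1.0), numFeatures = 2))
        println(fit(Array(0.0, 1.0, 0.0), numFeatures = 2))
      }
    }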


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread feynmanliang
GitHub user feynmanliang opened a pull request:

https://github.com/apache/spark/pull/10743

[SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same 
label training data



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/feynmanliang/spark SPARK-12804

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10743.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10743


commit fbf6b5cab51544c9230567e9479528a9bd8960c5
Author: Feynman Liang <feynman.li...@gmail.com>
Date:   2016-01-13T17:52:56Z

Initial fix and println unit test

commit e4c13d4a89abc8160f1c2fa906cb3e3d1affd473
Author: Feynman Liang <feynman.li...@gmail.com>
Date:   2016-01-13T19:23:49Z

Cleans up test




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-7128][ML] Bagging (bootstrap aggregatin...

2016-01-05 Thread feynmanliang
Github user feynmanliang closed the pull request at:

https://github.com/apache/spark/pull/8618


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-26 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9974#issuecomment-159869256
  
LGTM, though I didn't know what was wrong with the old PR so this should 
get a second set of eyes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11960][MLlib][Doc]User guide for stream...

2015-11-26 Thread feynmanliang
GitHub user feynmanliang opened a pull request:

https://github.com/apache/spark/pull/10005

[SPARK-11960][MLlib][Doc]User guide for streaming tests

CC @jkbradley @mengxr @josepablocam

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/feynmanliang/spark streaming-test-user-guide

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10005.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10005


commit d1f015354305b67689801f19ba368a480e421d50
Author: Feynman Liang <feynman.li...@gmail.com>
Date:   2015-11-26T13:41:22Z

Adds streaming test user guide




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-20 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9722#issuecomment-158329286
  
Jenkins test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-20 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9722#issuecomment-158329258
  
LGTM


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-19 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45314139
  
--- Diff: docs/ml-clustering.md ---
@@ -0,0 +1,25 @@
+---
+layout: global
+title: Clustering - ML
+displayTitle: ML - Clustering
+---
+
+In `spark.ml`, we implement the corresponding pipeline API for 
--- End diff --

Sgtm

On Thu, Nov 19, 2015, 01:56 Yuhao Yang <notificati...@github.com> wrote:

> In docs/ml-clustering.md
> <https://github.com/apache/spark/pull/9722#discussion_r45290684>:
>
> > @@ -0,0 +1,25 @@
> > +---
> > +layout: global
> > +title: Clustering - ML
> > +displayTitle: ML - Clustering
> > +---
> > +
> > +In `spark.ml`, we implement the corresponding pipeline API for
>
> How about, "In this section, we introduce the pipeline API for clustering
> in mllib <http://mllib-clustering.html>" ?
>
> —
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/9722/files#r45290684>.
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-19 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45382273
  
--- Diff: docs/ml-guide.md ---
@@ -40,6 +40,7 @@ Also, some algorithms have additional capabilities in the 
`spark.ml` API; e.g.,
 provide class probabilities, and linear models provide model summaries.
 
 * [Feature extraction, transformation, and selection](ml-features.html)
+* [Clustering](ml-clustering.html)
--- End diff --

Just noticed that "Feature extraction..." is not alphabetized (sorry about 
my earlier comment!). If you don't mind, could you move that down under 
"Ensembles"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-19 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45381957
  
--- Diff: docs/ml-clustering.md ---
@@ -0,0 +1,31 @@
+---
+layout: global
+title: Clustering - ML
+displayTitle: ML - Clustering
+---
+
+In this section, we introduce the corresponding pipeline API for
--- End diff --

I thought this was supposed to read "In this section, we introduce the 
pipeline API for clustering in mllib"? The pipelines API doesn't correspond to 
`mllib` since it's missing a lot of features


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-19 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45382669
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaLDAExample.java ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.ml.clustering.LDA;
+import org.apache.spark.ml.clustering.LDAModel;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.VectorUDT;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.catalyst.expressions.GenericRow;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.regex.Pattern;
+
+/**
+ * An example demonstrating LDA
+ * Run with
+ * 
+ * bin/run-example ml.JavaLDAExample  
+ * 
+ */
+public class JavaLDAExample {
+
+  private static class ParseVector implements Function<String, Row> {
+private static final Pattern separator = Pattern.compile(" ");
+
+@Override
+public Row call(String line) {
+  String[] tok = separator.split(line);
+  double[] point = new double[tok.length];
+  for (int i = 0; i < tok.length; ++i) {
+point[i] = Double.parseDouble(tok[i]);
--- End diff --

I think it might be confusing when a reader of the docs gets two different 
examples after flipping between languages. I'm really sorry, but do you mind 
changing it back so that they match (we can keep the examples using count 
vectors).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-19 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9722#issuecomment-158153601
  
LGTM after changes


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-19 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45382120
  
--- Diff: docs/ml-clustering.md ---
@@ -0,0 +1,31 @@
+---
+layout: global
+title: Clustering - ML
+displayTitle: ML - Clustering
+---
+
+In this section, we introduce the corresponding pipeline API for
+[clustering in mllib](mllib-clustering.html).
+
+## Latent Dirichlet allocation (LDA)
+
+`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` 
and `OnlineLDAOptimizer`,
+and generates a `LDAModel`, as the base models. Expert users may cast a 
`LDAModel` generated by 
--- End diff --

"`LDAModel`, as the base models" -> "`LDAModel` as the base model"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45206456
  
--- Diff: docs/mllib-guide.md ---
@@ -73,6 +73,7 @@ concepts. It also contains sections on using algorithms 
within the Pipelines API
 * [Ensembles](ml-ensembles.html)
 * [Linear methods with elastic net regularization](ml-linear-methods.html)
 * [Multilayer perceptron classifier](ml-ann.html)
+* [Clustering](ml-clustering.html)
--- End diff --

Ditto on sorting


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45206436
  
--- Diff: docs/ml-guide.md ---
@@ -45,6 +45,7 @@ provide class probabilities, and linear models provide 
model summaries.
 * [Linear methods with elastic net regularization](ml-linear-methods.html)
 * [Multilayer perceptron classifier](ml-ann.html)
 * [Survival Regression](ml-survival-regression.html)
+* [Clustering](ml-clustering.html)
--- End diff --

The other links in this section are sorted alphabetically


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45207331
  
--- Diff: docs/ml-clustering.md ---
@@ -0,0 +1,25 @@
+---
+layout: global
+title: Clustering - ML
+displayTitle: ML - Clustering
+---
+
+In `spark.ml`, we implement the corresponding pipeline API for 
+[clustering in mllib](mllib-clustering.html).
+
+## Latent Dirichlet allocation (LDA)
+
+`LDA` is implemented as an Estimator that supports both `EMLDAOptimizer` 
and `OnlineLDAOptimizer`,
+and generates `LocalLDAModel` and `DistributedLDAModel` respectively, as 
the base models.
--- End diff --

This isn't true; `LDA` generates a `LDAModel` which is the superclass of 
wrappers to the two `mllib` classes. I would just say "generates a `LDAModel`".

We can also mention that expert users may cast a `LDAModel` generated by 
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.
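
A hedged sketch of what that expert cast would look like in user code, assuming the
spark.ml LDA API discussed in this thread; the toy data and local master are
illustrative, and this is not the PR's own example:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.clustering.{DistributedLDAModel, LDA, LDAModel}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.sql.SQLContext

    object ExpertCastSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ExpertCastSketch").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Tiny toy term-count vectors; a real corpus would come from a featurization pipeline.
        val dataset = Seq(
          (0L, Vectors.dense(1.0, 2.0, 0.0)),
          (1L, Vectors.dense(0.0, 1.0, 3.0))
        ).toDF("id", "features")

        // EM keeps a distributed representation, so the downcast below is expected
        // to succeed; with the online optimizer it would not.
        val model: LDAModel = new LDA().setK(2).setMaxIter(5).setOptimizer("em").fit(dataset)
        val distributedModel = model.asInstanceOf[DistributedLDAModel]
        println(distributedModel.vocabSize)

        sc.stop()
      }
    }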


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45207931
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaLDAExample.java ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.ml.clustering.LDA;
+import org.apache.spark.ml.clustering.LDAModel;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.VectorUDT;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.catalyst.expressions.GenericRow;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.regex.Pattern;
+
+/**
+ * An example demonstrating LDA
+ * Run with
+ * 
+ * bin/run-example ml.JavaLDAExample  
+ * 
+ */
+public class JavaLDAExample {
+
+  private static class ParseVector implements Function<String, Row> {
+private static final Pattern separator = Pattern.compile(" ");
+
+@Override
+public Row call(String line) {
+  String[] tok = separator.split(line);
+  double[] point = new double[tok.length];
+  for (int i = 0; i < tok.length; ++i) {
+point[i] = Double.parseDouble(tok[i]);
--- End diff --

We're expecting a text file containing count vectors here? Seems a bit odd. 
IMO an example taking a document of text and using pipelines to generate the 
features would be more natural, e.g. 
https://gist.github.com/feynmanliang/3b6555758a27adcb527d
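
For readers curious what that pipeline-based example looks like, here is a hedged
sketch along the same lines; the column names, toy corpus, and local master are
assumptions for illustration, not the gist's exact contents:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.clustering.LDA
    import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer}
    import org.apache.spark.sql.SQLContext

    object TextToLDAPipelineSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TextToLDAPipelineSketch").setMaster("local[2]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Toy corpus standing in for a real text input file.
        val docs = Seq(
          (0L, "spark mllib supports topic modeling"),
          (1L, "latent dirichlet allocation finds topics in documents")
        ).toDF("id", "text")

        // Tokenize raw text, build term-count vectors, then fit LDA, all in one pipeline.
        val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
        val vectorizer = new CountVectorizer().setInputCol("tokens").setOutputCol("features")
        val lda = new LDA().setK(2).setMaxIter(10).setFeaturesCol("features")

        val model = new Pipeline().setStages(Array(tokenizer, vectorizer, lda)).fit(docs)
        model.transform(docs).select("id", "topicDistribution").show()

        sc.stop()
      }
    }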


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45207539
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/ml/JavaLDAExample.java ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml;
+
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.ml.clustering.LDA;
+import org.apache.spark.ml.clustering.LDAModel;
+import org.apache.spark.mllib.linalg.Vector;
+import org.apache.spark.mllib.linalg.VectorUDT;
+import org.apache.spark.mllib.linalg.Vectors;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.catalyst.expressions.GenericRow;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.regex.Pattern;
+
+/**
+ * An example demonstrating LDA
+ * Run with
+ * 
+ * bin/run-example ml.JavaLDAExample  
+ * 
+ */
+public class JavaLDAExample {
+
+  private static class ParseVector implements Function<String, Row> {
--- End diff --

nit: `ParseVector` -> `ParseStringToVector`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45206861
  
--- Diff: docs/ml-clustering.md ---
@@ -0,0 +1,25 @@
+---
+layout: global
+title: Clustering - ML
+displayTitle: ML - Clustering
+---
+
+In `spark.ml`, we implement the corresponding pipeline API for 
--- End diff --

nit: The pipelines API is only a subset of `mllib.clustering` (at least the 
documentation certainly is), so I found this language a bit confusing. Can we 
omit it altogether at least until we have implemented/documented more than a 
single feature?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45206909
  
--- Diff: docs/ml-clustering.md ---
@@ -0,0 +1,25 @@
+---
+layout: global
+title: Clustering - ML
+displayTitle: ML - Clustering
+---
+
+In `spark.ml`, we implement the corresponding pipeline API for 
+[clustering in mllib](mllib-clustering.html).
+
+## Latent Dirichlet allocation (LDA)
+
+`LDA` is implemented as an Estimator that supports both `EMLDAOptimizer` 
and `OnlineLDAOptimizer`,
--- End diff --

\`Estimator\`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45207955
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/ml/LDAExample.scala ---
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.ml
+
+import org.apache.spark.{SparkContext, SparkConf}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors}
+// $example on$
+import org.apache.spark.ml.clustering.LDA
+import org.apache.spark.sql.{Row, SQLContext}
+import org.apache.spark.sql.types.{StructField, StructType}
+// $example off$
+
+/**
+ * An example demonstrating a LDA of ML pipeline.
+ * Run with
+ * {{{
+ * bin/run-example ml.LDAExample  
+ * }}}
+ */
+object LDAExample {
+
+  final val FEATURES_COL = "features"
+
+  def main(args: Array[String]): Unit = {
+if (args.length != 2) {
+  // scalastyle:off println
+  System.err.println("Usage: ml.LDAExample  ")
+  // scalastyle:on println
+  System.exit(1)
+}
+val input = args(0)
+val k = args(1).toInt
+
+// Creates a Spark context and a SQL context
+val conf = new SparkConf().setAppName(s"${this.getClass.getSimpleName}")
+val sc = new SparkContext(conf)
+val sqlContext = new SQLContext(sc)
+
+// $example on$
+// Loads data
+val rowRDD = sc.textFile(input).filter(_.nonEmpty)
+  .map(_.split(" ").map(_.toDouble)).map(Vectors.dense).map(Row(_))
--- End diff --

Ditto about input format being a text file of count vectors


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45207421
  
--- Diff: docs/ml-clustering.md ---
@@ -0,0 +1,25 @@
+---
+layout: global
+title: Clustering - ML
+displayTitle: ML - Clustering
+---
+
+In `spark.ml`, we implement the corresponding pipeline API for 
+[clustering in mllib](mllib-clustering.html).
+
+## Latent Dirichlet allocation (LDA)
+
+`LDA` is implemented as an Estimator that supports both `EMLDAOptimizer` 
and `OnlineLDAOptimizer`,
+and generates `LocalLDAModel` and `DistributedLDAModel` respectively, as 
the base models.
+
+
+
--- End diff --

Please link API docs for each language in code tab before example


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9722#issuecomment-157738214
  
Made a pass


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11689] [ML] Add user guide and example ...

2015-11-18 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9722#discussion_r45206678
  
--- Diff: docs/mllib-guide.md ---
@@ -73,6 +73,7 @@ concepts. It also contains sections on using algorithms 
within the Pipelines API
 * [Ensembles](ml-ensembles.html)
 * [Linear methods with elastic net regularization](ml-linear-methods.html)
 * [Multilayer perceptron classifier](ml-ann.html)
+* [Clustering](ml-clustering.html)
--- End diff --

Actually, do we even need to link it here? I read this section as examples 
of items in `ml` but not `mllib`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11712] [ML] Make spark.ml LDAModel be a...

2015-11-12 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9678#discussion_r44720208
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -468,7 +452,37 @@ class LDAModel private[ml] (
 /**
  * :: Experimental ::
  *
- * Distributed model fitted by [[LDA]] using Expectation-Maximization (EM).
+ * Local (non-distributed) model fitted by [[LDA]].
+ *
+ * This model stores the inferred topics only; it does not store info 
about the training dataset.
+ */
+@Since("1.6.0")
+@Experimental
+class LocalLDAModel private[ml] (
+uid: String,
+vocabSize: Int,
--- End diff --

Since annotations


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11712] [ML] Make spark.ml LDAModel be a...

2015-11-12 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9678#issuecomment-156249049
  
LGTM, all comments minor / optional


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11712] [ML] Make spark.ml LDAModel be a...

2015-11-12 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9678#discussion_r44720409
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -314,31 +314,31 @@ private[clustering] trait LDAParams extends Params 
with HasFeaturesCol with HasM
  * Model fitted by [[LDA]].
  *
  * @param vocabSize  Vocabulary size (number of terms or terms in the vocabulary)
- * @param oldLocalModel  Underlying spark.mllib model.
- *   If this model was produced by Online LDA, then this is the
- *   only model representation.
- *   If this model was produced by EM, then this local
- *   representation may be built lazily.
  * @param sqlContext  Used to construct local DataFrames for returning query results
  */
 @Since("1.6.0")
 @Experimental
-class LDAModel private[ml] (
+sealed abstract class LDAModel private[ml] (
 @Since("1.6.0") override val uid: String,
 @Since("1.6.0") val vocabSize: Int,
-@Since("1.6.0") protected var oldLocalModel: Option[OldLocalLDAModel],
 @Since("1.6.0") @transient protected val sqlContext: SQLContext)
   extends Model[LDAModel] with LDAParams with Logging {
 
-  /** Returns underlying spark.mllib model */
+  // NOTE to developers:
+  //  This abstraction should contain all important functionality for basic LDA usage.
+  //  Specializations of this class can contain expert-only functionality.
+
+  /**
+   * Underlying spark.mllib model.
+   * If this model was produced by Online LDA, then this is the only model representation.
+   * If this model was produced by EM, then this local representation may be built lazily.
+   */
   @Since("1.6.0")
-  protected def getModel: OldLDAModel = oldLocalModel match {
-case Some(m) => m
-case None =>
-  // Should never happen.
-  throw new RuntimeException("LDAModel required local model format," +
-" but the underlying model is missing.")
-  }
+  protected def oldLocalModel: OldLocalLDAModel
+
+  /** Returns underlying spark.mllib model, which may be local or distributed */
+  @Since("1.6.0")
+  protected def getModel: OldLDAModel
--- End diff --

Should we warn that this (and any method that uses this) will trigger a
`collect` in `DistributedLDAModel`, or can we expect users of
`DistributedLDAModel` to understand that already?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11712] [ML] Make spark.ml LDAModel be a...

2015-11-12 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9678#discussion_r44721007
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -480,58 +494,38 @@ class DistributedLDAModel private[ml] (
 vocabSize: Int,
 private val oldDistributedModel: OldDistributedLDAModel,
 sqlContext: SQLContext)
-  extends LDAModel(uid, vocabSize, None, sqlContext) {
+  extends LDAModel(uid, vocabSize, sqlContext) {
+
+  /** Used to implement [[oldLocalModel]] as a lazy val, but with cheap [[copy()]] */
+  private var oldLocalModelOption: Option[OldLocalLDAModel] = None
+
+  override protected def oldLocalModel: OldLocalLDAModel = {
+if (oldLocalModelOption.isEmpty) {
+  oldLocalModelOption = Some(oldDistributedModel.toLocal)
+}
+oldLocalModelOption.get
+  }
+
+  override protected def getModel: OldLDAModel = oldDistributedModel
 
   /**
* Convert this distributed model to a local representation.  This discards info about the
* training dataset.
*/
   @Since("1.6.0")
-  def toLocal: LDAModel = {
-if (oldLocalModel.isEmpty) {
-  oldLocalModel = Some(oldDistributedModel.toLocal)
-}
-new LDAModel(uid, vocabSize, oldLocalModel, sqlContext)
-  }
-
-  @Since("1.6.0")
-  override protected def getModel: OldLDAModel = oldDistributedModel
+  def toLocal: LocalLDAModel = new LocalLDAModel(uid, vocabSize, oldLocalModel, sqlContext)
 
   @Since("1.6.0")
   override def copy(extra: ParamMap): DistributedLDAModel = {
val copied = new DistributedLDAModel(uid, vocabSize, oldDistributedModel, sqlContext)
-if (oldLocalModel.nonEmpty) copied.oldLocalModel = oldLocalModel
+copied.oldLocalModelOption = oldLocalModelOption
--- End diff --

Any harm in making `oldLocalModelOption` part of the constructor and 
avoiding this private var access?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11712] [ML] Make spark.ml LDAModel be a...

2015-11-12 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9678#discussion_r44720740
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -468,7 +452,37 @@ class LDAModel private[ml] (
 /**
  * :: Experimental ::
  *
- * Distributed model fitted by [[LDA]] using Expectation-Maximization (EM).
+ * Local (non-distributed) model fitted by [[LDA]].
+ *
+ * This model stores the inferred topics only; it does not store info 
about the training dataset.
+ */
+@Since("1.6.0")
+@Experimental
+class LocalLDAModel private[ml] (
+uid: String,
--- End diff --

We annotated the constructor args to the `private` constructor for 
`LDAModel` but not here; why?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11712] [ML] Make spark.ml LDAModel be a...

2015-11-12 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9678#discussion_r44720656
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -468,7 +452,37 @@ class LDAModel private[ml] (
 /**
  * :: Experimental ::
  *
- * Distributed model fitted by [[LDA]] using Expectation-Maximization (EM).
+ * Local (non-distributed) model fitted by [[LDA]].
+ *
+ * This model stores the inferred topics only; it does not store info 
about the training dataset.
+ */
+@Since("1.6.0")
+@Experimental
+class LocalLDAModel private[ml] (
+uid: String,
+vocabSize: Int,
+@Since("1.6.0") override protected val oldLocalModel: OldLocalLDAModel,
--- End diff --

Does a `protected val` in a private constructor need a Since annotation? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11712] [ML] Make spark.ml LDAModel be a...

2015-11-12 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9678#discussion_r44720555
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -480,58 +494,38 @@ class DistributedLDAModel private[ml] (
 vocabSize: Int,
--- End diff --

Do these need Since annotations?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11262][ML] Unit test for gradient, loss...

2015-11-11 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9229#issuecomment-155714828
  
@avulanov On [this 
page](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45410/consoleFull)
 search for "*** FAILED ***"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155376552
  
+1 on the renames

On Tue, Nov 10, 2015, 02:48 Apache Spark QA <notificati...@github.com>
wrote:

> *Test build #2025 has finished
> 
<https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2025/consoleFull>*
> for PR 9513 at commit 16a061c
> 
<https://github.com/apache/spark/commit/16a061ca4df6abb59b1cff6695debac7492260ab>
> .
>
>- This patch passes all tests.
>- This patch merges cleanly.
>- This patch adds the following public classes *(experimental)*:\n * 
class
>LDA @Since(\"1.6.0\") (\n
>
> —
>
>
> Reply to this email directly or view it on GitHub
> <https://github.com/apache/spark/pull/9513#issuecomment-155266291>.
>



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-10393] use ML pipeline in LDA example

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/8551#issuecomment-155376709
  
Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155630055
  
I still think it's wrong for a `LocalLDAModel` to *optionally* have an
`OldLocalLDAModel` when it's basically a wrapper for `OldLocalLDAModel`.
Forking the inheritance structure could avoid that by making the
`Option[OldLocalLDAModel]` local to `DistributedLDAModel` (we can still keep
the copy-only-if-already-`collect`ed semantics) while also removing the
`case Some(...) => ... case None => /* should never happen */` branches.
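
To make that split concrete, a hedged sketch with stand-in types (not the actual
Spark classes): the local subclass always owns an old local model, and only the
distributed subclass carries the lazily-built Option, so no "should never happen"
branches remain:

    abstract class LDAModelSketch {
      protected def oldLocalModel: OldLocalLDAModelSketch
    }

    class LocalLDAModelSketch(override protected val oldLocalModel: OldLocalLDAModelSketch)
      extends LDAModelSketch

    class DistributedLDAModelSketch(oldDistributedModel: OldDistributedLDAModelSketch)
      extends LDAModelSketch {
      // Materialized lazily on first use; an already-materialized copy can be shared cheaply.
      private var oldLocalModelOption: Option[OldLocalLDAModelSketch] = None

      override protected def oldLocalModel: OldLocalLDAModelSketch = {
        if (oldLocalModelOption.isEmpty) {
          oldLocalModelOption = Some(oldDistributedModel.toLocal) // the expensive collect-like step
        }
        oldLocalModelOption.get
      }
    }

    // Stand-ins for the spark.mllib models, only so the sketch compiles on its own.
    class OldLocalLDAModelSketch
    class OldDistributedLDAModelSketch {
      def toLocal: OldLocalLDAModelSketch = new OldLocalLDAModelSketch
    }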


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155627493
  
Oh wait I see what you're saying


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-10 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155627361
  
@jkbradley Not sure I understand, if `lazy val oldModel = 
*something*.collect()` then `collect()` will only be called once on the first 
reference to `oldModel` and every subsequent reference to `oldModel` will use 
the `Array[...]` materialized by `collect()`
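
A tiny self-contained demonstration of that evaluate-once behaviour, with a println
standing in for the expensive `collect()`:

    object LazyValOnceSketch {
      def main(args: Array[String]): Unit = {
        lazy val expensive: Array[Int] = {
          println("materializing (think: RDD collect())")
          Array(1, 2, 3)
        }
        println(expensive.length)  // triggers the one-time materialization, then prints 3
        println(expensive.sum)     // reuses the cached array; "materializing" is not printed again
      }
    }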


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44341246
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, 
HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => 
OldDistributedLDAModel,
+EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => 
OldLDAModel,
+LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol 
with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
+   * - default = uniformly (50 / k) + 1, where 50/k is common in LDA 
libraries and +1 follows
+   *   from Asuncion et al. (2009), who recommend a +1 adjustment for 
EM.
+   *  - Online
+   * - Values should be >= 0
+   * - default = uniformly (1.0 / k), following the implementation from
+   *   [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, 
"docConcentration",
+"Concentration parameter (commonly named \"alpha\") for the prior 
placed on documents'" +
+  " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other 
Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+if (alpha.length == 1) {
+  alpha(0) == -1 || alpha(0) >= 1.0
+} else if (alpha.length > 1) {
+  alpha.forall(_ >= 1.0)
+} else {
+  false
+}
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentrat

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44341643
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44322164
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44322120
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44323247
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44324308
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44323843
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44323496
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155185664
  
Second pass. The most significant comments are about completely removing 
`Vector` from the public API and the debate between `DistributedLDAModel < LDAModel` 
and `abstract class LDAModel`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44323084
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44324500
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44321879
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, 
HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => 
OldDistributedLDAModel,
+EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => 
OldLDAModel,
+LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol 
with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
--- End diff --

Ideally `Either[Double,Vector]` would be best, but I'm not sure whether params 
can be `Either`s. If not, what you proposed sounds good.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
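
If Params cannot carry an `Either`, one alternative along the lines proposed above is to keep the
single DoubleArrayParam and expose overloaded setters for the symmetric and asymmetric cases. A
hedged sketch (the trait and setter names are hypothetical, not code from the PR):

import org.apache.spark.ml.param.{DoubleArrayParam, Params}

trait DocConcentrationSetters extends Params {
  // Assumed to be defined by the concrete class, as in the quoted LDAParams trait.
  val docConcentration: DoubleArrayParam

  /** Symmetric prior: one alpha value, replicated to length k during fitting. */
  def setDocConcentration(value: Double): this.type = set(docConcentration, Array(value))

  /** Asymmetric prior: the caller supplies the full length-k vector. */
  def setDocConcentration(value: Array[Double]): this.type = set(docConcentration, value)
}

This keeps a single stored representation (Array[Double]) while letting callers write
setDocConcentration(0.1) for the common symmetric case.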



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44354690
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,668 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, 
HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => 
OldDistributedLDAModel,
+EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => 
OldLDAModel,
+LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol 
with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters) to infer. Must be > 1. 
Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of topics (clusters) to 
infer",
+ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
+   * - default = uniformly (50 / k) + 1, where 50/k is common in LDA 
libraries and +1 follows
+   *   from Asuncion et al. (2009), who recommend a +1 adjustment for 
EM.
+   *  - Online
+   * - Values should be >= 0
+   * - default = uniformly (1.0 / k), following the implementation from
+   *   [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, 
"docConcentration",
+"Concentration parameter (commonly named \"alpha\") for the prior 
placed on documents'" +
+  " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other 
Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+if (alpha.length == 1) {
+  alpha(0) == -1 || alpha(0) >= 1.0
+} else if (alpha.length > 1) {
+  alpha.forall(_ >= 1.0)
+} else {
+  false
+}
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Con
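
To make the optimizer-specific defaults in the excerpt above concrete, a small sketch that simply
restates the two formulas for a hypothetical k (not code from the PR):

object DocConcentrationDefaults {
  // EM default: uniformly (50 / k) + 1.
  def emDefault(k: Int): Array[Double] = Array.fill(k)(50.0 / k + 1.0)

  // Online default: uniformly (1.0 / k).
  def onlineDefault(k: Int): Array[Double] = Array.fill(k)(1.0 / k)

  def main(args: Array[String]): Unit = {
    val k = 10
    println(emDefault(k).mkString(", "))      // 6.0 repeated k times
    println(onlineDefault(k).mkString(", "))  // 0.1 repeated k times
  }
}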

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44354638
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-09 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-155243816
  
LGTM

If we do decide to change the inheritance structure, it should be done 
before the 1.6 release to prevent breaking public APIs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202059
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => OldDistributedLDAModel,
+  EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => OldLDAModel,
+  LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+  OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   *     - Currently only supports symmetric distributions, so all values in the vector should be
+   *       the same.
+   *     - Values should be > 1.0
+   *     - default = uniformly (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
+   *       from Asuncion et al. (2009), who recommend a +1 adjustment for EM.
+   *  - Online
+   *     - Values should be >= 0
+   *     - default = uniformly (1.0 / k), following the implementation from
+   *       [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, "docConcentration",
+    "Concentration parameter (commonly named \"alpha\") for the prior placed on documents'" +
+      " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+    if (alpha.length == 1) {
+      alpha(0) == -1 || alpha(0) >= 1.0
+    } else if (alpha.length > 1) {
+      alpha.forall(_ >= 1.0)
+    } else {
+      false
+    }
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentrat
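
An aside on the `validDocConcentration` helper quoted above: the singleton case is special-cased so that `[-1]` can mean "choose alpha automatically". A minimal standalone sketch of the same rule, with a few example inputs (the name `isValidAlpha` is illustrative only and is not part of the PR):

    // Scala sketch mirroring the docConcentration validation in the diff above.
    def isValidAlpha(alpha: Array[Double]): Boolean = {
      if (alpha.length == 1) {
        alpha(0) == -1 || alpha(0) >= 1.0   // singleton: [-1] means "set automatically"
      } else if (alpha.length > 1) {
        alpha.forall(_ >= 1.0)              // full-length vector: every entry must be >= 1.0
      } else {
        false                               // empty array is rejected
      }
    }

    isValidAlpha(Array(-1.0))            // true: automatic
    isValidAlpha(Array(2.5))             // true: singleton alpha >= 1.0
    isValidAlpha(Array(1.5, 1.5, 1.5))   // true: symmetric vector of length k
    isValidAlpha(Array(0.5, 1.5))        // false: contains an entry < 1.0
    isValidAlpha(Array.empty[Double])    // false: empty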

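For context, a rough sketch of how the params above might be set once the wrapper lands. It assumes the `org.apache.spark.ml.clustering.LDA` estimator exposes `setK`, `setMaxIter`, and `setDocConcentration` setters for the `LDAParams` quoted here; those setters are not visible in this hunk.

    import org.apache.spark.ml.clustering.LDA

    // Sketch only; the setter names are assumed rather than shown in the quoted diff.
    val lda = new LDA()
      .setK(10)                            // number of topics, must be > 1
      .setMaxIter(20)
      .setDocConcentration(Array(-1.0))    // singleton [-1] => alpha chosen automatically

    // With the EM optimizer an explicit symmetric prior should use values > 1.0,
    // e.g. the (50 / k) + 1 default mentioned in the doc comment:
    val k = 10
    val emAlpha = Array.fill(k)(50.0 / k + 1)   // 6.0 for every topic when k = 10
    // The online optimizer defaults to uniformly 1.0 / k, i.e. 0.1 when k = 10.
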
[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202038
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203476
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203542
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202262
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203245
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203695
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on the pull request:

https://github.com/apache/spark/pull/9513#issuecomment-154590891
  
Made a pass


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202348
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202796
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203600
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202645
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202165
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202952
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202889
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202541
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, 
HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => 
OldDistributedLDAModel,
+EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => 
OldLDAModel,
+LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol 
with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
+   * - default = uniformly (50 / k) + 1, where 50/k is common in LDA 
libraries and +1 follows
+   *   from Asuncion et al. (2009), who recommend a +1 adjustment for 
EM.
+   *  - Online
+   * - Values should be >= 0
+   * - default = uniformly (1.0 / k), following the implementation from
+   *   [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, 
"docConcentration",
+"Concentration parameter (commonly named \"alpha\") for the prior 
placed on documents'" +
+  " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other 
Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+if (alpha.length == 1) {
+  alpha(0) == -1 || alpha(0) >= 1.0
+} else if (alpha.length > 1) {
+  alpha.forall(_ >= 1.0)
+} else {
+  false
+}
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentrat

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202484
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, 
HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => 
OldDistributedLDAModel,
+EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => 
OldLDAModel,
+LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol 
with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
--- End diff --

IMO this validation logic is quite confusing and exists mainly for backwards 
compatibility. Since we have this opportunity to implement a new API, I suggest:
 * Ditching the singleton vector option and requiring the user to specify a 
length `k` vector
 * Keeping the automatic init as the default, which keeps the API easy for 
novice users

The only feature lost is the replication of a `docConcentration` > 0 into a 
symmetric prior; a rough sketch of the simplified check follows below.
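
A minimal sketch of what that simplified rule could look like (hypothetical 
names, assuming an empty array stands in for "automatic"; the per-optimizer 
bounds are taken from the Javadoc quoted above, so this is an illustration of 
the suggestion, not the merged Spark code):

```scala
// Hypothetical sketch of the proposed simplification -- not the Spark implementation.
// Empty array = automatic initialization; otherwise a full length-k vector is required.
object SimplifiedDocConcentrationCheck {
  def isValid(alpha: Array[Double], k: Int, optimizer: String): Boolean =
    if (alpha.isEmpty) {
      true // automatic default, chosen per optimizer at fit() time
    } else if (alpha.length != k) {
      false // no singleton replication any more: require a full length-k vector
    } else optimizer.toLowerCase match {
      case "em"     => alpha.forall(_ > 1.0) && alpha.distinct.length == 1 // symmetric, values > 1.0
      case "online" => alpha.forall(_ >= 0.0)                              // values >= 0
      case _        => false
    }

  def main(args: Array[String]): Unit = {
    println(isValid(Array.empty[Double], 10, "em")) // true: automatic default
    println(isValid(Array.fill(10)(6.0), 10, "em")) // true: symmetric length-k vector
    println(isValid(Array(6.0), 10, "em"))          // false: singleton no longer accepted
  }
}
```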


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
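
For reference, a standalone sketch of how the behaviour documented in the 
Javadoc quoted above plays out for a given `k`: the automatic per-optimizer 
defaults and the singleton-replication rule the review discusses dropping. 
The object and method names below are illustrative, not part of the Spark API.

```scala
// Illustrative sketch only -- mirrors the behaviour described in the quoted
// Javadoc for docConcentration; it is not the Spark implementation itself.
object DocConcentrationExample {
  // Automatic defaults when docConcentration is left at [-1]:
  def emDefault(k: Int): Array[Double]     = Array.fill(k)(50.0 / k + 1.0) // 50/k + 1 (Asuncion et al. 2009)
  def onlineDefault(k: Int): Array[Double] = Array.fill(k)(1.0 / k)        // 1.0/k (onlineldavb)

  // A singleton [alpha] with alpha != -1 is replicated to a length-k symmetric vector at fit time.
  def expand(alpha: Array[Double], k: Int): Array[Double] =
    if (alpha.length == 1 && alpha(0) != -1) Array.fill(k)(alpha(0)) else alpha

  def main(args: Array[String]): Unit = {
    val k = 10
    println(emDefault(k).head)                   // 6.0
    println(onlineDefault(k).head)               // 0.1
    println(expand(Array(3.0), k).mkString(",")) // 3.0 replicated 10 times
  }
}
```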



[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44203006
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, 
HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => 
OldDistributedLDAModel,
+EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => 
OldLDAModel,
+LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol 
with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
+   * - default = uniformly (50 / k) + 1, where 50/k is common in LDA 
libraries and +1 follows
+   *   from Asuncion et al. (2009), who recommend a +1 adjustment for 
EM.
+   *  - Online
+   * - Values should be >= 0
+   * - default = uniformly (1.0 / k), following the implementation from
+   *   [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, 
"docConcentration",
+"Concentration parameter (commonly named \"alpha\") for the prior 
placed on documents'" +
+  " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other 
Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+if (alpha.length == 1) {
+  alpha(0) == -1 || alpha(0) >= 1.0
+} else if (alpha.length > 1) {
+  alpha.forall(_ >= 1.0)
+} else {
+  false
+}
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentrat

[GitHub] spark pull request: [SPARK-5565] [ML] LDA wrapper for Pipelines AP...

2015-11-06 Thread feynmanliang
Github user feynmanliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/9513#discussion_r44202982
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -0,0 +1,740 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.clustering
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param.shared.{HasCheckpointInterval, 
HasFeaturesCol, HasSeed, HasMaxIter}
+import org.apache.spark.ml.param._
+import org.apache.spark.mllib.clustering.{DistributedLDAModel => 
OldDistributedLDAModel,
+EMLDAOptimizer => OldEMLDAOptimizer, LDA => OldLDA, LDAModel => 
OldLDAModel,
+LDAOptimizer => OldLDAOptimizer, LocalLDAModel => OldLocalLDAModel,
+OnlineLDAOptimizer => OldOnlineLDAOptimizer}
+import org.apache.spark.mllib.linalg.{VectorUDT, Vectors, Matrix, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{SQLContext, DataFrame, Row}
+import org.apache.spark.sql.functions.{col, monotonicallyIncreasingId, udf}
+import org.apache.spark.sql.types.StructType
+
+
+private[clustering] trait LDAParams extends Params with HasFeaturesCol 
with HasMaxIter
+  with HasSeed with HasCheckpointInterval {
+
+  /**
+   * Param for the number of topics (clusters). Must be > 1. Default: 10.
+   * @group param
+   */
+  @Since("1.6.0")
+  final val k = new IntParam(this, "k", "number of clusters to create", 
ParamValidators.gt(1))
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getK: Int = $(k)
+
+  /**
+   * Concentration parameter (commonly named "alpha") for the prior placed 
on documents'
+   * distributions over topics ("theta").
+   *
+   * This is the parameter to a Dirichlet distribution, where larger 
values mean more smoothing
+   * (more regularization).
+   *
+   * If set to a singleton vector [-1], then docConcentration is set 
automatically. If set to
+   * singleton vector [alpha] where alpha != -1, then alpha is replicated 
to a vector of
+   * length k in fitting. Otherwise, the [[docConcentration]] vector must 
be length k.
+   * (default = [-1] = automatic)
+   *
+   * Optimizer-specific parameter settings:
+   *  - EM
+   * - Currently only supports symmetric distributions, so all values 
in the vector should be
+   *   the same.
+   * - Values should be > 1.0
+   * - default = uniformly (50 / k) + 1, where 50/k is common in LDA 
libraries and +1 follows
+   *   from Asuncion et al. (2009), who recommend a +1 adjustment for 
EM.
+   *  - Online
+   * - Values should be >= 0
+   * - default = uniformly (1.0 / k), following the implementation from
+   *   [[https://github.com/Blei-Lab/onlineldavb]].
+   * @group param
+   */
+  @Since("1.6.0")
+  final val docConcentration = new DoubleArrayParam(this, 
"docConcentration",
+"Concentration parameter (commonly named \"alpha\") for the prior 
placed on documents'" +
+  " distributions over topics (\"theta\").", validDocConcentration)
+
+  /** Check that the docConcentration is valid, independently of other 
Params */
+  private def validDocConcentration(alpha: Array[Double]): Boolean = {
+if (alpha.length == 1) {
+  alpha(0) == -1 || alpha(0) >= 1.0
+} else if (alpha.length > 1) {
+  alpha.forall(_ >= 1.0)
+} else {
+  false
+}
+  }
+
+  /** @group getParam */
+  @Since("1.6.0")
+  def getDocConcentration: Array[Double] = $(docConcentration)
+
+  /**
+   * Alias for [[getDocConcentrat
