[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-16 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r181731252
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

That's not a very scalable fix...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r181270142
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

I can build locally against scala-2.12 to check it. :)


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r181230965
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

This is the second time I've merged something that broke the Scala 2.12 
build.  If we care about Scala 2.12, do you know whether there are any efforts 
to add PR builder tests for 2.12?


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r181018223
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

[SPARK-23751][FOLLOW-UP] fix build for scala-2.12


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-12 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r181015525
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

ah it's my fault. Sorry!


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-12 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r181006772
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

Seems like this broke the Scala 2.12 build again. Can you fix it, or shall I? 
Thanks!


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-10 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20904


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r180228711
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -127,13 +113,86 @@ class Correlation(object):
 def corr(dataset, column, method="pearson"):
 """
 Compute the correlation matrix with specified method using dataset.
+
+:param dataset:
+  A Dataset or a DataFrame.
+:param column:
+  The name of the column of vectors for which the correlation 
coefficient needs
+  to be computed. This must be a column of the dataset, and it 
must contain
+  Vector objects.
+:param method:
+  String specifying the method to use for computing correlation.
+  Supported: `pearson` (default), `spearman`.
+:return:
+  A DataFrame that contains the correlation matrix of the column 
of vectors. This
+  DataFrame contains a single row and a single column of name
+  '$METHODNAME($COLUMN)'.
 """
 sc = SparkContext._active_spark_context
 javaCorrObj = _jvm().org.apache.spark.ml.stat.Correlation
 args = [_py2java(sc, arg) for arg in (dataset, column, method)]
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a continuous
+distribution.
+
+By comparing the largest difference between the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+>>> from pyspark.ml.stat import KolmogorovSmirnovTest
--- End diff --

Thanks for moving the method-specific documentation.  These doctests are 
method-specific too, though, so can you please move them as well?


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r180245120
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -127,13 +113,86 @@ class Correlation(object):
 def corr(dataset, column, method="pearson"):
 """
 Compute the correlation matrix with specified method using dataset.
+
+:param dataset:
+  A Dataset or a DataFrame.
+:param column:
+  The name of the column of vectors for which the correlation 
coefficient needs
+  to be computed. This must be a column of the dataset, and it 
must contain
+  Vector objects.
+:param method:
+  String specifying the method to use for computing correlation.
+  Supported: `pearson` (default), `spearman`.
+:return:
+  A DataFrame that contains the correlation matrix of the column 
of vectors. This
+  DataFrame contains a single row and a single column of name
+  '$METHODNAME($COLUMN)'.
 """
 sc = SparkContext._active_spark_context
 javaCorrObj = _jvm().org.apache.spark.ml.stat.Correlation
 args = [_py2java(sc, arg) for arg in (dataset, column, method)]
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a continuous
+distribution.
+
+By comparing the largest difference between the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+>>> from pyspark.ml.stat import KolmogorovSmirnovTest
+>>> dataset = [[-1.0], [0.0], [1.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
0.0, 1.0).first()
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+>>> dataset = [[2.0], [3.0], [4.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
3.0, 1.0).first()
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def test(dataset, sampleCol, distName, *params):
+"""
+Perform a Kolmogorov-Smirnov test using dataset.
--- End diff --

Can you please make this match the text in the Scala doc?
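[Editorial aside: the doctest values quoted in this hunk (statistic 0.175 for the sample [-1.0, 0.0, 1.0] against a standard normal) can be sanity-checked without Spark. The sketch below computes the two-sided KS statistic directly with the Python standard library; `norm_cdf` and `ks_statistic` are illustrative helpers, not part of the PySpark API.]

```python
import math

def norm_cdf(x, mean=0.0, std=1.0):
    """Normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Two-sided KS statistic: the largest gap between the empirical CDF
    and the theoretical CDF, checked just before and after each sample point."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

stat = ks_statistic([-1.0, 0.0, 1.0], norm_cdf)
print(round(stat, 3))  # 0.175, matching the doctest above
```

The second doctest ([2.0, 3.0, 4.0] against norm(3.0, 1.0)) gives the same statistic, since it is the first sample shifted by 3.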


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179831482
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,65 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a continuous
+distribution.
+
+By comparing the largest difference between the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+:param dataset:
+  a dataset or a dataframe containing the sample of data to test.
+:param sampleCol:
+  Name of sample column in dataset, of any numerical type.
+:param distName:
+  a `string` name for a theoretical distribution, currently only 
support "norm".
+:param params:
+  a list of `Double` values specifying the parameters to be used for 
the theoretical
+  distribution
+:return:
+  A dataframe that contains the Kolmogorov-Smirnov test result for the 
input sampled data.
+  This DataFrame will contain a single Row with the following fields:
+  - `pValue: Double`
+  - `statistic: Double`
+
+>>> from pyspark.ml.stat import KolmogorovSmirnovTest
+>>> dataset = [[-1.0], [0.0], [1.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
0.0, 1.0).collect()[0]
--- End diff --

nit: use first() instead of collect()[0]


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179833156
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,65 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a continuous
+distribution.
+
+By comparing the largest difference between the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+:param dataset:
+  a dataset or a dataframe containing the sample of data to test.
+:param sampleCol:
+  Name of sample column in dataset, of any numerical type.
+:param distName:
+  a `string` name for a theoretical distribution, currently only 
support "norm".
+:param params:
+  a list of `Double` values specifying the parameters to be used for 
the theoretical
--- End diff --

I realized we should list what the parameters are, both here and in the 
Scala docs.


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179824556
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -102,10 +102,11 @@ object KolmogorovSmirnovTest {
    */
   @Since("2.4.0")
   @varargs
-  def test(dataset: DataFrame, sampleCol: String, distName: String, params: Double*): DataFrame = {
+  def test(dataset: Dataset[_], sampleCol: String, distName: String, params: Double*)
--- End diff --

nit: This doesn't fit scala style; please get familiar with the style we 
use for multi-line function headers!  Just check out other parts of MLlib for 
examples.


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179832593
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,7 +81,7 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
+  def test(dataset: Dataset[_], sampleCol: String,
       cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
--- End diff --

I guess I missed this before.  Would you mind fixing the scala style here 
too?


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179830986
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,65 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a continuous
+distribution.
+
+By comparing the largest difference between the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+:param dataset:
+  a dataset or a dataframe containing the sample of data to test.
--- End diff --

nit: dataset -> Dataset, dataframe -> DataFrame (It's nice to write class 
names the way they are defined.)


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179832114
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,65 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a continuous
+distribution.
+
+By comparing the largest difference between the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+:param dataset:
--- End diff --

I see you're following the example of ChiSquareTest, but this Param 
documentation belongs with the test method, not the class.  Could you please 
shift it?  (Feel free to correct it for ChiSquareTest here or in another PR.)


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-06 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179824228
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -59,7 +59,7 @@ object KolmogorovSmirnovTest {
    * distribution of the sample data and the theoretical distribution we can provide a test for the
    * the null hypothesis that the sample data comes from that theoretical distribution.
    *
-   * @param dataset a `DataFrame` containing the sample of data to test
+   * @param dataset A dataset or a dataframe containing the sample of data to test
--- End diff --

nit: It's nicer to keep single back quotes ``` `DataFrame` ``` to make 
these show up as code in docs for clarity.  No need to get rid of that.


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-05 Thread yogeshg
Github user yogeshg commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179639255
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a
+continuous distribution. By comparing the largest difference between 
the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+:param dataset:
+  a dataset or a dataframe containing the sample of data to test.
+:param sampleCol:
+  Name of sample column in dataset, of any numerical type.
+:param distName:
+  a `string` name for a theoretical distribution, currently only 
support "norm".
+:param params:
+  a list of `Double` values specifying the parameters to be used for 
the theoretical
+  distribution
+:return:
+  A dataframe that contains the Kolmogorov-Smirnov test result for the 
input sampled data.
+  This DataFrame will contain a single Row with the following fields:
+  - `pValue: Double`
+  - `statistic: Double`
+
+>>> from pyspark.ml.stat import KolmogorovSmirnovTest
+>>> dataset = [[-1.0], [0.0], [1.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
0.0, 1.0).collect()[0]
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+>>> dataset = [[2.0], [3.0], [4.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
3.0, 1.0).collect()[0]
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def test(dataset, sampleCol, distName, *params):
+"""
+Perform a Kolmogorov-Smirnov test using dataset.
+"""
+sc = SparkContext._active_spark_context
+javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
+dataset = _py2java(sc, dataset)
+params = [float(param) for param in params]
+return _java2py(sc, javaTestObj.test(dataset, sampleCol, distName,
--- End diff --

thanks for checking this out! current usage sounds fair!


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-04 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179311446
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a
+continuous distribution. By comparing the largest difference between 
the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+:param dataset:
+  a dataset or a dataframe containing the sample of data to test.
+:param sampleCol:
+  Name of sample column in dataset, of any numerical type.
+:param distName:
+  a `string` name for a theoretical distribution, currently only 
support "norm".
+:param params:
+  a list of `Double` values specifying the parameters to be used for 
the theoretical
+  distribution
+:return:
+  A dataframe that contains the Kolmogorov-Smirnov test result for the 
input sampled data.
+  This DataFrame will contain a single Row with the following fields:
+  - `pValue: Double`
+  - `statistic: Double`
+
+>>> from pyspark.ml.stat import KolmogorovSmirnovTest
+>>> dataset = [[-1.0], [0.0], [1.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
0.0, 1.0).collect()[0]
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+>>> dataset = [[2.0], [3.0], [4.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
3.0, 1.0).collect()[0]
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def test(dataset, sampleCol, distName, *params):
+"""
+Perform a Kolmogorov-Smirnov test using dataset.
+"""
+sc = SparkContext._active_spark_context
+javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
+dataset = _py2java(sc, dataset)
+params = [float(param) for param in params]
+return _java2py(sc, javaTestObj.test(dataset, sampleCol, distName,
--- End diff --

- I think for primitive types, we can leave out `_py2java` when passing 
args. (There are also some examples of this in PySpark.)

- Have you tried `callJavaFunc`? I tried it, but some errors seem to occur. 
Not sure about it.


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-04 Thread yogeshg
Github user yogeshg commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179278641
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a
+continuous distribution. By comparing the largest difference between 
the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+:param dataset:
+  a dataset or a dataframe containing the sample of data to test.
+:param sampleCol:
+  Name of sample column in dataset, of any numerical type.
+:param distName:
+  a `string` name for a theoretical distribution, currently only 
support "norm".
+:param params:
+  a list of `Double` values specifying the parameters to be used for 
the theoretical
+  distribution
+:return:
+  A dataframe that contains the Kolmogorov-Smirnov test result for the 
input sampled data.
+  This DataFrame will contain a single Row with the following fields:
+  - `pValue: Double`
+  - `statistic: Double`
+
+>>> from pyspark.ml.stat import KolmogorovSmirnovTest
+>>> dataset = [[-1.0], [0.0], [1.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
0.0, 1.0).collect()[0]
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+>>> dataset = [[2.0], [3.0], [4.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
3.0, 1.0).collect()[0]
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def test(dataset, sampleCol, distName, *params):
+"""
+Perform a Kolmogorov-Smirnov test using dataset.
+"""
+sc = SparkContext._active_spark_context
+javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
+dataset = _py2java(sc, dataset)
+params = [float(param) for param in params]
+return _java2py(sc, javaTestObj.test(dataset, sampleCol, distName,
--- End diff --

do we need to do `_py2java(sc, *)` for `sampleCol` and/or `distName`? I 
noticed that in `ChiSquaredTest`, but I'm not sure.


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-04 Thread yogeshg
Github user yogeshg commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179283700
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a
+continuous distribution. By comparing the largest difference between 
the empirical cumulative
+distribution of the sample data and the theoretical distribution we 
can provide a test for the
+the null hypothesis that the sample data comes from that theoretical 
distribution.
+
+:param dataset:
+  a dataset or a dataframe containing the sample of data to test.
+:param sampleCol:
+  Name of sample column in dataset, of any numerical type.
+:param distName:
+  a `string` name for a theoretical distribution, currently only 
support "norm".
+:param params:
+  a list of `Double` values specifying the parameters to be used for 
the theoretical
+  distribution
+:return:
+  A dataframe that contains the Kolmogorov-Smirnov test result for the 
input sampled data.
+  This DataFrame will contain a single Row with the following fields:
+  - `pValue: Double`
+  - `statistic: Double`
+
+>>> from pyspark.ml.stat import KolmogorovSmirnovTest
+>>> dataset = [[-1.0], [0.0], [1.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
0.0, 1.0).collect()[0]
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+>>> dataset = [[2.0], [3.0], [4.0]]
+>>> dataset = spark.createDataFrame(dataset, ['sample'])
+>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 
3.0, 1.0).collect()[0]
+>>> round(ksResult.pValue, 3)
+1.0
+>>> round(ksResult.statistic, 3)
+0.175
+
+.. versionadded:: 2.4.0
+
+"""
+@staticmethod
+@since("2.4.0")
+def test(dataset, sampleCol, distName, *params):
+"""
+Perform a Kolmogorov-Smirnov test using dataset.
+"""
+sc = SparkContext._active_spark_context
+javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
+dataset = _py2java(sc, dataset)
+params = [float(param) for param in params]
+return _java2py(sc, javaTestObj.test(dataset, sampleCol, distName,
--- End diff --

Looks like we should be able to use `pyspark.ml.common.callJavaFunc`
```python
from pyspark.ml.common import _java2py, _py2java, callJavaFunc
...

return callJavaFunc(sc, javaTestObj, dataset, sampleCol, distName, *params)
```

but I am not fully sure whether the `_jvm().PythonUtils.toSeq()` method will 
still be required. 🤞 
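[Editorial aside: the helper being discussed roughly follows a convert-call-convert pattern: marshal each Python argument to its JVM form, invoke the Java-side function, then marshal the result back. The sketch below illustrates that shape with hypothetical `py2java`/`java2py` stand-ins rather than PySpark's real `_py2java`/`_java2py`, which need a live Py4J gateway.]

```python
def py2java(sc, obj):
    # Hypothetical stand-in: tag values as "JVM-side" for illustration.
    return ("jvm", obj)

def java2py(sc, obj):
    # Hypothetical inverse: unwrap tagged values back to Python ones.
    return obj[1] if isinstance(obj, tuple) and obj[0] == "jvm" else obj

def call_java_func(sc, func, *args):
    # Convert every argument to its JVM form, invoke the Java-side
    # function, then convert the result back to a Python value.
    return java2py(sc, func(*(py2java(sc, a) for a in args)))

# Usage: a fake "Java" function that sums the unwrapped arguments
# and returns a JVM-tagged result.
fake_func = lambda *jargs: ("jvm", sum(a[1] for a in jargs))
print(call_java_func(None, fake_func, 1, 2, 3))  # 6
```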


---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-04-04 Thread yogeshg
Github user yogeshg commented on a diff in the pull request:

https://github.com/apache/spark/pull/20904#discussion_r179267921
  
--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
 return _java2py(sc, javaCorrObj.corr(*args))
 
 
+class KolmogorovSmirnovTest(object):
+"""
+.. note:: Experimental
+
+Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled 
from a
--- End diff --

We can add a one-line description followed by a full descriptive paragraph; 
that way we'll get 
[docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.stat.package)
 consistent with the rest of the tests in the package.

```scala
/**
...
 * Conduct two-sided Kolmogorov Smirnov (KS) test for data sampled from a 
continuous distribution.
 * 
 * By comparing the largest difference between the empirical cumulative
...
 */
```




---




[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...

2018-03-26 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/20904

[SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Python API in pyspark.ml

## What changes were proposed in this pull request?

Kolmogorov-Smirnoff test Python API in `pyspark.ml`

**Note**  The API with a `CDF` argument is a little difficult to support in 
Python; we can add it in a follow-up PR.

## How was this patch tested?

doctest


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark ks-test-py

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20904.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20904


commit 6978780dbd0bd55d3bda535f84267353ea0bbfde
Author: WeichenXu 
Date:   2018-03-26T09:43:49Z

init pr




---
