[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r181731252

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

That's not a very scalable fix...

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r181270142

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

I can build locally against scala-2.12 to check it. :)
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r181230965

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

This is the 2nd time I've merged something which broke the Scala 2.12 build. If we care about Scala 2.12, do you know if there are any efforts to add PR builder tests for 2.12?
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r181018223

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

[SPARK-23751][FOLLOW-UP] fix build for scala-2.12
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r181015525

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

ah it's my fault. Sorry!
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user ueshin commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r181006772

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,32 +81,37 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
-      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
-    val f: Double => Double = x => cdf.call(x)
-    test(dataset, sampleCol, f)
+  def test(
+      dataset: Dataset[_],
+      sampleCol: String,
+      cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
+    test(dataset, sampleCol, (x: Double) => cdf.call(x))
--- End diff --

Seems like this broke the Scala 2.12 build again. Can you fix it or shall I? Thanks!
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/20904
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r180228711

--- Diff: python/pyspark/ml/stat.py ---
@@ -127,13 +113,86 @@ class Correlation(object):
     def corr(dataset, column, method="pearson"):
         """
         Compute the correlation matrix with specified method using dataset.
+
+        :param dataset:
+          A Dataset or a DataFrame.
+        :param column:
+          The name of the column of vectors for which the correlation coefficient needs
+          to be computed. This must be a column of the dataset, and it must contain
+          Vector objects.
+        :param method:
+          String specifying the method to use for computing correlation.
+          Supported: `pearson` (default), `spearman`.
+        :return:
+          A DataFrame that contains the correlation matrix of the column of vectors. This
+          DataFrame contains a single row and a single column of name
+          '$METHODNAME($COLUMN)'.
         """
         sc = SparkContext._active_spark_context
         javaCorrObj = _jvm().org.apache.spark.ml.stat.Correlation
         args = [_py2java(sc, arg) for arg in (dataset, column, method)]
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous
+    distribution.
+
+    By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    >>> from pyspark.ml.stat import KolmogorovSmirnovTest
--- End diff --

Thanks for moving the method-specific documentation. These doctests are method-specific too, though, so can you please move them as well?
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r180245120

--- Diff: python/pyspark/ml/stat.py ---
@@ -127,13 +113,86 @@ class Correlation(object):
     def corr(dataset, column, method="pearson"):
         """
         Compute the correlation matrix with specified method using dataset.
+
+        :param dataset:
+          A Dataset or a DataFrame.
+        :param column:
+          The name of the column of vectors for which the correlation coefficient needs
+          to be computed. This must be a column of the dataset, and it must contain
+          Vector objects.
+        :param method:
+          String specifying the method to use for computing correlation.
+          Supported: `pearson` (default), `spearman`.
+        :return:
+          A DataFrame that contains the correlation matrix of the column of vectors. This
+          DataFrame contains a single row and a single column of name
+          '$METHODNAME($COLUMN)'.
         """
         sc = SparkContext._active_spark_context
         javaCorrObj = _jvm().org.apache.spark.ml.stat.Correlation
         args = [_py2java(sc, arg) for arg in (dataset, column, method)]
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous
+    distribution.
+
+    By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    >>> from pyspark.ml.stat import KolmogorovSmirnovTest
+    >>> dataset = [[-1.0], [0.0], [1.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).first()
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+    >>> dataset = [[2.0], [3.0], [4.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 3.0, 1.0).first()
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def test(dataset, sampleCol, distName, *params):
+        """
+        Perform a Kolmogorov-Smirnov test using dataset.
--- End diff --

Can you please make this match the text in the Scala doc?
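The doctest values quoted in this thread (statistic ≈ 0.175 for both sample/distribution pairs) are easy to sanity-check outside Spark. The following is a minimal pure-Python sketch of the two-sided KS statistic — not Spark's implementation, just the textbook definition of the largest gap between the empirical CDF and the theoretical CDF:

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Two-sided KS statistic: the largest gap between the empirical CDF
    of `sample` and the theoretical `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The ECDF jumps from (i - 1)/n to i/n at x; check both sides of the step.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

# Mirrors the doctest cases: norm(0, 1) vs. [-1, 0, 1] and norm(3, 1) vs. [2, 3, 4].
print(round(ks_statistic([-1.0, 0.0, 1.0], NormalDist(0.0, 1.0).cdf), 3))  # 0.175
print(round(ks_statistic([2.0, 3.0, 4.0], NormalDist(3.0, 1.0).cdf), 3))   # 0.175
```

The second case gives the same statistic as the first because it is the same sample shifted by the distribution's mean.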
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179831482

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,65 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous
+    distribution.
+
+    By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    :param dataset:
+      a dataset or a dataframe containing the sample of data to test.
+    :param sampleCol:
+      Name of sample column in dataset, of any numerical type.
+    :param distName:
+      a `string` name for a theoretical distribution, currently only support "norm".
+    :param params:
+      a list of `Double` values specifying the parameters to be used for the theoretical
+      distribution
+    :return:
+      A dataframe that contains the Kolmogorov-Smirnov test result for the input sampled data.
+      This DataFrame will contain a single Row with the following fields:
+      - `pValue: Double`
+      - `statistic: Double`
+
+    >>> from pyspark.ml.stat import KolmogorovSmirnovTest
+    >>> dataset = [[-1.0], [0.0], [1.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).collect()[0]
--- End diff --

nit: use first() instead of collect()[0]
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179833156

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,65 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous
+    distribution.
+
+    By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    :param dataset:
+      a dataset or a dataframe containing the sample of data to test.
+    :param sampleCol:
+      Name of sample column in dataset, of any numerical type.
+    :param distName:
+      a `string` name for a theoretical distribution, currently only support "norm".
+    :param params:
+      a list of `Double` values specifying the parameters to be used for the theoretical
--- End diff --

I realized we should list what the parameters are, both here and in the Scala docs.
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179824556

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -102,10 +102,11 @@ object KolmogorovSmirnovTest {
    */
   @Since("2.4.0")
   @varargs
-  def test(dataset: DataFrame, sampleCol: String, distName: String, params: Double*): DataFrame = {
+  def test(dataset: Dataset[_], sampleCol: String, distName: String, params: Double*)
--- End diff --

nit: This doesn't fit scala style; please get familiar with the style we use for multi-line function headers! Just check out other parts of MLlib for examples.
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179832593

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -81,7 +81,7 @@ object KolmogorovSmirnovTest {
    * Java-friendly version of `test(dataset: DataFrame, sampleCol: String, cdf: Double => Double)`
    */
   @Since("2.4.0")
-  def test(dataset: DataFrame, sampleCol: String,
+  def test(dataset: Dataset[_], sampleCol: String,
       cdf: Function[java.lang.Double, java.lang.Double]): DataFrame = {
--- End diff --

I guess I missed this before. Would you mind fixing the scala style here too?
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179830986

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,65 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous
+    distribution.
+
+    By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    :param dataset:
+      a dataset or a dataframe containing the sample of data to test.
--- End diff --

nit: dataset -> Dataset, dataframe -> DataFrame (It's nice to write class names the way they are defined.)
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179832114

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,65 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous
+    distribution.
+
+    By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    :param dataset:
--- End diff --

I see you're following the example of ChiSquareTest, but this Param documentation belongs with the test method, not the class. Could you please shift it? (Feel free to correct it for ChiSquareTest here or in another PR.)
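The placement being requested here can be sketched with a hypothetical class (all names below are made up for illustration): the class docstring carries only the description, while the `:param`/`:return` fields live on the static method, which is where Sphinx and `help()` surface them:

```python
class ExampleTest(object):
    """
    Conduct a hypothetical statistical test.  The class docstring holds only
    the high-level description (plus notes such as experimental status).
    """

    @staticmethod
    def test(dataset, sample_col):
        """
        Perform the test on dataset.

        :param dataset: the input data (placeholder; no Spark needed here).
        :param sample_col: name of the sample column.
        :return: a (dataset, sample_col) pair, standing in for a result DataFrame.
        """
        return (dataset, sample_col)

# The parameter docs are attached to the method, not the class:
print(":param" in ExampleTest.test.__doc__)      # True
print(":param" in (ExampleTest.__doc__ or ""))   # False
```

This mirrors how tools render the docs: `help(ExampleTest.test)` shows the parameters right next to the call signature instead of one level up on the class.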
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179824228

--- Diff: mllib/src/main/scala/org/apache/spark/ml/stat/KolmogorovSmirnovTest.scala ---
@@ -59,7 +59,7 @@ object KolmogorovSmirnovTest {
    * distribution of the sample data and the theoretical distribution we can provide a test for the
    * the null hypothesis that the sample data comes from that theoretical distribution.
    *
-   * @param dataset a `DataFrame` containing the sample of data to test
+   * @param dataset A dataset or a dataframe containing the sample of data to test
--- End diff --

nit: It's nicer to keep single back quotes ``` `DataFrame` ``` to make these show up as code in docs for clarity. No need to get rid of that.
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user yogeshg commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179639255

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a
+    continuous distribution. By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    :param dataset:
+      a dataset or a dataframe containing the sample of data to test.
+    :param sampleCol:
+      Name of sample column in dataset, of any numerical type.
+    :param distName:
+      a `string` name for a theoretical distribution, currently only support "norm".
+    :param params:
+      a list of `Double` values specifying the parameters to be used for the theoretical
+      distribution
+    :return:
+      A dataframe that contains the Kolmogorov-Smirnov test result for the input sampled data.
+      This DataFrame will contain a single Row with the following fields:
+      - `pValue: Double`
+      - `statistic: Double`
+
+    >>> from pyspark.ml.stat import KolmogorovSmirnovTest
+    >>> dataset = [[-1.0], [0.0], [1.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).collect()[0]
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+    >>> dataset = [[2.0], [3.0], [4.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 3.0, 1.0).collect()[0]
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def test(dataset, sampleCol, distName, *params):
+        """
+        Perform a Kolmogorov-Smirnov test using dataset.
+        """
+        sc = SparkContext._active_spark_context
+        javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
+        dataset = _py2java(sc, dataset)
+        params = [float(param) for param in params]
+        return _java2py(sc, javaTestObj.test(dataset, sampleCol, distName,
--- End diff --

thanks for checking this out! current usage sounds fair!
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user WeichenXu123 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179311446

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a
+    continuous distribution. By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    :param dataset:
+      a dataset or a dataframe containing the sample of data to test.
+    :param sampleCol:
+      Name of sample column in dataset, of any numerical type.
+    :param distName:
+      a `string` name for a theoretical distribution, currently only support "norm".
+    :param params:
+      a list of `Double` values specifying the parameters to be used for the theoretical
+      distribution
+    :return:
+      A dataframe that contains the Kolmogorov-Smirnov test result for the input sampled data.
+      This DataFrame will contain a single Row with the following fields:
+      - `pValue: Double`
+      - `statistic: Double`
+
+    >>> from pyspark.ml.stat import KolmogorovSmirnovTest
+    >>> dataset = [[-1.0], [0.0], [1.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).collect()[0]
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+    >>> dataset = [[2.0], [3.0], [4.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 3.0, 1.0).collect()[0]
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def test(dataset, sampleCol, distName, *params):
+        """
+        Perform a Kolmogorov-Smirnov test using dataset.
+        """
+        sc = SparkContext._active_spark_context
+        javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
+        dataset = _py2java(sc, dataset)
+        params = [float(param) for param in params]
+        return _java2py(sc, javaTestObj.test(dataset, sampleCol, distName,
--- End diff --

- I think for primitive types we can leave out `_py2java` when passing args (there are also some examples of this in pyspark).
- Have you tried `callJavaFunc`? I tried it, but some errors occurred; I'm not sure about it.
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user yogeshg commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179278641

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a
+    continuous distribution. By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    :param dataset:
+      a dataset or a dataframe containing the sample of data to test.
+    :param sampleCol:
+      Name of sample column in dataset, of any numerical type.
+    :param distName:
+      a `string` name for a theoretical distribution, currently only support "norm".
+    :param params:
+      a list of `Double` values specifying the parameters to be used for the theoretical
+      distribution
+    :return:
+      A dataframe that contains the Kolmogorov-Smirnov test result for the input sampled data.
+      This DataFrame will contain a single Row with the following fields:
+      - `pValue: Double`
+      - `statistic: Double`
+
+    >>> from pyspark.ml.stat import KolmogorovSmirnovTest
+    >>> dataset = [[-1.0], [0.0], [1.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).collect()[0]
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+    >>> dataset = [[2.0], [3.0], [4.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 3.0, 1.0).collect()[0]
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def test(dataset, sampleCol, distName, *params):
+        """
+        Perform a Kolmogorov-Smirnov test using dataset.
+        """
+        sc = SparkContext._active_spark_context
+        javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
+        dataset = _py2java(sc, dataset)
+        params = [float(param) for param in params]
+        return _java2py(sc, javaTestObj.test(dataset, sampleCol, distName,
--- End diff --

do we need to do `_py2java(sc, *)` for `sampleCol` and/or `distName`? I noticed that in `ChiSquaredTest` but I'm not sure.
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user yogeshg commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179283700

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a
+    continuous distribution. By comparing the largest difference between the empirical cumulative
+    distribution of the sample data and the theoretical distribution we can provide a test for the
+    the null hypothesis that the sample data comes from that theoretical distribution.
+
+    :param dataset:
+      a dataset or a dataframe containing the sample of data to test.
+    :param sampleCol:
+      Name of sample column in dataset, of any numerical type.
+    :param distName:
+      a `string` name for a theoretical distribution, currently only support "norm".
+    :param params:
+      a list of `Double` values specifying the parameters to be used for the theoretical
+      distribution
+    :return:
+      A dataframe that contains the Kolmogorov-Smirnov test result for the input sampled data.
+      This DataFrame will contain a single Row with the following fields:
+      - `pValue: Double`
+      - `statistic: Double`
+
+    >>> from pyspark.ml.stat import KolmogorovSmirnovTest
+    >>> dataset = [[-1.0], [0.0], [1.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).collect()[0]
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+    >>> dataset = [[2.0], [3.0], [4.0]]
+    >>> dataset = spark.createDataFrame(dataset, ['sample'])
+    >>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 3.0, 1.0).collect()[0]
+    >>> round(ksResult.pValue, 3)
+    1.0
+    >>> round(ksResult.statistic, 3)
+    0.175
+
+    .. versionadded:: 2.4.0
+
+    """
+    @staticmethod
+    @since("2.4.0")
+    def test(dataset, sampleCol, distName, *params):
+        """
+        Perform a Kolmogorov-Smirnov test using dataset.
+        """
+        sc = SparkContext._active_spark_context
+        javaTestObj = _jvm().org.apache.spark.ml.stat.KolmogorovSmirnovTest
+        dataset = _py2java(sc, dataset)
+        params = [float(param) for param in params]
+        return _java2py(sc, javaTestObj.test(dataset, sampleCol, distName,
--- End diff --

Looks like we should be able to use `pyspark.ml.common.callJavaFunc`:

```python
from pyspark.ml.common import _java2py, _py2java, callJavaFunc
...
return callJavaFunc(sc, javaTestObj, dataset, sampleCol, distName, *params)
```

but I am not fully sure if the `_jvm().PythonUtils.toSeq()` method will still be required. 🤔
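The marshalling question in this sub-thread — why the implementation coerces `*params` to floats but passes `sampleCol` and `distName` through untouched — can be sketched with a stand-in for the JVM side. The `jvm_ks_test` function below is hypothetical (the real call goes through py4j to the Scala `KolmogorovSmirnovTest.test`, whose `params` is a `Double*` varargs), but it shows the shape of the coercion:

```python
def jvm_ks_test(dataset, sample_col, dist_name, *params):
    # Stand-in for the py4j-backed Scala method: a Double* varargs endpoint
    # only accepts IEEE doubles, so every parameter must already be a float
    # by the time it crosses the bridge.
    if not all(isinstance(p, float) for p in params):
        raise TypeError("params must be doubles")
    return {"distName": dist_name, "params": params}

def ks_test(dataset, sample_col, dist_name, *params):
    # Plain str arguments need no explicit conversion (py4j handles
    # primitives automatically).  Only params is coerced, since callers
    # commonly pass Python ints, e.g. 0 and 1 for a standard normal.
    params = [float(p) for p in params]
    return jvm_ks_test(dataset, sample_col, dist_name, *params)

print(ks_test(None, "sample", "norm", 0, 1)["params"])  # (0.0, 1.0)
```

Without the coercion, integer arguments would reach the stand-in (and, analogously, the JVM bridge) with the wrong type.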
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
Github user yogeshg commented on a diff in the pull request:
https://github.com/apache/spark/pull/20904#discussion_r179267921

--- Diff: python/pyspark/ml/stat.py ---
@@ -134,6 +134,63 @@ def corr(dataset, column, method="pearson"):
         return _java2py(sc, javaCorrObj.corr(*args))


+class KolmogorovSmirnovTest(object):
+    """
+    .. note:: Experimental
+
+    Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a
--- End diff --

We can add a one line description and then a paragraph full of description; this way we'll get [docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.stat.package) consistent with the rest of the tests in the package.

```scala
/**
 ...
 * Conduct two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous distribution.
 *
 * By comparing the largest difference between the empirical cumulative
 ...
 */
```
[GitHub] spark pull request #20904: [SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff te...
GitHub user WeichenXu123 opened a pull request:
https://github.com/apache/spark/pull/20904

[SPARK-23751][ML][PySpark] Kolmogorov-Smirnoff test Python API in pyspark.ml

## What changes were proposed in this pull request?

Kolmogorov-Smirnoff test Python API in `pyspark.ml`

**Note:** The API with `CDF` is a little difficult to support in Python. We can add it in a follow-up PR.

## How was this patch tested?

doctest

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark ks-test-py

Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20904.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20904

commit 6978780dbd0bd55d3bda535f84267353ea0bbfde
Author: WeichenXu
Date: 2018-03-26T09:43:49Z

    init pr