[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression_doc
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13381#discussion_r65206320

--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.regression.IsotonicRegression
+// $example off$
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.types.{DoubleType, StructType}
+
+/**
+ * An example demonstrating Isotonic Regression.
+ * Run with
+ * {{{
+ * bin/run-example ml.IsotonicRegressionExample
+ * }}}
+ */
+object IsotonicRegressionExample {
+
+  def main(args: Array[String]): Unit = {
+
+    // Creates a SparkSession.
+    val spark = SparkSession
+      .builder
+      .appName(s"${this.getClass.getSimpleName}")
+      .getOrCreate()
+
+    // $example on$
+    val ir = new IsotonicRegression().setIsotonic(true)
+      .setLabelCol("label").setFeaturesCol("features").setWeightCol("weight")
+
+    val dataReader = spark.read
+    dataReader.schema(new StructType().add("label", DoubleType).add("features", DoubleType))
+
+    var data = dataReader.csv("data/mllib/sample_isotonic_regression_data.txt")
--- End diff --

Ditto, use libsvm format.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
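The reviewer's "use libsvm format" refers to the sparse `label index:value` text format used by the `data/mllib` sample files. As a rough, hypothetical sketch of what that format contains (not Spark's actual reader, and with made-up sample values), one line could be parsed like this:

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM-format line: '<label> <index>:<value> ...'.

    Indices in the file are 1-based; absent indices are implicitly zero
    (sparse representation), so features are returned as a dict.
    """
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

# A made-up single-feature line, in the shape of the isotonic regression data.
label, features = parse_libsvm_line("0.245 1:0.01")
```

In Spark itself the equivalent would simply be `spark.read.format("libsvm").load(path)`.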
[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression_doc
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13381#discussion_r65207453

--- Diff: docs/ml-classification-regression.md ---
@@ -685,6 +685,88 @@ The implementation matches the result from R's survival function
+## Isotonic regression
+[Isotonic regression](http://en.wikipedia.org/wiki/Isotonic_regression)
+belongs to the family of regression algorithms. Formally isotonic regression is a problem where
+given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
+and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted
+finding a function that minimises
+
+`\begin{equation}
+  f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
+\end{equation}`
+
+with respect to complete order subject to
+`$x_1\le x_2\le ...\le x_n$` where `$w_i$` are positive weights.
+The resulting function is called isotonic regression and it is unique.
+It can be viewed as least squares problem under order restriction.
+Essentially isotonic regression is a
+[monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
+best fitting the original data points.
+
+`spark.ml` supports a
+[pool adjacent violators algorithm](http://doi.org/10.1198/TECH.2010.10111)
+which uses an approach to
+[parallelizing isotonic regression](http://doi.org/10.1007/978-3-642-99789-1_10).
+The training input is a RDD of tuples of three double values that represent
+label, feature and weight in this order. Additionally IsotonicRegression algorithm has one
+optional parameter called $isotonic$ defaulting to true.
+This argument specifies if the isotonic regression is
+isotonic (monotonically increasing) or antitonic (monotonically decreasing).
+
+Training returns an IsotonicRegressionModel that can be used to predict
+labels for both known and unknown features. The result of isotonic regression
+is treated as piecewise linear function. The rules for prediction therefore are:
+
+* If the prediction input exactly matches a training feature
+  then associated prediction is returned. In case there are multiple predictions with the same
+  feature then one of them is returned. Which one is undefined
+  (same as java.util.Arrays.binarySearch).
+* If the prediction input is lower or higher than all training features
+  then prediction with lowest or highest feature is returned respectively.
+  In case there are multiple predictions with the same feature
+  then the lowest or highest is returned respectively.
+* If the prediction input falls between two training features then prediction is treated
+  as piecewise linear function and interpolated value is calculated from the
+  predictions of the two closest features. In case there are multiple values
+  with the same feature then the same rules as in previous point are used.
+
+### Examples
+
+Data are read from a file where each line has a format label,feature
+i.e. 4710.28,500.00. The data are split to training and testing set.
+Model is created using the training set and a mean squared error is calculated from the predicted
+labels and real labels in the test set.
+
+Refer to the [`IsotonicRegression` Scala docs](api/scala/index.html#org.apache.spark.ml.regression.IsotonicRegression) for details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala %}
+
+Data are read from a file where each line has a format label,feature
+i.e. 4710.28,500.00. The data are split to training and testing set.
+Model is created using the training set and a mean squared error is calculated from the predicted
+labels and real labels in the test set.
+
+Refer to the [`IsotonicRegression` Java docs](api/java/org/apache/spark/ml/regression/IsotonicRegression.html) for details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaIsotonicRegressionExample.java %}
+
+Data are read from a file where each line has a format label,feature
--- End diff --

This section is identical for the different languages, so it's better we can move it out of the ```div``` and eliminate the repetition.
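The pool-adjacent-violators algorithm referenced in the quoted doc section has a compact sequential form. This is an illustrative Python sketch of the unparallelized idea only, not Spark's implementation: walk the points in feature order and merge adjacent blocks whose weighted means violate monotonicity.

```python
def isotonic_fit(y, w=None):
    """Pool Adjacent Violators: weighted least-squares fit that is
    monotonically non-decreasing in the given order of y."""
    if w is None:
        w = [1.0] * len(y)
    # Each block holds [weighted mean, total weight, point count].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge while the last two blocks violate the ordering constraint.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            merged_w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / merged_w, merged_w, n1 + n2])
    # Expand each block back to one fitted value per input point.
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)
    return fitted
```

For example, `isotonic_fit([1.0, 3.0, 2.0, 4.0])` pools the violating pair (3, 2) into their mean 2.5, giving `[1.0, 2.5, 2.5, 4.0]`; a fully decreasing input collapses to a single constant block.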
[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression_doc
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13381#discussion_r65207910

--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaIsotonicRegressionExample.java ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.ml;
+
+// $example on$
+import java.util.List;
+
+import org.apache.spark.ml.regression.IsotonicRegression;
+import org.apache.spark.ml.regression.IsotonicRegressionModel;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.types.StructType;
+// $example off$
+
+public class JavaIsotonicRegressionExample {
+  public static void main(String[] args) {
+    // Create a SparkSession.
+    SparkSession spark = SparkSession
+      .builder()
+      .appName("JavaIsotonicRegression")
--- End diff --

JavaIsotonicRegressionExample
[GitHub] spark pull request: [SPARK-15605] [ML] [Examples] Remove JavaDeveloperApiExa...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/13353 cc @mengxr @jkbradley
[GitHub] spark pull request: [SPARK-13590] [ML] [Doc] Document spark.ml LiR, LoR and ...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/12731 ping @mengxr
[GitHub] spark pull request: [SPARK-15177] [SparkR] [ML] SparkR 2.0 QA: New R APIs an...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/13023 @vectorijk There is a separate PR focused on updating the machine learning section of the SparkR user guide. FYI #13285. Thanks.
[GitHub] spark pull request: [SPARK-15587] [ML] ML 2.0 QA: Scala APIs audit for ml.fe...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13410#discussion_r65192762

--- Diff: python/pyspark/ml/feature.py ---
@@ -1481,6 +1474,10 @@ class StandardScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, J
     Standardizes features by removing the mean and scaling to unit variance using column summary
     statistics on the samples in the training set.

+    The "unit std" is computed using the `corrected sample standard deviation \
+        <https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation>`_,
+    which is computed as the square root of the unbiased sample variance.
+
--- End diff --

Sync the docs with the Scala one.
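The "corrected sample standard deviation" mentioned in the quoted docstring is the square root of the unbiased sample variance (divide by n - 1, not n). A minimal dependency-free sketch of the scaling StandardScaler describes, for illustration only:

```python
import math

def corrected_std(xs):
    """Corrected sample standard deviation: the square root of the
    unbiased sample variance (sum of squared deviations over n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(var)

def standardize(xs):
    """Remove the mean and scale to unit (corrected) standard deviation."""
    mean = sum(xs) / len(xs)
    std = corrected_std(xs)
    return [(x - mean) / std for x in xs]
```

For `[2.0, 4.0, 6.0]` the corrected variance is (4 + 0 + 4) / 2 = 4, so the corrected std is 2, and standardizing yields values with zero mean and unit corrected std.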
[GitHub] spark pull request #13675: [SPARK-15957] [ML] RFormula supports forcing to i...
GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/13675

[SPARK-15957] [ML] RFormula supports forcing to index label

## What changes were proposed in this pull request?
Add a param so that users can force the label to be indexed whether it is numeric or string. For classification algorithms, we force label indexing by setting it to true.

## How was this patch tested?
Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-15957

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13675.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #13675

commit 24ef9baa3b2aa110ce447d935959855fcb954a8e
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2016-06-14T23:58:00Z

    RFormula supports forcing to index label
[GitHub] spark issue #13662: [SPARK-15945] [MLLIB] Conversion between old/new vector ...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13662 It looks like the merge script is not happy, I will retry later.
[GitHub] spark issue #13662: [SPARK-15945] [MLLIB] Conversion between old/new vector ...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13662 LGTM, merged into master and branch-2.0. Thanks!
[GitHub] spark issue #13675: [SPARK-15957] [ML] RFormula supports forcing to index la...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13675 cc @mengxr @jkbradley
[GitHub] spark issue #13381: [SPARK-15608][ml][examples][doc] add examples and docume...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13381 Merged into master and branch-2.0. Thanks!
[GitHub] spark issue #13731: [SPARK-15946] [MLLIB] Conversion between old/new vector ...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13731 LGTM, merged into master and branch-2.0. Thanks!
[GitHub] spark issue #13641: [SPARK-10258][DOC][ML] Add @Since annotations to ml.feat...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13641 @MLnick I found you did not add ```@Since``` for all param definitions; is this expected? I think we should add them.
[GitHub] spark pull request #13641: [SPARK-10258][DOC][ML] Add @Since annotations to ...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13641#discussion_r67245508

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MaxAbsScaler.scala ---
@@ -88,7 +91,7 @@ class MaxAbsScaler @Since("2.0.0") (override val uid: String)
   override def copy(extra: ParamMap): MaxAbsScaler = defaultCopy(extra)
 }

-@Since("1.6.0")
+@Since("2.0.0")
 object MaxAbsScaler extends DefaultParamsReadable[MaxAbsScaler] {

   @Since("1.6.0")
--- End diff --

And L174 and L177 should also be ```@Since("2.0.0")```.
[GitHub] spark pull request #13641: [SPARK-10258][DOC][ML] Add @Since annotations to ...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13641#discussion_r67245105

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MaxAbsScaler.scala ---
@@ -88,7 +91,7 @@ class MaxAbsScaler @Since("2.0.0") (override val uid: String)
   override def copy(extra: ParamMap): MaxAbsScaler = defaultCopy(extra)
 }

-@Since("1.6.0")
+@Since("2.0.0")
 object MaxAbsScaler extends DefaultParamsReadable[MaxAbsScaler] {

   @Since("1.6.0")
--- End diff --

Not due to this PR, but this seems like a typo, since ```MaxAbsScaler``` was added after 1.6.0.
[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10306#issuecomment-171942083 @mengxr Thanks for the prompt. I will check my environment and re-run the test.
[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...
GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/10806

[SPARK-8519][SPARK-11560][SPARK-11559] [ML] [MLlib] Optimize KMeans implementation

* Use BLAS Level 3 matrix-matrix multiplications to compute pairwise distances in k-means.
* Remove the runs-related code completely; it will have no effect after this change.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-8519-new

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10806.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10806

commit 28fa06a449f27e8d51e8e737963d1bf982d731e3
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2016-01-18T14:48:06Z

    Initial draft of KMeans optimization

commit e929e355281b4d13dae04650d11b6b510de72e26
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2016-01-18T14:51:59Z

    Disable one test of PIC
[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...
Github user yanboliang commented on the pull request:

    https://github.com/apache/spark/pull/10306#issuecomment-172558579

@mengxr I found the misconfiguration of my test environment and updated it, thanks! Now ```gemm``` is about 20-30 times faster than ```axpy/dot``` in the updated test cases.

```scala
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)

val n = 3000
val count = 10
val random = new Random()

val a = Vectors.dense(Array.fill(n)(random.nextDouble()))
val aa = Array.fill(n)(a)
val b = Vectors.dense(Array.fill(n)(random.nextDouble()))
val bb = Array.fill(n)(b)

val a1 = new DenseMatrix(n, n, aa.flatMap(_.toArray), true)
val b1 = new DenseMatrix(n, n, bb.flatMap(_.toArray), false)
val c1 = Matrices.zeros(n, n).asInstanceOf[DenseMatrix]

var total1 = 0.0
// Trial runs
for (i <- 0 until 10) {
  gemm(2.0, a1, b1, 2.0, c1)
}
for (i <- 0 until count) {
  val start = System.nanoTime()
  gemm(2.0, a1, b1, 2.0, c1)
  total1 += (System.nanoTime() - start) / 1e9
}
total1 = total1 / count
println("gemm elapsed time: = %.3f".format(total1) + " seconds.")

// Trial runs
for (m <- 0 until 10) {
  for (i <- 0 until n; j <- 0 until n) {
    dot(bb(j), aa(i))
  }
}
var total2 = 0.0
for (m <- 0 until count) {
  val start = System.nanoTime()
  for (i <- 0 until n; j <- 0 until n) {
    // axpy(1.0, bb(j), aa(i))
    dot(bb(j), aa(i))
  }
  total2 += (System.nanoTime() - start) / 1e9
}
total2 = total2 / count
println("dot elapsed time: = %.3f".format(total2) + " seconds.")
```

The output is:

```
com.github.fommil.netlib.NativeSystemBLAS
gemm elapsed time: = 1.022 seconds.
dot elapsed time: = 29.017 seconds.
```
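The speedup in the benchmark above comes from replacing n * n separate BLAS level-1 `dot` calls with a single level-3 `gemm` call over the whole matrix. A NumPy sketch of the same equivalence (illustrative only, not the Spark code; sizes shrunk so it runs instantly):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 8
A = rng.random((n, d))
B = rng.random((n, d))

# One level-3 matrix-matrix multiply computes every pairwise dot product.
pairwise = A @ B.T

# The same result via n*n individual level-1 dot calls.
looped = np.empty((n, n))
for i in range(n):
    for j in range(n):
        looped[i, j] = np.dot(A[i], B[j])

# Identical results; the single gemm call is far faster at scale because
# the BLAS kernel exploits cache blocking and vectorization.
assert np.allclose(pairwise, looped)
```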
[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...
Github user yanboliang closed the pull request at: https://github.com/apache/spark/pull/10306
[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10306#issuecomment-172562607 @mengxr I have a new and more advanced implementation for this issue at #10806; let's move the discussion there. I will close this PR now.
[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10597#issuecomment-171910873 @shivaram Just like @felixcheung commented, the ```hash``` function was added only in 2.0.0, so reverting it from branch 1.6 will fix the broken test.
[GitHub] spark pull request: [SPARK-12905] [ML] [PySpark] PCAModel return e...
GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/10830

[SPARK-12905] [ML] [PySpark] PCAModel return eigenvalues for PySpark

```PCAModel``` return eigenvalues for PySpark.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-12905

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10830.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10830

commit 0fedf916676d447bc23b36e60c323fbfae94a1e1
Author: Yanbo Liang <yblia...@gmail.com>
Date: 2016-01-19T08:52:28Z

    PCAModel return eigenvalues for PySpark
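For context on what the exposed eigenvalues mean: in PCA the eigenvalues are those of the sample covariance matrix of the centered data, i.e. the variance explained by each principal component. A NumPy sketch under that assumption (illustrative, unrelated to the Spark implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((100, 3))

# Center the data; PCA eigenvalues are the eigenvalues of the sample
# covariance matrix (equivalently, squared singular values / (n - 1)).
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / (X.shape[0] - 1)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; flip to descending

# Cross-check against the SVD route.
s = np.linalg.svd(Xc, compute_uv=False)
assert np.allclose(eigvals, s**2 / (X.shape[0] - 1))
```

The eigenvalues also sum to the total (corrected) variance of the centered data, which is why they are commonly reported as explained-variance ratios.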
[GitHub] spark pull request: [SPARK-12903] [SparkR] Add covar_samp and cova...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10829#issuecomment-172784554 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-12903] [SparkR] Add covar_samp and cova...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/10829 [SPARK-12903] [SparkR] Add covar_samp and covar_pop for SparkR Add ```covar_samp``` and ```covar_pop``` for SparkR. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-12903 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10829.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10829 commit 7c3e718f1bce57baea852b909e6f50500343dd36 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-01-19T08:08:57Z Add covar_samp and covar_pop for SparkR
[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13378 @MLnick I have updated the [JIRA](https://issues.apache.org/jira/browse/SPARK-15643?focusedCommentId=15343059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15343059) with the new deprecations in this PR. As for the vector conversions issue, I think it fits better in your section. Thanks!
[GitHub] spark pull request #13935: [SPARK-16242] [MLlib] [PySpark] Conversion betwee...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/13935 [SPARK-16242] [MLlib] [PySpark] Conversion between old/new matrix columns in a DataFrame (Python) ## What changes were proposed in this pull request? This PR implements Python wrappers for #13888 to convert old/new matrix columns in a DataFrame. ## How was this patch tested? Doctests in Python. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-16242 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13935.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13935 commit 11789339b0eab023bca61e24ac5e73f715a2d97a Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-06-28T03:09:58Z Conversion between old/new matrix columns in a DataFrame (Python)
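The conversion the PR wraps is, in shape, a per-column map from the old mllib matrix type to the new ml one, leaving other columns untouched. A toy pure-Python sketch of that shape (the ```OldDenseMatrix```/```NewDenseMatrix``` classes and helper name here are illustrative stand-ins, not the real Spark classes or API):

```python
class NewDenseMatrix:
    """Stand-in for the new ml.linalg matrix type."""
    def __init__(self, num_rows, num_cols, values):
        self.num_rows, self.num_cols, self.values = num_rows, num_cols, values

class OldDenseMatrix:
    """Stand-in for the old mllib.linalg matrix type."""
    def __init__(self, num_rows, num_cols, values):
        self.num_rows, self.num_cols, self.values = num_rows, num_cols, values

    def as_ml(self):
        # Same underlying data, new type -- in the spirit of mllib's asML.
        return NewDenseMatrix(self.num_rows, self.num_cols, self.values)

def convert_matrix_columns_to_ml(rows, matrix_cols):
    """Convert only the named matrix columns of each row to the new type."""
    return [
        {k: (v.as_ml() if k in matrix_cols else v) for k, v in row.items()}
        for row in rows
    ]

rows = [{"id": 0, "m": OldDenseMatrix(2, 2, [1.0, 0.0, 0.0, 1.0])}]
converted = convert_matrix_columns_to_ml(rows, {"m"})
print(type(converted[0]["m"]).__name__)  # the matrix column now has the new type
```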
[GitHub] spark issue #13935: [SPARK-16242] [MLlib] [PySpark] Conversion between old/n...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13935 cc @hhbyyh @mengxr
[GitHub] spark issue #13937: [SPARK-16245] [ML] model loading backward compatibility ...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/13937 cc @hhbyyh @mengxr
[GitHub] spark pull request #13937: [SPARK-16245] [ML] model loading backward compati...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/13937#discussion_r68697383 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala --- @@ -206,24 +206,21 @@ object PCAModel extends MLReadable[PCAModel] { override def load(path: String): PCAModel = { val metadata = DefaultParamsReader.loadMetadata(path, sc, className) - // explainedVariance field is not present in Spark <= 1.6 - val versionRegex = "([0-9]+)\\.([0-9]+).*".r - val hasExplainedVariance = metadata.sparkVersion match { -case versionRegex(major, minor) => - major.toInt >= 2 || (major.toInt == 1 && minor.toInt > 6) -case _ => false - } + val versionRegex = "([0-9]+)\\.(.+)".r + val versionRegex(major, _) = metadata.sparkVersion val dataPath = new Path(path, "data").toString - val model = if (hasExplainedVariance) { + val model = if (major.toInt >= 2) { val Row(pc: DenseMatrix, explainedVariance: DenseVector) = sparkSession.read.parquet(dataPath) .select("pc", "explainedVariance") .head() new PCAModel(metadata.uid, pc, explainedVariance) } else { -val Row(pc: DenseMatrix) = sparkSession.read.parquet(dataPath).select("pc").head() -new PCAModel(metadata.uid, pc, Vectors.dense(Array.empty[Double]).asInstanceOf[DenseVector]) +// explainedVariance field is not present and we use the old matrix in Spark <= 2.0 +val Row(pc: OldDenseMatrix) = sparkSession.read.parquet(dataPath).select("pc").head() +new PCAModel(metadata.uid, pc.asML, + Vectors.dense(Array.empty[Double]).asInstanceOf[DenseVector]) --- End diff -- Here we combine the ```explainedVariance``` field issue and the old matrix issue together to handle backward compatibility.
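The version gate in the loader diff above boils down to parsing ```metadata.sparkVersion``` with the regex ```([0-9]+)\.(.+)``` and branching on the major version. A minimal pure-Python sketch of that check, outside of Spark (the function name is ours, not Spark's):

```python
import re

def needs_old_matrix_format(spark_version: str) -> bool:
    """Mirror the loader's gate: models written by Spark with major
    version < 2 use the old mllib matrix type and lack the
    explainedVariance field, so the reader must take the legacy path."""
    match = re.match(r"([0-9]+)\.(.+)", spark_version)
    if match is None:
        raise ValueError(f"unparseable Spark version: {spark_version!r}")
    major = int(match.group(1))
    return major < 2

print(needs_old_matrix_format("1.6.3"))           # legacy path
print(needs_old_matrix_format("2.1.0-SNAPSHOT"))  # new path
```

Note the looser regex ```([0-9]+)\.(.+)``` (versus the earlier ```([0-9]+)\.([0-9]+).*```) still matches snapshot strings like "2.1.0-SNAPSHOT", since only the major version matters here.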
[GitHub] spark pull request #13937: [SPARK-16245] [ML] model loading backward compati...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/13937 [SPARK-16245] [ML] model loading backward compatibility for ml.feature.PCA ## What changes were proposed in this pull request? Model loading backward compatibility for ml.feature.PCA. ## How was this patch tested? Existing unit tests and a manual test for loading old models. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-16245 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13937.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13937 commit 5246bcfa1ba510c281c456b0f61bf32f70d10174 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-06-28T04:42:41Z model loading backward compatibility for ml.feature.PCA
[GitHub] spark pull request #13023: [SPARK-15177] [SparkR] [ML] SparkR 2.0 QA: New R ...
Github user yanboliang closed the pull request at: https://github.com/apache/spark/pull/13023
[GitHub] spark pull request #13888: [SPARK-16187] [ML] Implement util method for ML M...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/13888#discussion_r68486013 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala --- @@ -309,8 +309,8 @@ object MLUtils extends Logging { } /** - * Converts vector columns in an input Dataset to the [[org.apache.spark.ml.linalg.Vector]] type - * from the new [[org.apache.spark.mllib.linalg.Vector]] type under the `spark.ml` package. + * Converts vector columns in an input Dataset to the [[org.apache.spark.mllib.linalg.Vector]] + * type from the new [[org.apache.spark.ml.linalg.Vector]] type under the `spark.ml` package. * @param dataset input dataset --- End diff -- The original annotation is correct; it is not necessary to revise it.
[GitHub] spark pull request: [SPARK-13037][ML][PySpark] PySpark ml.recommen...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11044#issuecomment-180395685 LGTM
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r52330009 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.Logging +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), + s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " + +s"link function.") + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]]) + * specified by giving a symbolic description of the linear predictor and + * a description of the error distribution. 
+ */ +@Experimental +@Since("2.0.0") +class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String) + extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel] + with GeneralizedLinearRegressionParams with Logging { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("genLinReg")) + + /** + * Set the name of family which is a description of the error distribution + * to be used in the model. + * @group setParam + */ + @Since("2.0.0") + def setFamily(value: String): this.type = set(family, value) + + /** + * Set the name of the model link function. + * @group setParam + */ + @Since("2.0.0") + def setLink(value: String): this.type = set(link, value)
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r52331693 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- (same diff as quoted above; comment truncated)
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/11136 [SPARK-12811] [ML] Estimator for Generalized Linear Models (GLMs) Estimator for Generalized Linear Models (GLMs). You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-12811 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11136.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11136 commit 5af604e180d92a75b18b30717d6516d5ee8135bd Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-02-09T15:37:27Z Initial version of Generalized Linear Regression
[GitHub] spark pull request: [SPARK-12962] [SQL] [PySpark] PySpark support ...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10876#issuecomment-180175154 ping @davies
[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10889#issuecomment-183240342 @mengxr Actually, I have already updated this PR after #10216 got merged.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r52610253 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- (same diff as quoted above; comment truncated)
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r52611868 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- (same diff as quoted above; comment truncated)
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r52611593 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,472 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.Logging +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"the name of family which is a description of the error distribution to be used in the model", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of the model link function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "the name of the model link function", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + @Since("2.0.0") + override def validateParams(): Unit = { + require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)), --- End diff -- Good point! But we cannot check ```isSet(link)``` in the setter for family, because users may set family before setting link, and that would raise a spurious error. We can check ```isSet(link)``` at the start of ```train```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
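The deferral described in the comment above (validate the family/link pair only once both params are final, at the start of training, rather than inside either setter) can be sketched in plain Python. The class and method names below are illustrative only, not Spark's actual API; the real implementation is Scala.

```python
# Supported (family, link) pairs, mirroring the options listed in the diff.
SUPPORTED_PAIRS = {
    ("gaussian", "identity"), ("gaussian", "log"), ("gaussian", "inverse"),
    ("binomial", "logit"), ("binomial", "probit"), ("binomial", "cloglog"),
    ("poisson", "log"), ("poisson", "identity"), ("poisson", "sqrt"),
    ("gamma", "inverse"), ("gamma", "identity"), ("gamma", "log"),
}

class Glr:
    def __init__(self):
        self.family = "gaussian"
        self.link = None  # unset means "use the family's default link"

    def set_family(self, value):
        # No pair check here: the user may call set_link afterwards,
        # so validating now could reject a combination that is about
        # to become valid.
        self.family = value
        return self

    def set_link(self, value):
        # Likewise, no pair check here.
        self.link = value
        return self

    def fit(self, data):
        # Both params are final by now, so the pair check is safe.
        if self.link is not None and (self.family, self.link) not in SUPPORTED_PAIRS:
            raise ValueError("%s family does not support %s link"
                             % (self.family, self.link))
        return "model"
```

Setting link before family (or vice versa) works either way, because the check only fires in `fit`.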
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-182905997 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-11939] [ML] [PySpark] PySpark support m...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10469#issuecomment-175472122 @jkbradley Thanks for your comments! I have made ```MLReadable``` and ```MLWritable``` more general and not specific to Java wrappers, and addressed all comments except for setting the ```Param.parent```. I left inline comments in the threads above.
[GitHub] spark pull request: [SPARK-11939] [ML] [PySpark] PySpark support m...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10469#discussion_r50950853 --- Diff: python/pyspark/ml/wrapper.py --- @@ -82,13 +71,16 @@ def _transfer_params_to_java(self): pair = self._make_java_param_pair(param, paramMap[param]) self._java_obj.set(pair) -def _transfer_params_from_java(self): +def _transfer_params_from_java(self, withParent=False): --- End diff -- I tried putting the setting of the ```Param.parent``` field into ```load()```, which would add a code snippet like this to ```JavaMLReader.load()```: ```python for param in self._instance.params: value = self._instance._paramMap[param] self._instance._paramMap.pop(param) param.parent = self._instance.uid self._instance._paramMap[param] = value ``` It first pops elements from the dict and then pushes new ones, which I think is inefficient. This code also operates on ```_paramMap```, which is the ML instance's internal variable. So I vote to do this work in ```_transfer_params_from_java```, or we can add a new function there to do the same work. I'm open to hearing your thoughts. @jkbradley
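The pop-and-reinsert dance in the snippet above matters whenever a ```Param``` is used as a dict key and its identity depends on its owner. A minimal standalone sketch (a hypothetical ```Param``` that hashes on ```(parent, name)```; not the actual PySpark class) shows why the key must be removed before its parent is mutated:

```python
class Param:
    """Toy param keyed by its owner's uid plus its name."""
    def __init__(self, parent, name):
        self.parent = parent
        self.name = name

    def __eq__(self, other):
        return (isinstance(other, Param)
                and (self.parent, self.name) == (other.parent, other.name))

    def __hash__(self):
        # Mutating self.parent while this object is a dict key would
        # leave the entry stranded under the old hash bucket.
        return hash((self.parent, self.name))

paramMap = {}
p = Param("undefined", "maxIter")
paramMap[p] = 10

# Safe rebinding: pop under the old hash, mutate, reinsert under the new one.
value = paramMap.pop(p)
p.parent = "lr_1234"
paramMap[p] = value

assert paramMap[Param("lr_1234", "maxIter")] == 10
```

This is the inefficiency being weighed in the review: every loaded param pays a pop plus a reinsert just to fix up its parent.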
[GitHub] spark pull request: [SPARK-11939] [ML] [PySpark] PySpark support m...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10469#discussion_r50948067 --- Diff: python/pyspark/ml/util.py --- @@ -52,3 +71,141 @@ def _randomUID(cls): concatenates the class name, "_", and 12 random hex chars. """ return cls.__name__ + "_" + uuid.uuid4().hex[12:] + + +@inherit_doc +class MLWriter(object): +""" +Abstract class for utility classes that can save ML instances. + +.. versionadded:: 2.0.0 +""" + +def __init__(self, instance): +self._jwrite = instance._java_obj.write() + +@since("2.0.0") +def save(self, path): +"""Saves the ML instances to the input path.""" +self._jwrite.save(path) + +@since("2.0.0") +def overwrite(self): +"""Overwrites if the output path already exists.""" +self._jwrite.overwrite() +return self + +@since("2.0.0") +def context(self, sqlContext): +"""Sets the SQL context to use for saving.""" +self._jwrite.context(sqlContext._ssql_ctx) +return self + + +@inherit_doc +class MLWritable(object): +""" +Mixin for ML instances that provide MLWriter through their Scala +implementation. + +.. versionadded:: 2.0.0 +""" + +@since("2.0.0") --- End diff -- Yes, I removed the ```Since``` annotations.
[GitHub] spark pull request: [SPARK-13047][PYSPARK][ML] Pyspark Params.hasP...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10962#discussion_r51087340 --- Diff: python/pyspark/ml/param/__init__.py --- @@ -152,13 +152,17 @@ def isDefined(self, param): return self.isSet(param) or self.hasDefault(param) @since("1.4.0") -def hasParam(self, paramName): +def hasParam(self, param): """ -Tests whether this instance contains a param with a given -(string) name. +Tests whether this instance contains a param. """ -param = self._resolveParam(paramName) -return param in self.params +if isinstance(param, Param): +return hasattr(self, param.name) --- End diff -- +1 for accepting only strings. If there are no strong reasons otherwise, staying consistent with Scala is the best choice.
[GitHub] spark pull request: [SPARK-13032] [ML] [PySpark] PySpark support m...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10469#issuecomment-176070210 @jkbradley Your PR looks good and got merged, thanks!
[GitHub] spark pull request: [Minor] [ML] [PySpark] Cleanup test cases of c...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10975#issuecomment-177743797 @mengxr I did not find any other class with a similar test except ```KMeans```; is this deliberately designed, or are some test cases missing?
[GitHub] spark pull request: [SPARK-13035] [ML] [PySpark] PySpark ml.cluste...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10999#discussion_r51375854 --- Diff: python/pyspark/ml/clustering.py --- @@ -69,6 +70,25 @@ class KMeans(JavaEstimator, HasFeaturesCol, HasPredictionCol, HasMaxIter, HasTol True >>> rows[2].prediction == rows[3].prediction True +>>> import os, tempfile --- End diff -- hmm... Here we combine the test and example functions. I do not have a strong preference about whether this should live here or in the tests file. @jkbradley
[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11000#discussion_r51400507 --- Diff: python/pyspark/ml/regression.py --- @@ -447,7 +447,7 @@ def _create_model(self, java_model): @inherit_doc -class DecisionTreeModel(JavaModel): +class DecisionTreeModel(JavaModel, MLWritable, MLReadable): """Abstraction for Decision Tree models. --- End diff -- @Wenpei Please check whether the peer Scala implementation supports ```save/load```. Some algorithms such as ```DecisionTree``` do not support it currently. You should also add a doctest that verifies the correctness of your modification.
[GitHub] spark pull request: [SPARK-13037][ML][PySpark] PySpark ml.recommen...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11044#discussion_r51690764 --- Diff: python/pyspark/ml/recommendation.py --- @@ -81,6 +82,23 @@ class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol, Ha Row(user=1, item=0, prediction=2.6258413791656494) >>> predictions[2] Row(user=2, item=0, prediction=-1.5018409490585327) +>>> import os, tempfile +>>> path = tempfile.mkdtemp() +>>> ALS_path = path + "/als" +>>> als.save(ALS_path) +>>> als2 = ALS.load(ALS_path) +>>> als.getMaxIter() +5 +>>> model_path = path + "/als_model" +>>> model.save(model_path) +>>> model2 = ALSModel.load(model_path) +>>> model.rank == model2.rank +True --- End diff -- Can we also add a test for ```userFactors``` or ```itemFactors```?
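The save/load roundtrip pattern in the ALS doctest above (make a temp directory, save, load, compare fields, including the extra ```userFactors``` check suggested in the comment) can be shown without Spark at all. This is a library-free sketch with a hypothetical ```TinyModel```, not the real ```pyspark.ml``` API:

```python
import json
import os
import shutil
import tempfile

class TinyModel:
    """Toy stand-in for a fitted model with persistable state."""
    def __init__(self, rank, userFactors):
        self.rank = rank
        self.userFactors = userFactors

    def save(self, path):
        os.makedirs(path)
        with open(os.path.join(path, "model.json"), "w") as f:
            json.dump({"rank": self.rank, "userFactors": self.userFactors}, f)

    @classmethod
    def load(cls, path):
        with open(os.path.join(path, "model.json")) as f:
            d = json.load(f)
        return cls(d["rank"], d["userFactors"])

path = tempfile.mkdtemp()
model_path = path + "/als_model"
model = TinyModel(rank=10, userFactors=[[0.1, 0.2], [0.3, 0.4]])
model.save(model_path)
model2 = TinyModel.load(model_path)

assert model.rank == model2.rank
# The extra comparison requested in the review comment:
assert model.userFactors == model2.userFactors
shutil.rmtree(path)
```

Comparing only ```rank``` would miss a broken serialization of the factor matrices, which is why the review asks for the additional check.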
[GitHub] spark pull request: [SPARK-13037][ML][PySpark] PySpark ml.recommen...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11044#discussion_r51690527 --- Diff: python/pyspark/ml/recommendation.py --- @@ -81,6 +82,23 @@ class ALS(JavaEstimator, HasCheckpointInterval, HasMaxIter, HasPredictionCol, Ha Row(user=1, item=0, prediction=2.6258413791656494) >>> predictions[2] Row(user=2, item=0, prediction=-1.5018409490585327) +>>> import os, tempfile +>>> path = tempfile.mkdtemp() +>>> ALS_path = path + "/als" --- End diff -- nit: ```ALS_path``` -> ```als_path```
[GitHub] spark pull request: [SPARK-13035] [ML] [PySpark] PySpark ml.cluste...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/10999 [SPARK-13035] [ML] [PySpark] PySpark ml.clustering support export/import PySpark ml.clustering support export/import. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-13035 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10999.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10999 commit dffafbf67782d4ee16524ad7a6edb4090e26c786 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-01-31T09:41:15Z PySpark ml.clustering support export/import
[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11000#issuecomment-178575141 @Wenpei It looks like ```_transfer_params_from_java``` does not consider params which do not have default values, and we should handle them. Would you mind creating a JIRA to track this issue?
[GitHub] spark pull request: [SPARK-13153][PySpark] ML persistence failed w...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11043#discussion_r51674846 --- Diff: python/pyspark/ml/wrapper.py --- @@ -79,8 +79,9 @@ def _transfer_params_from_java(self): for param in self.params: if self._java_obj.hasParam(param.name): java_param = self._java_obj.getParam(param.name) -value = _java2py(sc, self._java_obj.getOrDefault(java_param)) -self._paramMap[param] = value +if self._java_obj.hasDefault(java_param): --- End diff -- We should check by ```isDefined``` rather than ```hasDefault``` here.
[GitHub] spark pull request: [SPARK-13153][PySpark] ML persistence failed w...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11043#issuecomment-178983977 ping @mengxr @jkbradley Could you add @Wenpei to the whitelist? This is an obvious bug and we should fix it.
[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11000#issuecomment-178966434 We should not make all parameters have default values, because some params have no default value on purpose. I think we should modify ```_transfer_params_from_java``` so that it does not fetch the params which do not have default values.
[GitHub] spark pull request: [SPARK-13047][PYSPARK][ML] Pyspark Params.hasP...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10962#discussion_r51075753 --- Diff: python/pyspark/ml/param/__init__.py --- @@ -152,13 +152,17 @@ def isDefined(self, param): return self.isSet(param) or self.hasDefault(param) @since("1.4.0") -def hasParam(self, paramName): +def hasParam(self, param): """ -Tests whether this instance contains a param with a given -(string) name. +Tests whether this instance contains a param. """ -param = self._resolveParam(paramName) -return param in self.params +if isinstance(param, Param): +return hasattr(self, param.name) --- End diff -- If we support ```param``` of type ```Param```, we should not only check ```hasattr(self, param.name)``` but also check ```self.uid == param.parent```. You can directly call ```_shouldOwn``` to do this work. It means that if you provide a ```Param``` to check whether it belongs to the instance, you should check both ```uid``` and ```name```. I vote to only support ```paramName```, which keeps the semantics consistent with Scala.
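The two options weighed above can be contrasted in a small standalone sketch (hypothetical classes, not the real ```pyspark.ml.param``` module): an ownership check for a ```Param``` object must compare both the owner's ```uid``` and the ```name```, whereas a string-only ```hasParam``` sidesteps the issue and matches the Scala API.

```python
class Param:
    """Toy param carrying the uid of its owning instance."""
    def __init__(self, parent, name):
        self.parent = parent  # uid of the owning instance
        self.name = name

class HasParams:
    def __init__(self, uid):
        self.uid = uid

    def _shouldOwn(self, param):
        # A Param belongs to this instance only if BOTH uid and name match.
        return param.parent == self.uid and hasattr(self, param.name)

    def hasParam(self, paramName):
        # String-only variant, consistent with the Scala semantics.
        return isinstance(paramName, str) and hasattr(self, paramName)

m = HasParams("lr_1234")
m.maxIter = Param("lr_1234", "maxIter")

assert m.hasParam("maxIter")
assert m._shouldOwn(m.maxIter)
# Same name, different owner: checking only the name would wrongly accept it.
assert not m._shouldOwn(Param("other_uid", "maxIter"))
```

The last assertion is the trap: two estimators both have a ```maxIter``` param, so a name-only check cannot tell them apart.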
[GitHub] spark pull request: [SPARK-12962] [SQL] [PySpark] PySpark support ...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10876#discussion_r51226739 --- Diff: python/pyspark/sql/functions.py --- @@ -263,6 +263,38 @@ def corr(col1, col2): return Column(sc._jvm.functions.corr(_to_java_column(col1), _to_java_column(col2))) +@since(2.0) +def covar_pop(col1, col2): +"""Returns a new :class:`Column` for the population covariance of ``col1`` +and ``col2``. + +>>> a = [x * x - 2 * x + 3.5 for x in range(20)] +>>> b = range(20) +>>> df = sqlContext.createDataFrame(zip(a, b), ["a", "b"]) +>>> covDf = df.agg(covar_pop("a", "b").alias('c')) +>>> covDf.selectExpr('abs(c - 565.25) < 1e-16 as t').collect() --- End diff -- @davies OK, I will clean them up.
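The expected value 565.25 in the ```covar_pop``` doctest above can be verified by hand with a plain-Python population covariance, no Spark required:

```python
def covar_pop(xs, ys):
    """Population covariance: mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Same data as the doctest in the diff.
a = [x * x - 2 * x + 3.5 for x in range(20)]
b = list(range(20))
print(covar_pop(a, b))  # → 565.25
```

All intermediate values here are quarter-integers, so the result is exact in binary floating point, which is why the doctest can use such a tight tolerance.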
[GitHub] spark pull request: [Minor] [ML] [PySpark] Cleanup test cases of c...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/10975 [Minor] [ML] [PySpark] Cleanup test cases of clustering.py You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark clustering-cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10975.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10975 commit 0e749a954c27030099809ddc75e7c33fdecf6021 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-01-29T05:29:34Z Cleanup clustering.py
[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11000#issuecomment-188217429 LGTM
[GitHub] spark pull request: [Minor] [ML] [Doc] Cleanup dots at the end of ...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11344#issuecomment-188213659 @srowen Some ScalaDoc will end with two dots if we don't fix it; you can refer [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367).
[GitHub] spark pull request: [Minor] [ML] [Doc] Cleanup dots at the end of ...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11344#issuecomment-188281271 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53911268 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala --- @@ -0,0 +1,499 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.regression + +import scala.util.Random + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.ml.util.MLTestingUtils +import org.apache.spark.mllib.classification.LogisticRegressionSuite._ +import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.random._ +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.{DataFrame, Row} + +class GeneralizedLinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext { + + private val seed: Int = 42 + @transient var datasetGaussianIdentity: DataFrame = _ + @transient var datasetGaussianLog: DataFrame = _ + @transient var datasetGaussianInverse: DataFrame = _ + @transient var datasetBinomial: DataFrame = _ + @transient var datasetPoissonLog: DataFrame = _ + @transient var datasetPoissonIdentity: DataFrame = _ + @transient var datasetPoissonSqrt: DataFrame = _ + @transient var datasetGammaInverse: DataFrame = _ + @transient var datasetGammaIdentity: DataFrame = _ + @transient var datasetGammaLog: DataFrame = _ + + override def beforeAll(): Unit = { +super.beforeAll() + +import GeneralizedLinearRegressionSuite._ + +datasetGaussianIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "identity"), 2)) + +datasetGaussianLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "log"), 2)) + +datasetGaussianInverse = sqlContext.createDataFrame( + 
sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "gaussian", link = "inverse"), 2)) + +datasetBinomial = { + val nPoints = 1 + val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 2.688191) + val xMean = Array(5.843, 3.057, 3.758, 1.199) + val xVariance = Array(0.6856, 0.1899, 3.116, 0.581) + + val testData = +generateMultinomialLogisticInput(coefficients, xMean, xVariance, true, nPoints, seed) + + sqlContext.createDataFrame(sc.parallelize(testData, 4)) +} + +datasetPoissonLog = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family = "poisson", link = "log"), 2)) + +datasetPoissonIdentity = sqlContext.createDataFrame( + sc.parallelize(generateGeneralizedLinearRegressionInput( +intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = Array(2.9, 10.5), +xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01, +family
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53911596 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,565 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Default is "gaussian". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"The name of family which is a description of the error distribution to be used in the " + + "model. Supported options: gaussian(default), binomial, poisson and gamma.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of link function which provides the relationship + * between the linear predictor and the mean of the distribution function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "The name of link function " + +"which provides the relationship between the linear predictor and the mean of the " + +"distribution function. 
Supported options: identity, log, inverse, logit, probit, " + +"cloglog and sqrt.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + import GeneralizedLinearRegression._ + protected lazy val familyObj = Family.fromName($(family)) + protected lazy val linkObj = if (isDefined(link)) { +Link.fromName($(link)) + } else { +familyObj.defaultLink + } + protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj) + + @Since("2.0.0") + override def validateParams(): Unit = { +if ($(solver) == "irls") { + setDefault(maxIter -> 25) +} +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains( +familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " + +s"does not support ${$(link)} link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalize
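The quoted `validateParams` rejects unsupported family/link pairs via `supportedFamilyAndLinkParis.contains(familyObj -> linkObj)`. A minimal pure-Python sketch of that check is below; the exact mapping of families to admissible links is an assumption here (it follows common GLM conventions, not the quoted patch), and all names are illustrative.

```python
# Hypothetical sketch of the family/link validation in
# GeneralizedLinearRegressionBase.validateParams: each error-distribution
# family only admits certain link functions, and an unsupported pair is
# rejected up front. The mapping below is assumed, not taken from the patch.
SUPPORTED_FAMILY_AND_LINKS = {
    "gaussian": {"identity", "log", "inverse"},
    "binomial": {"logit", "probit", "cloglog"},
    "poisson": {"log", "identity", "sqrt"},
    "gamma": {"inverse", "identity", "log"},
}

def validate_family_and_link(family, link):
    """Raise ValueError when the (family, link) pair is unsupported."""
    links = SUPPORTED_FAMILY_AND_LINKS.get(family)
    if links is None:
        raise ValueError("Unsupported family: %s" % family)
    # A missing link falls back to the family's default, so None is fine.
    if link is not None and link not in links:
        raise ValueError(
            "Generalized Linear Regression with %s family "
            "does not support %s link function." % (family, link))
```

Validating eagerly in `validateParams` (rather than at fit time) surfaces a bad family/link combination before any data is touched.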
[GitHub] spark pull request: [] [] [] Clean up sharedParams
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/11344 [] [] [] Clean up sharedParams ## What changes were proposed in this pull request? Remove duplicated dot at the end of some sharedParams in ScalaDoc. cc @mengxr @srowen ## How was this patch tested? Documents change, no test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark shared-cleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11344.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11344 commit a12b1ac390fe6cb15386cdc6e052a1e28c22e992 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-02-24T10:09:50Z Clean up sharedParams --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53909986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,565 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Default is "gaussian". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"The name of family which is a description of the error distribution to be used in the " + + "model. Supported options: gaussian(default), binomial, poisson and gamma.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of link function which provides the relationship + * between the linear predictor and the mean of the distribution function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "The name of link function " + +"which provides the relationship between the linear predictor and the mean of the " + +"distribution function. 
Supported options: identity, log, inverse, logit, probit, " + +"cloglog and sqrt.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + import GeneralizedLinearRegression._ + protected lazy val familyObj = Family.fromName($(family)) + protected lazy val linkObj = if (isDefined(link)) { +Link.fromName($(link)) + } else { +familyObj.defaultLink + } + protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj) + + @Since("2.0.0") + override def validateParams(): Unit = { +if ($(solver) == "irls") { + setDefault(maxIter -> 25) +} +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains( +familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " + +s"does not support ${$(link)} link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalize
[GitHub] spark pull request: [SPARK-7106][MLlib][PySpark] Support model sav...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11321#discussion_r53908665 --- Diff: python/pyspark/mllib/fpm.py --- @@ -40,6 +41,11 @@ class FPGrowthModel(JavaModelWrapper): >>> model = FPGrowth.train(rdd, 0.6, 2) >>> sorted(model.freqItemsets().collect()) [FreqItemset(items=[u'a'], freq=4), FreqItemset(items=[u'c'], freq=3), ... +>>> model_path = temp_path + "/fpg_model" --- End diff -- ```/fpm``` is enough, because we only support model save/load under the old MLlib API.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53909458 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,565 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Default is "gaussian". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"The name of family which is a description of the error distribution to be used in the " + + "model. Supported options: gaussian(default), binomial, poisson and gamma.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of link function which provides the relationship + * between the linear predictor and the mean of the distribution function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "The name of link function " + +"which provides the relationship between the linear predictor and the mean of the " + +"distribution function. 
Supported options: identity, log, inverse, logit, probit, " + +"cloglog and sqrt.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + import GeneralizedLinearRegression._ + protected lazy val familyObj = Family.fromName($(family)) + protected lazy val linkObj = if (isDefined(link)) { +Link.fromName($(link)) + } else { +familyObj.defaultLink + } + protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj) + + @Since("2.0.0") + override def validateParams(): Unit = { +if ($(solver) == "irls") { + setDefault(maxIter -> 25) +} +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains( +familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " + +s"does not support ${$(link)} link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalize
[GitHub] spark pull request: [SPARK-13504] [SparkR] Add approxQuantile for ...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/11383 [SPARK-13504] [SparkR] Add approxQuantile for SparkR ## What changes were proposed in this pull request? Add ```approxQuantile``` for SparkR. ## How was this patch tested? unit tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-13504 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11383.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11383 commit 4f17adb6b18d53aef08233248d0b8bf0f294ddf4 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-02-26T03:53:46Z Add approxQuantile for SparkR
[GitHub] spark pull request: [SPARK-13036][SPARK-13318][SPARK-13319] Add sa...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11203#discussion_r54218770 --- Diff: python/pyspark/ml/feature.py --- @@ -1330,6 +1448,21 @@ class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid): >>> sorted(set([(i[0], str(i[1])) for i in itd.select(itd.id, itd.label2).collect()]), ... key=lambda x: x[0]) [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')] +>>> stringIndexerPath = temp_path + "/string-indexer" +>>> stringIndexer.save(stringIndexerPath) +>>> loadedIndexerModel = StringIndexer.load(stringIndexerPath).fit(stringIndDf) --- End diff -- There is no need to train here. Like the others, check ```StringIndexer``` param equality, then save the model trained at L1323, load it back, and check model equality.
[GitHub] spark pull request: [SPARK-13036][SPARK-13318][SPARK-13319] Add sa...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11203#discussion_r54217956 --- Diff: python/pyspark/ml/feature.py --- @@ -443,6 +477,12 @@ class HashingTF(JavaTransformer, HasInputCol, HasOutputCol, HasNumFeatures): >>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"} >>> hashingTF.transform(df, params).head().vector SparseVector(5, {2: 1.0, 3: 1.0, 4: 1.0}) +>>> hashingTFPath = temp_path + "/hashing-tf" +>>> hashingTF.save(hashingTFPath) +>>> loadedHashingTF = HashingTF.load(hashingTFPath) +>>> param = loadedHashingTF.getParam("numFeatures") +>>> loadedHashingTF.getOrDefault(param) == hashingTF.getOrDefault(param) +True --- End diff -- Could you use ```getNumFeatures``` like the other transformers in the doctest? It will make your test cleaner. ```HashingTF``` extends ```HasNumFeatures```, so it has this method.
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r54206474 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -157,6 +157,12 @@ private[ml] class WeightedLeastSquares( private[ml] object WeightedLeastSquares { /** + * In order to take the normal equation approach efficiently, [[WeightedLeastSquares]] + * only supports the number of features is no more than 4096. + */ + val MaxNumFeatures: Int = 4096 --- End diff -- OK, I will update it to ```MAX_NUM_FEATURES``` after collecting more comments. Thanks!
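The 4096-feature cap discussed above exists because the normal-equation approach materializes (at least) the upper triangle of the k-by-k Gram matrix on a single machine. A back-of-envelope sketch below shows why a hard limit is reasonable; the figures are illustrative, not taken from the patch.

```python
# Illustrative memory cost of a normal-equation solve in a
# WeightedLeastSquares-style solver: the upper triangle of the
# k-by-k Gram matrix X^T X is stored densely as 8-byte doubles.
def gram_triangle_bytes(num_features):
    """Bytes needed for the packed upper triangle of a k-by-k matrix."""
    entries = num_features * (num_features + 1) // 2
    return entries * 8  # 8 bytes per double

# At the 4096-feature cap this is roughly 64 MiB -- easily held on one node.
print(gram_triangle_bytes(4096) / 2**20)
# At 100k features it would be tens of GiB, hence a hard limit.
print(gram_triangle_bytes(100_000) / 2**30)
```

Beyond the cap, a gradient-based solver (e.g. L-BFGS), which never forms the Gram matrix, is the usual alternative.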
[GitHub] spark pull request: [SPARK-13545] [MLlib] [PySpark] Make MLlib LR'...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/11424 [SPARK-13545] [MLlib] [PySpark] Make MLlib LR's default parameters consistent in Scala and Python ## What changes were proposed in this pull request? Make MLlib LR's default parameters consistent in Scala and Python. ## How was this patch tested? No new tests, should pass current tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-13545 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11424.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11424 commit fc370c0c3fba12af42551d4d71043cb54e3fde71 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-02-29T05:32:58Z Make MLlib LR's default parameters consistent in Scala and Python
[GitHub] spark pull request: [SPARK-13322] [ML] AFTSurvivalRegression suppo...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/11365 [SPARK-13322] [ML] AFTSurvivalRegression supports feature standardization ## What changes were proposed in this pull request? AFTSurvivalRegression should support feature standardization, it will improve the convergence rate. ## How was this patch tested? unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-13322 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11365.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11365 commit 0e5efab562d4174908fab5d9d9a788c95fb183e0 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-02-25T06:27:52Z AFTSurvivalRegression supports feature standardization commit ae28544d9141c8ddbad57ee79873125edbb115ec Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-02-25T07:57:42Z add test case: numerical stability of standardization
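The standardize-then-unscale pattern this PR applies is common to several Spark ML regressors: optimize over features divided by their column standard deviation (which conditions the problem and speeds convergence), then map the learned coefficients back to the original scale. A minimal pure-Python sketch, with no Spark and illustrative names only:

```python
# Minimal sketch (not the patch itself) of feature standardization for
# an iterative regression solver: scale each column by its standard
# deviation before optimizing, then unscale the solution afterwards.
import math

def column_std(rows):
    """Population standard deviation of each column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [math.sqrt(sum((x - m) ** 2 for x in col) / n)
            for col, m in zip(zip(*rows), means)]

def standardize(rows, stds):
    """Divide each feature by its column std (constant columns map to 0)."""
    return [[x / s if s != 0.0 else 0.0 for x, s in zip(row, stds)]
            for row in rows]

def unscale(coefficients_std_space, stds):
    """A coefficient learned on x/s corresponds to coef/s on raw x."""
    return [c / s if s != 0.0 else 0.0
            for c, s in zip(coefficients_std_space, stds)]
```

Because only the feature scale changes, the fitted model is mathematically equivalent; the payoff is a better-conditioned objective for the optimizer.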
[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...
Github user yanboliang closed the pull request at: https://github.com/apache/spark/pull/11247
[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11247#issuecomment-188668664 @dbtsai I think you convinced me, and I have also checked the R glmnet implementation. The current behavior may make more sense, so I will close this PR. Thanks for your kind clarification!
[GitHub] spark pull request: [Minor] [ML] [Doc] Cleanup dots at the end of ...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11344#issuecomment-188778682 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-13490] [ML] ML LinearRegression should ...
GitHub user yanboliang opened a pull request: https://github.com/apache/spark/pull/11367 [SPARK-13490] [ML] ML LinearRegression should cache standardization param value ## What changes were proposed in this pull request? Like [SPARK-13132](https://issues.apache.org/jira/browse/SPARK-13132) for LogisticRegression, when LinearRegression is fit with L1 regularization, the inner functor passed to the quasi-newton optimizer in ```org.apache.spark.ml.regression.LinearRegression#train``` makes repeated calls to ```$(standardization)```. This ultimately involves repeated string interpolation triggered by ```org.apache.spark.ml.param.Param#hashCode```. We should cache the value of ```standardization``` rather than re-fetching it from the ParamMap on every iteration. ## How was this patch tested? No extra tests are added. It should pass all existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/yanboliang/spark spark-13490 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11367.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11367 commit f9c79b136ac41bd91f482a4f948acb780f493516 Author: Yanbo Liang <yblia...@gmail.com> Date: 2016-02-25T09:35:57Z ML LinearRegression should cache standardization param value
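The fix described above hoists the parameter lookup out of the per-iteration objective closure so it runs once instead of once per optimizer step. A pure-Python illustration of the pattern (no Spark; the counter stands in for the costly `Param#hashCode` string interpolation, and all names are illustrative):

```python
# Illustration of caching an expensive parameter lookup outside an
# optimizer's objective closure. The counter models the repeated
# hash/interpolation cost of fetching a param from a ParamMap.
lookups = {"count": 0}

def get_param(param_map, name):
    lookups["count"] += 1  # stand-in for the costly lookup path
    return param_map[name]

def make_objective_uncached(param_map):
    def objective(w):
        # BAD: re-fetches the param on every call.
        penalty = 1.0 if get_param(param_map, "standardization") else 0.5
        return penalty * w * w
    return objective

def make_objective_cached(param_map):
    # GOOD: fetched once, captured by the closure.
    standardization = get_param(param_map, "standardization")
    penalty = 1.0 if standardization else 0.5
    def objective(w):
        return penalty * w * w
    return objective
```

Since a quasi-Newton optimizer may evaluate the objective hundreds of times, hoisting the lookup turns an O(iterations) cost into O(1) without changing the result.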
[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10242#discussion_r54395410 --- Diff: python/pyspark/ml/clustering.py --- @@ -291,6 +292,317 @@ def _create_model(self, java_model): return BisectingKMeansModel(java_model) +class LDAModel(JavaModel): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. +""" + +@since("2.0.0") +def isDistributed(self): +"""Indicates whether this instance is of type DistributedLDAModel""" +return self._call_java("isDistributed") + +@since("2.0.0") +def vocabSize(self): +"""Vocabulary size (number of terms or terms in the vocabulary)""" +return self._call_java("vocabSize") + +@since("2.0.0") +def topicsMatrix(self): +"""Inferred topics, where each topic is represented by a distribution over terms.""" +return self._call_java("topicsMatrix") + +@since("2.0.0") +def logLikelihood(self, dataset): +"""Calculates a lower bound on the log likelihood of the entire corpus.""" +return self._call_java("logLikelihood", dataset) + +@since("2.0.0") +def logPerplexity(self, dataset): +"""Calculate an upper bound bound on perplexity. (Lower is better.)""" +return self._call_java("logPerplexity", dataset) + +@since("2.0.0") +def describeTopics(self, maxTermsPerTopic=10): +"""Return the topics described by weighted terms. + +WARNING: If vocabSize and k are large, this can return a large object! + +:param maxTermsPerTopic: Maximum number of terms to collect for each topic. +(default: vocabulary size) +:return: Array over topics. 
Each topic is represented as a pair of matching arrays: +(term indices, term weights in topic). +Each topic's terms are sorted in order of decreasing weight. +""" +return self._call_java("describeTopics", maxTermsPerTopic) + + +class DistributedLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 2.0.0 +""" +def toLocal(self): +return self._call_java("toLocal") + + +class LocalLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 2.0.0 +""" +pass + + +class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. + +>>> from pyspark.mllib.linalg import Vectors, SparseVector +>>> from pyspark.ml.clustering import LDA +>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \ +[2, SparseVector(2, {0: 1.0})],], ["id", "features"]) +>>> lda = LDA(k=2, seed=1, optimizer="em") +>>> model = lda.fit(df) +>>> model.isDistributed() +True +>>> localModel = model.toLocal() +>>> localModel.isDistributed() +False +>>> model.vocabSize() +2 +>>> model.describeTopics().show() ++-+---++ +|topic|termIndices| termWeights| +
[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10242#discussion_r54394942 --- Diff: python/pyspark/ml/clustering.py --- @@ -291,6 +292,317 @@ def _create_model(self, java_model): return BisectingKMeansModel(java_model) +class LDAModel(JavaModel): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. +""" + +@since("2.0.0") +def isDistributed(self): +"""Indicates whether this instance is of type DistributedLDAModel""" +return self._call_java("isDistributed") + +@since("2.0.0") +def vocabSize(self): +"""Vocabulary size (number of terms or terms in the vocabulary)""" +return self._call_java("vocabSize") + +@since("2.0.0") +def topicsMatrix(self): +"""Inferred topics, where each topic is represented by a distribution over terms.""" +return self._call_java("topicsMatrix") + +@since("2.0.0") +def logLikelihood(self, dataset): +"""Calculates a lower bound on the log likelihood of the entire corpus.""" +return self._call_java("logLikelihood", dataset) + +@since("2.0.0") +def logPerplexity(self, dataset): +"""Calculate an upper bound bound on perplexity. (Lower is better.)""" +return self._call_java("logPerplexity", dataset) + +@since("2.0.0") +def describeTopics(self, maxTermsPerTopic=10): +"""Return the topics described by weighted terms. + +WARNING: If vocabSize and k are large, this can return a large object! + +:param maxTermsPerTopic: Maximum number of terms to collect for each topic. +(default: vocabulary size) +:return: Array over topics. 
Each topic is represented as a pair of matching arrays: +(term indices, term weights in topic). +Each topic's terms are sorted in order of decreasing weight. +""" +return self._call_java("describeTopics", maxTermsPerTopic) + + +class DistributedLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 2.0.0 +""" +def toLocal(self): +return self._call_java("toLocal") + + +class LocalLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 2.0.0 +""" +pass + + +class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. + +>>> from pyspark.mllib.linalg import Vectors, SparseVector +>>> from pyspark.ml.clustering import LDA +>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \ +[2, SparseVector(2, {0: 1.0})],], ["id", "features"]) +>>> lda = LDA(k=2, seed=1, optimizer="em") +>>> model = lda.fit(df) +>>> model.isDistributed() +True +>>> localModel = model.toLocal() +>>> localModel.isDistributed() +False +>>> model.vocabSize() +2 +>>> model.describeTopics().show() ++-+---++ +|topic|termIndices| termWeights| +
[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10242#discussion_r54395971 --- Diff: python/pyspark/ml/clustering.py --- @@ -291,6 +292,317 @@ def _create_model(self, java_model): return BisectingKMeansModel(java_model) +class LDAModel(JavaModel): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. +""" --- End diff -- ```.. versionadded:: 2.0.0``` for ```LDAModel```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
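The `describeTopics` return shape discussed in the diff above (per-topic term indices paired with term weights, sorted by decreasing weight) can be sketched in plain Python. This is a hypothetical stand-alone helper for illustration, not the pyspark implementation, which delegates to the JVM via `_call_java`:

```python
def describe_topics(topics_matrix, max_terms_per_topic=10):
    """Given a vocabSize-by-k topic-term matrix (one row per term),
    return per-topic (termIndices, termWeights) pairs, each sorted
    by decreasing weight -- the shape LDAModel.describeTopics returns."""
    num_topics = len(topics_matrix[0])
    result = []
    for topic in range(num_topics):
        # Collect (weight, term index) pairs from this topic's column.
        weighted = [(row[topic], term) for term, row in enumerate(topics_matrix)]
        weighted.sort(key=lambda wt: -wt[0])
        top = weighted[:max_terms_per_topic]
        result.append(([t for _, t in top], [w for w, _ in top]))
    return result

# vocabSize = 3 terms, k = 2 topics
matrix = [[0.5, 0.1],
          [0.3, 0.2],
          [0.2, 0.7]]
print(describe_topics(matrix, 2))
# -> [([0, 1], [0.5, 0.3]), ([2, 1], [0.7, 0.2])]
```

The WARNING in the docstring follows from this shape: the result holds up to `k * maxTermsPerTopic` (index, weight) pairs, all collected to the driver.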
[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10242#discussion_r54394622 --- Diff: python/pyspark/ml/clustering.py --- @@ -167,6 +167,200 @@ def getInitSteps(self): return self.getOrDefault(self.initSteps) +class LDAModel(JavaModel): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. +""" + +@since("1.7.0") +def isDistributed(self): +"""Indicates whether this instance is of type DistributedLDAModel""" +return self._call_java("isDistributed") + +@since("1.7.0") +def vocabSize(self): +"""Vocabulary size (number of terms or terms in the vocabulary)""" +return self._call_java("vocabSize") + +@since("1.7.0") +def topicsMatrix(self): +"""Inferred topics, where each topic is represented by a distribution over terms.""" +return self._call_java("topicsMatrix") + +@since("1.7.0") +def logLikelihood(self, dataset): +"""Calculates a lower bound on the log likelihood of the entire corpus.""" +return self._call_java("logLikelihood", dataset) + +@since("1.7.0") +def logPerplexity(self, dataset): +"""Calculate an upper bound bound on perplexity. (Lower is better.)""" +return self._call_java("logPerplexity", dataset) + +@since("1.7.0") +def describeTopics(self, maxTermsPerTopic=10): +"""Return the topics described by weighted terms. + +WARNING: If vocabSize and k are large, this can return a large object! + +:param maxTermsPerTopic: Maximum number of terms to collect for each topic. +(default: vocabulary size) +:return: Array over topics. Each topic is represented as a pair of matching arrays: +(term indices, term weights in topic). 
+Each topic's terms are sorted in order of decreasing weight. +""" +return self._call_java("describeTopics", maxTermsPerTopic) + + +class DistributedLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 1.7.0 +""" +def toLocal(self): +return self._call_java("toLocal") + + +class LocalLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 1.7.0 +""" +pass + + +class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. + +>>> from pyspark.mllib.linalg import Vectors, SparseVector +>>> from pyspark.ml.clustering import LDA +>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \ --- End diff -- Here we usually make the next line start with ```...```, you can refer [here](https://github.com/apache/spark/blob/master/python/pyspark/ml/clustering.py#L58).
[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10242#discussion_r54394752 --- Diff: python/pyspark/ml/clustering.py --- @@ -291,6 +292,317 @@ def _create_model(self, java_model): return BisectingKMeansModel(java_model) +class LDAModel(JavaModel): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. +""" + +@since("2.0.0") +def isDistributed(self): +"""Indicates whether this instance is of type DistributedLDAModel""" +return self._call_java("isDistributed") + +@since("2.0.0") +def vocabSize(self): +"""Vocabulary size (number of terms or terms in the vocabulary)""" +return self._call_java("vocabSize") + +@since("2.0.0") +def topicsMatrix(self): +"""Inferred topics, where each topic is represented by a distribution over terms.""" +return self._call_java("topicsMatrix") + +@since("2.0.0") +def logLikelihood(self, dataset): +"""Calculates a lower bound on the log likelihood of the entire corpus.""" +return self._call_java("logLikelihood", dataset) + +@since("2.0.0") +def logPerplexity(self, dataset): +"""Calculate an upper bound bound on perplexity. (Lower is better.)""" +return self._call_java("logPerplexity", dataset) + +@since("2.0.0") +def describeTopics(self, maxTermsPerTopic=10): +"""Return the topics described by weighted terms. + +WARNING: If vocabSize and k are large, this can return a large object! + +:param maxTermsPerTopic: Maximum number of terms to collect for each topic. +(default: vocabulary size) +:return: Array over topics. 
Each topic is represented as a pair of matching arrays: +(term indices, term weights in topic). +Each topic's terms are sorted in order of decreasing weight. +""" +return self._call_java("describeTopics", maxTermsPerTopic) + + +class DistributedLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 2.0.0 +""" +def toLocal(self): +return self._call_java("toLocal") + + +class LocalLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 2.0.0 +""" +pass + + +class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. + +>>> from pyspark.mllib.linalg import Vectors, SparseVector +>>> from pyspark.ml.clustering import LDA +>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \ +[2, SparseVector(2, {0: 1.0})],], ["id", "features"]) +>>> lda = LDA(k=2, seed=1, optimizer="em") +>>> model = lda.fit(df) +>>> model.isDistributed() +True +>>> localModel = model.toLocal() +>>> localModel.isDistributed() +False +>>> model.vocabSize() +2 +>>> model.describeTopics().show() ++-+---++ +|topic|termIndices| termWeights| ++-+--
[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10242#discussion_r54395661 --- Diff: python/pyspark/ml/clustering.py --- @@ -291,6 +292,317 @@ def _create_model(self, java_model): return BisectingKMeansModel(java_model) +class LDAModel(JavaModel): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. +""" + +@since("2.0.0") +def isDistributed(self): +"""Indicates whether this instance is of type DistributedLDAModel""" +return self._call_java("isDistributed") + +@since("2.0.0") +def vocabSize(self): +"""Vocabulary size (number of terms or terms in the vocabulary)""" +return self._call_java("vocabSize") + +@since("2.0.0") +def topicsMatrix(self): +"""Inferred topics, where each topic is represented by a distribution over terms.""" +return self._call_java("topicsMatrix") + +@since("2.0.0") +def logLikelihood(self, dataset): +"""Calculates a lower bound on the log likelihood of the entire corpus.""" +return self._call_java("logLikelihood", dataset) + +@since("2.0.0") +def logPerplexity(self, dataset): +"""Calculate an upper bound bound on perplexity. (Lower is better.)""" +return self._call_java("logPerplexity", dataset) + +@since("2.0.0") +def describeTopics(self, maxTermsPerTopic=10): +"""Return the topics described by weighted terms. + +WARNING: If vocabSize and k are large, this can return a large object! + +:param maxTermsPerTopic: Maximum number of terms to collect for each topic. +(default: vocabulary size) +:return: Array over topics. 
Each topic is represented as a pair of matching arrays: +(term indices, term weights in topic). +Each topic's terms are sorted in order of decreasing weight. +""" +return self._call_java("describeTopics", maxTermsPerTopic) + + +class DistributedLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 2.0.0 +""" +def toLocal(self): +return self._call_java("toLocal") + + +class LocalLDAModel(LDAModel): +""" +Model fitted by LDA. + +.. versionadded:: 2.0.0 +""" +pass + + +class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval): +""" A clustering model derived from the LDA method. + +Latent Dirichlet Allocation (LDA), a topic model designed for text documents. +Terminology +- "word" = "term": an element of the vocabulary +- "token": instance of a term appearing in a document +- "topic": multinomial distribution over words representing some concept +References: +- Original LDA paper (journal version): +Blei, Ng, and Jordan. "Latent Dirichlet Allocation." JMLR, 2003. + +>>> from pyspark.mllib.linalg import Vectors, SparseVector +>>> from pyspark.ml.clustering import LDA +>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \ +[2, SparseVector(2, {0: 1.0})],], ["id", "features"]) +>>> lda = LDA(k=2, seed=1, optimizer="em") +>>> model = lda.fit(df) +>>> model.isDistributed() +True +>>> localModel = model.toLocal() +>>> localModel.isDistributed() +False +>>> model.vocabSize() +2 +>>> model.describeTopics().show() ++-+---++ +|topic|termIndices| termWeights| +
[GitHub] spark pull request: [SPARK-13506] [MLlib] Fix the wrong parameter ...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11387#issuecomment-189200629 LGTM cc @mengxr
[GitHub] spark pull request: [SPARK-13036][SPARK-13318][SPARK-13319] Add sa...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11203#issuecomment-189188086 @yinxusen Looks good overall, I left some inline comments. Thanks!
[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11000#discussion_r53635317 --- Diff: python/pyspark/ml/regression.py --- @@ -172,6 +172,16 @@ class IsotonicRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredicti 0.0 >>> model.boundaries DenseVector([0.0, 1.0]) +>>> ir_path = temp_path + "/ir" +>>> ir.save(ir_path) +>>> ir2 = IsotonicRegression.load(ir_path) +>>> ir2.getIsotonic() +True +>>> model_path = temp_path + "/ir_model" +>>> model.save(model_path) +>>> model2 = IsotonicRegressionModel.load(model_path) +>>> model.boundaries == model2.boundaries +True --- End diff -- We should also check equality of ```predictions```.
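The doctest above exercises `IsotonicRegressionModel.boundaries`; the fitting step that produces those boundaries is the pool-adjacent-violators algorithm (PAVA). A minimal unweighted sketch in plain Python -- not Spark's implementation, which also handles instance weights and runs distributed:

```python
def pava(y):
    """Pool Adjacent Violators: return the isotonic (non-decreasing)
    least-squares fit of y, as a list the same length as y."""
    # Each block holds [sum of values, count]; violating blocks merge.
    blocks = []
    for v in y:
        blocks.append([v, 1])
        # Merge while the last block's mean is below the previous one's
        # (compared cross-multiplied to avoid division).
        while len(blocks) > 1 and \
                blocks[-1][0] * blocks[-2][1] < blocks[-2][0] * blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit

print(pava([1.0, 3.0, 2.0, 4.0]))  # the 3.0/2.0 violation pools to 2.5
# -> [1.0, 2.5, 2.5, 4.0]
```

The distinct block means become the model's `predictions`, one per boundary, which is why the review asks that their equality be checked after a save/load round trip in addition to `boundaries`.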
[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11000#issuecomment-187221567 Looks good except minor issues.
[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11000#issuecomment-187222806 Jenkins, test this please.
[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11000#discussion_r53635702 --- Diff: python/pyspark/ml/regression.py --- @@ -690,6 +700,18 @@ class AFTSurvivalRegression(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredi | 0.0|(1,[],[])| 0.0| 1.0| +-+-+--+--+ ... +>>> aftsr_path = temp_path + "/aftsr" +>>> aftsr.save(aftsr_path) +>>> aftsr2 = AFTSurvivalRegression.load(aftsr_path) +>>> aftsr2.getMaxIter() +100 +>>> model_path = temp_path + "/aftsr_model" +>>> model.save(model_path) +>>> model2 = AFTSurvivalRegressionModel.load(model_path) +>>> model.coefficients == model2.coefficients +True +>>> model.intercept == model2.intercept +True --- End diff -- We should also check equality of model ```scale```.
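The review asks that the doctest also verify `scale`, because a save/load round trip should be checked against every fitted attribute, not a subset. A plain-Python sketch of that pattern with a hypothetical stand-in model class (persisted here with `pickle` rather than Spark's ML writer):

```python
import os
import pickle
import tempfile

class FakeAFTModel:
    """Stand-in for an AFT survival regression model's fitted state."""
    def __init__(self, coefficients, intercept, scale):
        self.coefficients = coefficients
        self.intercept = intercept
        self.scale = scale

model = FakeAFTModel(coefficients=[0.5, -1.2], intercept=2.7, scale=1.0)

path = os.path.join(tempfile.mkdtemp(), "aftsr_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    model2 = pickle.load(f)

# Check every fitted attribute survives the round trip -- including scale,
# the one the original doctest omitted.
assert model.coefficients == model2.coefficients
assert model.intercept == model2.intercept
assert model.scale == model2.scale
print("round trip ok")
```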
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/11136#discussion_r53913710 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala --- @@ -0,0 +1,565 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.regression + +import breeze.stats.distributions.{Gaussian => GD} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.PredictorParams +import org.apache.spark.ml.feature.Instance +import org.apache.spark.ml.optim._ +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared._ +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.mllib.linalg.{BLAS, Vector} +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.{DataFrame, Row} +import org.apache.spark.sql.functions._ + +/** + * Params for Generalized Linear Regression. 
+ */ +private[regression] trait GeneralizedLinearRegressionBase extends PredictorParams + with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol + with HasSolver with Logging { + + /** + * Param for the name of family which is a description of the error distribution + * to be used in the model. + * Supported options: "gaussian", "binomial", "poisson" and "gamma". + * Default is "gaussian". + * @group param + */ + @Since("2.0.0") + final val family: Param[String] = new Param(this, "family", +"The name of family which is a description of the error distribution to be used in the " + + "model. Supported options: gaussian(default), binomial, poisson and gamma.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getFamily: String = $(family) + + /** + * Param for the name of link function which provides the relationship + * between the linear predictor and the mean of the distribution function. + * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt". + * @group param + */ + @Since("2.0.0") + final val link: Param[String] = new Param(this, "link", "The name of link function " + +"which provides the relationship between the linear predictor and the mean of the " + +"distribution function. 
Supported options: identity, log, inverse, logit, probit, " + +"cloglog and sqrt.", + ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray)) + + /** @group getParam */ + @Since("2.0.0") + def getLink: String = $(link) + + import GeneralizedLinearRegression._ + protected lazy val familyObj = Family.fromName($(family)) + protected lazy val linkObj = if (isDefined(link)) { +Link.fromName($(link)) + } else { +familyObj.defaultLink + } + protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj) + + @Since("2.0.0") + override def validateParams(): Unit = { +if ($(solver) == "irls") { + setDefault(maxIter -> 25) +} +if (isDefined(link)) { + require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains( +familyObj -> linkObj), s"Generalized Linear Regression with ${$(family)} family " + +s"does not support ${$(link)} link function.") +} + } +} + +/** + * :: Experimental :: + * + * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalize
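The `validateParams` logic quoted above rejects unsupported family/link combinations and falls back to the family's canonical link when none is set. A hedged plain-Python sketch of that lookup; the pair table below mirrors the options listed in the param docs for this PR and is illustrative rather than Spark's exact `supportedFamilyAndLinkParis` table:

```python
# Links each family supports, per the param documentation above.
SUPPORTED = {
    "gaussian": {"identity", "log", "inverse"},
    "binomial": {"logit", "probit", "cloglog"},
    "poisson": {"log", "identity", "sqrt"},
    "gamma": {"inverse", "identity", "log"},
}
# Canonical (default) link per family.
DEFAULT_LINK = {"gaussian": "identity", "binomial": "logit",
                "poisson": "log", "gamma": "inverse"}

def resolve_link(family, link=None):
    """Return the link to use for `family`, raising on unsupported pairs."""
    if family not in SUPPORTED:
        raise ValueError("unknown family: %s" % family)
    if link is None:
        return DEFAULT_LINK[family]
    if link not in SUPPORTED[family]:
        raise ValueError("Generalized Linear Regression with %s family "
                         "does not support %s link function." % (family, link))
    return link

print(resolve_link("poisson"))            # default link -> log
print(resolve_link("binomial", "probit"))
```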
[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/11136#issuecomment-188165176 @mengxr This PR is ready for another pass. Thanks!
[GitHub] spark pull request: [SPARK-9835] [ML] IterativelyReweightedLeastSq...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10639#discussion_r50390322 --- Diff: mllib/src/main/scala/org/apache/spark/ml/glm/Families.scala --- @@ -0,0 +1,138 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.glm + +import org.apache.spark.rdd.RDD + +/** + * A description of the error distribution and link function to be used in the model. + * @param link a link function instance + */ +private[ml] abstract class Family(val link: Link) extends Serializable { --- End diff -- I think ```Families``` can be used by [SPARK-12811](https://issues.apache.org/jira/browse/SPARK-12811) which provide Estimator interface for GLMs, so I move it to a new folder named ```glm```. Here we have two ways to support GLMs: * Implement ```reweightFunc``` for each ```Family/Link``` directly based on mathematical formula. * Implement the ```Family``` framework like what I have done and a factory method which can output ```reweightFunc``` according to argument. The former one has better execution efficiency, the later one is more easy to understand. Looking forward your comments. 
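Both design options weighed above reduce to producing a reweight function per family/link pair: given the current fit, emit each point's IRLS working response and weight. A sketch of one concrete case (Poisson family, log link), deliberately simplified to a single no-intercept coefficient in plain Python; the function names are illustrative, not Spark's:

```python
import math

def poisson_log_reweight(y, x, beta):
    """Reweight func for Poisson family with log link: per-point
    (working response z, weight w) at the current coefficient beta."""
    out = []
    for yi, xi in zip(y, x):
        eta = beta * xi
        mu = math.exp(eta)
        z = eta + (yi - mu) / mu   # eta + (y - mu) * deta/dmu, deta/dmu = 1/mu
        w = mu                     # (dmu/deta)^2 / Var(mu) = mu^2 / mu
        out.append((z, w))
    return out

def irls_one_coef(y, x, iters=25):
    """IRLS for a no-intercept, single-feature Poisson GLM: each step is
    a weighted least-squares solve against the working responses."""
    beta = 0.0
    for _ in range(iters):
        zw = poisson_log_reweight(y, x, beta)
        num = sum(w * xi * z for (z, w), xi in zip(zw, x))
        den = sum(w * xi * xi for (z, w), xi in zip(zw, x))
        beta = num / den
    return beta

# Data generated from mu = exp(1.0 * x); the fit should recover beta near 1.0.
x = [0.0, 1.0, 2.0, 3.0]
y = [math.exp(v) for v in x]
print(irls_one_coef(y, x))
```

The "former" option in the comment would hand `poisson_log_reweight` itself to the solver; the "latter" wraps the `z`/`w` formulas in `Family`/`Link` objects and derives the function from them, trading a little dispatch overhead for readability.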