This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 513f76a  [SPARK-30934][ML][DOCS] Update ml-guide and ml-migration-guide for 3.0 release
513f76a is described below

commit 513f76ac38f174ca1adf6faff087b95f4d8f9750
Author: Huaxin Gao <huax...@us.ibm.com>
AuthorDate: Sat Mar 7 18:09:00 2020 -0600

    [SPARK-30934][ML][DOCS] Update ml-guide and ml-migration-guide for 3.0 release

    ### What changes were proposed in this pull request?
    Update ml-guide and ml-migration-guide for 3.0.

    ### Why are the changes needed?
    This is required for each release.

    ### Does this PR introduce any user-facing change?
    Yes.

    ![image](https://user-images.githubusercontent.com/13592258/75957386-c8699e80-5e6e-11ea-9dec-7295f8f0bf33.png)

    ![image](https://user-images.githubusercontent.com/13592258/75957406-cef81600-5e6e-11ea-921f-20509771b49b.png)

    ![image](https://user-images.githubusercontent.com/13592258/75957423-d4edf700-5e6e-11ea-8e75-d41c532c8ba9.png)

    ![image](https://user-images.githubusercontent.com/13592258/75957434-da4b4180-5e6e-11ea-899b-f4e080b318ff.png)

    ### How was this patch tested?
    Manually build and check.

    Closes #27785 from huaxingao/spark-30934.

    Authored-by: Huaxin Gao <huax...@us.ibm.com>
    Signed-off-by: Sean Owen <sro...@gmail.com>
---
 docs/ml-guide.md           | 46 ++++++++++++++++++++---------------------
 docs/ml-migration-guide.md | 51 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 72 insertions(+), 25 deletions(-)

diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index 7b4fa4f..2037285 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -41,8 +41,6 @@ The primary Machine Learning API for Spark is now the [DataFrame](sql-programmin
 * MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
 * MLlib will not add new features to the RDD-based API.
 * In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
-* After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
-* The RDD-based API is expected to be removed in Spark 3.0.
 
 *Why is MLlib switching to the DataFrame-based API?*
 
@@ -87,31 +85,31 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
 
 [^1]: To learn more about the benefits and background of system optimised natives, you may wish to
 watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
 
-# Highlights in 2.3
+# Highlights in 3.0
 
-The list below highlights some of the new features and enhancements added to MLlib in the `2.3`
+The list below highlights some of the new features and enhancements added to MLlib in the `3.0`
 release of Spark:
 
-* Built-in support for reading images into a `DataFrame` was added
-([SPARK-21866](https://issues.apache.org/jira/browse/SPARK-21866)).
-* [`OneHotEncoderEstimator`](ml-features.html#onehotencoderestimator) was added, and should be
-used instead of the existing `OneHotEncoder` transformer. The new estimator supports
-transforming multiple columns.
-* Multiple column support was also added to `QuantileDiscretizer` and `Bucketizer`
-([SPARK-22397](https://issues.apache.org/jira/browse/SPARK-22397) and
-[SPARK-20542](https://issues.apache.org/jira/browse/SPARK-20542))
-* A new [`FeatureHasher`](ml-features.html#featurehasher) transformer was added
- ([SPARK-13969](https://issues.apache.org/jira/browse/SPARK-13969)).
-* Added support for evaluating multiple models in parallel when performing cross-validation using
-[`TrainValidationSplit` or `CrossValidator`](ml-tuning.html)
-([SPARK-19357](https://issues.apache.org/jira/browse/SPARK-19357)).
-* Improved support for custom pipeline components in Python (see
-[SPARK-21633](https://issues.apache.org/jira/browse/SPARK-21633) and
-[SPARK-21542](https://issues.apache.org/jira/browse/SPARK-21542)).
-* `DataFrame` functions for descriptive summary statistics over vector columns
-([SPARK-19634](https://issues.apache.org/jira/browse/SPARK-19634)).
-* Robust linear regression with Huber loss
-([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)).
+* Multiple-column support was added to `Binarizer` ([SPARK-23578](https://issues.apache.org/jira/browse/SPARK-23578)), `StringIndexer` ([SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215)), `StopWordsRemover` ([SPARK-29808](https://issues.apache.org/jira/browse/SPARK-29808)) and PySpark `QuantileDiscretizer` ([SPARK-22796](https://issues.apache.org/jira/browse/SPARK-22796)).
+* Support for Tree-Based Feature Transformation was added
+([SPARK-13677](https://issues.apache.org/jira/browse/SPARK-13677)).
+* Two new evaluators, `MultilabelClassificationEvaluator` ([SPARK-16692](https://issues.apache.org/jira/browse/SPARK-16692)) and `RankingEvaluator` ([SPARK-28045](https://issues.apache.org/jira/browse/SPARK-28045)), were added.
+* Sample weights support was added in `DecisionTreeClassifier/Regressor` ([SPARK-19591](https://issues.apache.org/jira/browse/SPARK-19591)), `RandomForestClassifier/Regressor` ([SPARK-9478](https://issues.apache.org/jira/browse/SPARK-9478)), `GBTClassifier/Regressor` ([SPARK-9612](https://issues.apache.org/jira/browse/SPARK-9612)), `RegressionEvaluator` ([SPARK-24102](https://issues.apache.org/jira/browse/SPARK-24102)), `BinaryClassificationEvaluator` ([SPARK-24103](https://issues.apach [...]
+* R API for `PowerIterationClustering` was added
+([SPARK-19827](https://issues.apache.org/jira/browse/SPARK-19827)).
+* Added a Spark ML listener for tracking ML pipeline status
+([SPARK-23674](https://issues.apache.org/jira/browse/SPARK-23674)).
+* Fit with validation set was added to Gradient Boosted Trees in Python
+([SPARK-24333](https://issues.apache.org/jira/browse/SPARK-24333)).
+* [`RobustScaler`](ml-features.html#robustscaler) transformer was added
+([SPARK-28399](https://issues.apache.org/jira/browse/SPARK-28399)).
+* [`Factorization Machines`](ml-classification-regression.html#factorization-machines) classifier and regressor were added
+([SPARK-29224](https://issues.apache.org/jira/browse/SPARK-29224)).
+* Gaussian Naive Bayes ([SPARK-16872](https://issues.apache.org/jira/browse/SPARK-16872)) and Complement Naive Bayes ([SPARK-29942](https://issues.apache.org/jira/browse/SPARK-29942)) were added.
+* ML function parity between Scala and Python
+([SPARK-28958](https://issues.apache.org/jira/browse/SPARK-28958)).
+* `predictRaw` is made public in all the classification models. `predictProbability` is made public in all the classification models except `LinearSVCModel`
+([SPARK-30358](https://issues.apache.org/jira/browse/SPARK-30358)).
 
 # Migration Guide
 
diff --git a/docs/ml-migration-guide.md b/docs/ml-migration-guide.md
index f3cd762..4e6d68f 100644
--- a/docs/ml-migration-guide.md
+++ b/docs/ml-migration-guide.md
@@ -33,16 +33,65 @@ Please refer [Migration Guide: SQL, Datasets and DataFrame](sql-migration-guide.
 * `OneHotEncoder`, which is deprecated in 2.3, is removed in 3.0, and `OneHotEncoderEstimator` is now renamed to `OneHotEncoder`.
 * `org.apache.spark.ml.image.ImageSchema.readImages`, which is deprecated in 2.3, is removed in 3.0; use `spark.read.format('image')` instead.
+* `org.apache.spark.mllib.clustering.KMeans.train` with the Int param `runs`, which is deprecated in 2.1, is removed in 3.0. Use the `train` method without `runs` instead.
+* `org.apache.spark.mllib.classification.LogisticRegressionWithSGD`, which is deprecated in 2.0, is removed in 3.0; use `org.apache.spark.ml.classification.LogisticRegression` or `spark.mllib.classification.LogisticRegressionWithLBFGS` instead.
+* `org.apache.spark.mllib.feature.ChiSqSelectorModel.isSorted`, which is deprecated in 2.1, is removed in 3.0; it is not intended for use by subclasses.
+* `org.apache.spark.mllib.regression.RidgeRegressionWithSGD`, which is deprecated in 2.0, is removed in 3.0; use `org.apache.spark.ml.regression.LinearRegression` with `elasticNetParam` = 0.0 instead. Note the default `regParam` is 0.01 for `RidgeRegressionWithSGD`, but is 0.0 for `LinearRegression`.
+* `org.apache.spark.mllib.regression.LassoWithSGD`, which is deprecated in 2.0, is removed in 3.0; use `org.apache.spark.ml.regression.LinearRegression` with `elasticNetParam` = 1.0 instead. Note the default `regParam` is 0.01 for `LassoWithSGD`, but is 0.0 for `LinearRegression`.
+* `org.apache.spark.mllib.regression.LinearRegressionWithSGD`, which is deprecated in 2.0, is removed in 3.0; use `org.apache.spark.ml.regression.LinearRegression` or `LBFGS` instead.
+* `org.apache.spark.mllib.clustering.KMeans.getRuns` and `setRuns`, which are deprecated in 2.1, are removed in 3.0; they have had no effect since Spark 2.0.0.
+* `org.apache.spark.ml.LinearSVCModel.setWeightCol`, which is deprecated in 2.4, is removed in 3.0; it is not intended for users.
+* From 3.0, `org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel` extends `MultilayerPerceptronParams` to expose the training params. As a result, `layers` in `MultilayerPerceptronClassificationModel` has been changed from `Array[Int]` to `IntArrayParam`. Users should use `MultilayerPerceptronClassificationModel.getLayers` instead of `MultilayerPerceptronClassificationModel.layers` to retrieve the sizes of the layers.
+* `org.apache.spark.ml.classification.GBTClassifier.numTrees`, which is deprecated in 2.4.5, is removed in 3.0; use `getNumTrees` instead.
+* `org.apache.spark.ml.clustering.KMeansModel.computeCost`, which is deprecated in 2.4, is removed in 3.0; use `ClusteringEvaluator` instead.
+* The member variable `precision` in `org.apache.spark.mllib.evaluation.MulticlassMetrics`, which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
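As a side note on the `elasticNetParam` mapping above (0.0 recovers a ridge penalty, 1.0 a lasso penalty), here is a minimal pure-Python sketch of the standard elastic net penalty term. This is illustrative only, not Spark code; the function name and example weights are made up:

```python
# Elastic net penalty sketch:
#   penalty = reg_param * (alpha * L1(w) + (1 - alpha) / 2 * L2(w)^2)
# where alpha plays the role of elasticNetParam.
def elastic_net_penalty(weights, reg_param, alpha):
    l1 = sum(abs(w) for w in weights)       # L1 norm
    l2_sq = sum(w * w for w in weights)     # squared L2 norm
    return reg_param * (alpha * l1 + (1 - alpha) / 2 * l2_sq)

w = [3.0, -4.0]  # L1 = 7, squared L2 = 25

# alpha = 0.0 reduces to a pure L2 (ridge) penalty: 0.1 * 25 / 2 = 1.25
ridge = elastic_net_penalty(w, reg_param=0.1, alpha=0.0)

# alpha = 1.0 reduces to a pure L1 (lasso) penalty: 0.1 * 7 = 0.7
lasso = elastic_net_penalty(w, reg_param=0.1, alpha=1.0)

print(ridge, lasso)  # 1.25 0.7
```

This is why the migration notes also flag the `regParam` default: the penalty scales linearly with `regParam`, so moving from a 0.01 default to 0.0 silently turns regularization off.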
+* The member variable `recall` in `org.apache.spark.mllib.evaluation.MulticlassMetrics`, which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
+* The member variable `fMeasure` in `org.apache.spark.mllib.evaluation.MulticlassMetrics`, which is deprecated in 2.0, is removed in 3.0. Use `accuracy` instead.
+* `org.apache.spark.ml.util.GeneralMLWriter.context`, which is deprecated in 2.0, is removed in 3.0; use `session` instead.
+* `org.apache.spark.ml.util.MLWriter.context`, which is deprecated in 2.0, is removed in 3.0; use `session` instead.
+* `org.apache.spark.ml.util.MLReader.context`, which is deprecated in 2.0, is removed in 3.0; use `session` instead.
+* `abstract class UnaryTransformer[IN, OUT, T <: UnaryTransformer[IN, OUT, T]]` is changed to `abstract class UnaryTransformer[IN: TypeTag, OUT: TypeTag, T <: UnaryTransformer[IN, OUT, T]]` in 3.0.
 
-### Changes of behavior
+### Deprecations and changes of behavior
 {:.no_toc}
 
+**Deprecations**
+
+* [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215):
+`labels` in `StringIndexerModel` is deprecated and will be removed in 3.1.0. Use `labelsArray` instead.
+* [SPARK-25758](https://issues.apache.org/jira/browse/SPARK-25758):
+`computeCost` in `BisectingKMeansModel` is deprecated and will be removed in future versions. Use `ClusteringEvaluator` instead.
+
+**Changes of behavior**
+
 * [SPARK-11215](https://issues.apache.org/jira/browse/SPARK-11215):
 In Spark 2.4 and previous versions, when specifying `frequencyDesc` or `frequencyAsc` as the
 `stringOrderType` param in `StringIndexer`, in case of equal frequency the order of strings is
 undefined. Since Spark 3.0, strings with equal frequency are further sorted alphabetically. Also,
 since Spark 3.0, `StringIndexer` supports encoding multiple columns.
 * [SPARK-20604](https://issues.apache.org/jira/browse/SPARK-20604):
+In releases prior to 3.0, `Imputer` requires the input column to be Double or Float.
+In 3.0, this restriction is lifted so `Imputer` can handle all numeric types.
+* [SPARK-23469](https://issues.apache.org/jira/browse/SPARK-23469):
+In Spark 3.0, the `HashingTF` Transformer uses a corrected implementation of the murmur3 hash
+function to hash elements to vectors. `HashingTF` in Spark 3.0 will map elements to
+different positions in vectors than in Spark 2.x. However, a `HashingTF` created with Spark 2.x
+and loaded with Spark 3.0 will still use the previous hash function and will not change behavior.
+* [SPARK-28969](https://issues.apache.org/jira/browse/SPARK-28969):
+The `setClassifier` method in PySpark's `OneVsRestModel` has been removed in 3.0 for parity with
+the Scala implementation. Callers should not need to set the classifier in the model after
+creation.
+* [SPARK-25790](https://issues.apache.org/jira/browse/SPARK-25790):
+PCA adds support for matrices with more than 65535 columns in Spark 3.0.
+* [SPARK-28927](https://issues.apache.org/jira/browse/SPARK-28927):
+When fitting an ALS model on nondeterministic input data, previously, if a rerun happened, users
+would see an ArrayIndexOutOfBoundsException caused by a mismatch between the In/Out user/item blocks.
+From 3.0, a SparkException with a clearer message is thrown, and the original
+ArrayIndexOutOfBoundsException is wrapped.
+* [SPARK-29232](https://issues.apache.org/jira/browse/SPARK-29232):
+In releases prior to 3.0, `RandomForestRegressionModel` doesn't update the parameter maps
+of the underlying DecisionTreeRegressionModels. This is fixed in 3.0.
 
 ## Upgrading from MLlib 2.2 to 2.3

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
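As a closing illustration of the SPARK-11215 `StringIndexer` ordering change described in the migration notes above, here is a small pure-Python sketch of `frequencyDesc`-style ordering with the Spark 3.0 alphabetical tie-break. This is illustrative only, not Spark code; the label data is made up:

```python
from collections import Counter

labels = ["b", "a", "a", "c", "b", "d"]
freq = Counter(labels)  # a: 2, b: 2, c: 1, d: 1

# Spark 3.0-style frequencyDesc ordering: most frequent first, and
# labels with equal frequency are further sorted alphabetically
# (in Spark 2.4 and earlier the tie order was undefined).
ordered = sorted(freq, key=lambda label: (-freq[label], label))
index = {label: i for i, label in enumerate(ordered)}

print(ordered)  # ['a', 'b', 'c', 'd']
print(index)    # {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```

In 2.4, the tie between `a` and `b` (and between `c` and `d`) could resolve either way across runs; the deterministic tie-break is what makes 3.0 indices reproducible.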