svn commit: r1696648 - in /spark: mllib/index.md site/mllib/index.html
Author: meng
Date: Wed Aug 19 19:11:08 2015
New Revision: 1696648

URL: http://svn.apache.org/r1696648
Log: update MLlib page for 1.5

Modified:
    spark/mllib/index.md
    spark/site/mllib/index.html

Modified: spark/mllib/index.md
URL: http://svn.apache.org/viewvc/spark/mllib/index.md?rev=1696648&r1=1696647&r2=1696648&view=diff
==============================================================================
--- spark/mllib/index.md (original)
+++ spark/mllib/index.md Wed Aug 19 19:11:08 2015
@@ -14,7 +14,7 @@ subproject: MLlib
   <div class="col-md-7 col-sm-7">
     <h2>Ease of Use</h2>
     <p class="lead">
-      Usable in Java, Scala and Python.
+      Usable in Java, Scala, Python, and SparkR.
     </p>
     <p>
       MLlib fits into <a href="{{site.url}}">Spark</a>'s
@@ -83,22 +83,25 @@ subproject: MLlib
   <div class="col-md-4 col-padded">
     <h3>Algorithms</h3>
     <p>
-      MLlib 1.3 contains the following algorithms:
+      MLlib contains the following algorithms and utilities:
     </p>
     <ul class="list-narrow">
-      <li>linear SVM and logistic regression</li>
+      <li>logistic regression and linear support vector machine (SVM)</li>
       <li>classification and regression tree</li>
       <li>random forest and gradient-boosted trees</li>
-      <li>recommendation via alternating least squares</li>
-      <li>clustering via k-means, Gaussian mixtures, and power iteration clustering</li>
-      <li>topic modeling via latent Dirichlet allocation</li>
-      <li>singular value decomposition</li>
-      <li>linear regression with L<sub>1</sub>- and L<sub>2</sub>-regularization</li>
+      <li>recommendation via alternating least squares (ALS)</li>
+      <li>clustering via k-means, Gaussian mixtures (GMM), and power iteration clustering</li>
+      <li>topic modeling via latent Dirichlet allocation (LDA)</li>
+      <li>singular value decomposition (SVD) and QR decomposition</li>
+      <li>principal component analysis (PCA)</li>
+      <li>linear regression with L<sub>1</sub>, L<sub>2</sub>, and elastic-net regularization</li>
       <li>isotonic regression</li>
-      <li>multinomial naive Bayes</li>
-      <li>frequent itemset mining via FP-growth</li>
-      <li>basic statistics</li>
+      <li>multinomial/binomial naive Bayes</li>
+      <li>frequent itemset mining via FP-growth and association rules</li>
+      <li>sequential pattern mining via PrefixSpan</li>
+      <li>summary statistics and hypothesis testing</li>
       <li>feature transformations</li>
+      <li>model evaluation and hyper-parameter tuning</li>
     </ul>
     <p>Refer to the <a href="{{site.url}}docs/latest/mllib-guide.html">MLlib guide</a> for usage examples.</p>
   </div>

Modified: spark/site/mllib/index.html
URL: http://svn.apache.org/viewvc/spark/site/mllib/index.html?rev=1696648&r1=1696647&r2=1696648&view=diff
==============================================================================
--- spark/site/mllib/index.html (original)
+++ spark/site/mllib/index.html Wed Aug 19 19:11:08 2015
@@ -178,7 +178,7 @@
   <div class="col-md-7 col-sm-7">
     <h2>Ease of Use</h2>
     <p class="lead">
-      Usable in Java, Scala and Python.
+      Usable in Java, Scala, Python, and SparkR.
     </p>
     <p>
       MLlib fits into <a href="/">Spark</a>'s
@@ -250,22 +250,25 @@
   <div class="col-md-4 col-padded">
     <h3>Algorithms</h3>
     <p>
-      MLlib 1.3 contains the following algorithms:
+      MLlib contains the following algorithms and utilities:
    </p>
    <ul class="list-narrow">
-      <li>linear SVM and logistic regression</li>
+      <li>logistic regression and linear support vector machine (SVM)</li>
      <li>classification and regression tree</li>
      <li>random forest and gradient-boosted trees</li>
-      <li>recommendation via alternating least squares</li>
-      <li>clustering via k-means, Gaussian mixtures, and power iteration clustering</li>
-      <li>topic modeling via latent Dirichlet allocation</li>
-      <li>singular value decomposition</li>
-      <li>linear regression with L<sub>1</sub>- and L<sub>2</sub>-regularization</li>
+      <li>recommendation via alternating least squares (ALS)</li>
+      <li>clustering via k-means, Gaussian mixtures (GMM), and power iteration clustering</li>
+      <li>topic modeling via latent Dirichlet allocation (LDA)</li>
+      <li>singular value decomposition (SVD) and QR decomposition</li>
+      <li>principal component analysis (PCA)</li>
+      <li>linear regression with L<sub>1</sub>, L<sub>2</sub>, and elastic-net regularization</li>
      <li>isotonic regression</li>
-      <li>multinomial naive Bayes</li>
-      <li>frequent itemset mining via FP-growth</li>
-      <li>basic statistics</li>
+      <li>multinomial/binomial naive Bayes</li>
+      <li>frequent itemset mining via FP-growth and association rules</li>
+      <li>sequential pattern mining via PrefixSpan</li>
+      <li>summary statistics and hypothesis testing</li>
      <li>feature transformations</li>
+      <li>model evaluation and hyper-parameter tuning</li>
    </ul>
    <p>Refer to the <a href="/docs/latest/mllib-guide.html">MLlib guide</a> for usage examples.</p>
  </div>

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr
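The updated list above mentions linear regression with L1, L2, and elastic-net regularization. As a quick illustration (plain Python, not MLlib code), the elastic-net penalty blends the two norms via a mixing parameter; the `reg_param`/`alpha` names below are ours, chosen to mirror MLlib's `regParam`/`elasticNetParam`:

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """Elastic-net penalty: reg_param * (alpha * L1 + (1 - alpha) * L2/2).

    alpha = 1.0 recovers the pure L1 (lasso) penalty, alpha = 0.0 the pure
    L2 (ridge) penalty; values in between blend the two.
    """
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights) / 2.0
    return reg_param * (alpha * l1 + (1.0 - alpha) * l2)

w = [1.0, -2.0, 0.5]
print(elastic_net_penalty(w, reg_param=0.1, alpha=1.0))  # pure L1: 0.1 * 3.5 = 0.35
print(elastic_net_penalty(w, reg_param=0.1, alpha=0.0))  # pure L2: 0.1 * 2.625 = 0.2625
```

Any alpha strictly between 0 and 1 gives a penalty between these two extremes.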
spark git commit: [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering
Repository: spark
Updated Branches:
  refs/heads/master d898c33f7 -> 5b62bef8c

[SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering

This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659).

MechCoder

Closes #8256

Author: Xiangrui Meng <m...@databricks.com>
Author: Xiaoqing Wang <spark...@126.com>
Author: MechCoder <manojkumarsivaraj...@gmail.com>

Closes #8288 from mengxr/SPARK-8918.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5b62bef8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5b62bef8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5b62bef8

Branch: refs/heads/master
Commit: 5b62bef8cbf73f910513ef3b1f557aa94b384854
Parents: d898c33
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 19 13:17:26 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 19 13:17:26 2015 -0700

----------------------------------------------------------------------
 .../mllib/clustering/GaussianMixture.scala      | 56 +++
 .../mllib/clustering/GaussianMixtureModel.scala | 32 +++--
 .../apache/spark/mllib/clustering/KMeans.scala  | 36 +-
 .../spark/mllib/clustering/KMeansModel.scala    | 37 --
 .../org/apache/spark/mllib/clustering/LDA.scala | 71 +---
 .../spark/mllib/clustering/LDAModel.scala       | 64 --
 .../spark/mllib/clustering/LDAOptimizer.scala   | 12 +++-
 .../clustering/PowerIterationClustering.scala   | 29 +++-
 .../mllib/clustering/StreamingKMeans.scala      | 53 ---
 9 files changed, 338 insertions(+), 52 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/5b62bef8/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index e459367..bc27b1f 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -62,6 +62,7 @@ class GaussianMixture private (
   /**
    * Constructs a default instance. The default parameters are {k: 2, convergenceTol: 0.01,
    * maxIterations: 100, seed: random}.
+   * @since 1.3.0
    */
   def this() = this(2, 0.01, 100, Utils.random.nextLong())

@@ -72,9 +73,11 @@ class GaussianMixture private (
   // default random starting point
   private var initialModel: Option[GaussianMixtureModel] = None

-  /** Set the initial GMM starting point, bypassing the random initialization.
-   * You must call setK() prior to calling this method, and the condition
-   * (model.k == this.k) must be met; failure will result in an IllegalArgumentException
+  /**
+   * Set the initial GMM starting point, bypassing the random initialization.
+   * You must call setK() prior to calling this method, and the condition
+   * (model.k == this.k) must be met; failure will result in an IllegalArgumentException
+   * @since 1.3.0
    */
   def setInitialModel(model: GaussianMixtureModel): this.type = {
     if (model.k == k) {
@@ -85,30 +88,46 @@ class GaussianMixture private (
     this
   }

-  /** Return the user supplied initial GMM, if supplied */
+  /**
+   * Return the user supplied initial GMM, if supplied
+   * @since 1.3.0
+   */
   def getInitialModel: Option[GaussianMixtureModel] = initialModel

-  /** Set the number of Gaussians in the mixture model. Default: 2 */
+  /**
+   * Set the number of Gaussians in the mixture model. Default: 2
+   * @since 1.3.0
+   */
   def setK(k: Int): this.type = {
     this.k = k
     this
   }

-  /** Return the number of Gaussians in the mixture model */
+  /**
+   * Return the number of Gaussians in the mixture model
+   * @since 1.3.0
+   */
   def getK: Int = k

-  /** Set the maximum number of iterations to run. Default: 100 */
+  /**
+   * Set the maximum number of iterations to run. Default: 100
+   * @since 1.3.0
+   */
   def setMaxIterations(maxIterations: Int): this.type = {
     this.maxIterations = maxIterations
     this
   }

-  /** Return the maximum number of iterations to run */
+  /**
+   * Return the maximum number of iterations to run
+   * @since 1.3.0
+   */
   def getMaxIterations: Int = maxIterations

   /**
    * Set the largest change in log-likelihood at which convergence is
    * considered to have occurred.
+   * @since 1.3.0
    */
   def setConvergenceTol(convergenceTol: Double): this.type = {
     this.convergenceTol = convergenceTol
@@ -118,19 +137,29 @@ class GaussianMixture private (

   /**
    * Return the largest
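The `convergenceTol` parameter documented in the diff above controls when Gaussian mixture EM stops. A rough pure-Python sketch of that stopping rule (illustrative only, not the Scala implementation; the log-likelihood values below are made up):

```python
def has_converged(log_likelihood, prev_log_likelihood, convergence_tol=0.01):
    """Stop when the change in log-likelihood between EM iterations
    falls below the tolerance, in the style of setConvergenceTol."""
    return abs(log_likelihood - prev_log_likelihood) < convergence_tol

# Skeleton of an EM-style loop over a fabricated log-likelihood trace.
ll_trace = [-120.0, -80.0, -79.995]
prev = float("-inf")
iters = 0
for ll in ll_trace:
    if has_converged(ll, prev):
        break
    prev = ll
    iters += 1
print(iters)  # 2: the third value improves by only 0.005 < 0.01, so EM stops
```

A smaller tolerance trades extra iterations for a tighter fit; the default shown in the diff is 0.01.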
spark git commit: [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 ba369258d -> 8c0a5a248

[SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering

This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659).

MechCoder

Closes #8256

Author: Xiangrui Meng <m...@databricks.com>
Author: Xiaoqing Wang <spark...@126.com>
Author: MechCoder <manojkumarsivaraj...@gmail.com>

Closes #8288 from mengxr/SPARK-8918.

(cherry picked from commit 5b62bef8cbf73f910513ef3b1f557aa94b384854)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8c0a5a24
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8c0a5a24
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8c0a5a24

Branch: refs/heads/branch-1.5
Commit: 8c0a5a2485d899e9a58d431b395d2a3f3bf4c5c1
Parents: ba36925
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 19 13:17:26 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 19 13:17:34 2015 -0700

----------------------------------------------------------------------
 .../mllib/clustering/GaussianMixture.scala      | 56 +++
 .../mllib/clustering/GaussianMixtureModel.scala | 32 +++--
 .../apache/spark/mllib/clustering/KMeans.scala  | 36 +-
 .../spark/mllib/clustering/KMeansModel.scala    | 37 --
 .../org/apache/spark/mllib/clustering/LDA.scala | 71 +---
 .../spark/mllib/clustering/LDAModel.scala       | 64 --
 .../spark/mllib/clustering/LDAOptimizer.scala   | 12 +++-
 .../clustering/PowerIterationClustering.scala   | 29 +++-
 .../mllib/clustering/StreamingKMeans.scala      | 53 ---
 9 files changed, 338 insertions(+), 52 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/8c0a5a24/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index e459367..bc27b1f 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -62,6 +62,7 @@ class GaussianMixture private (
   /**
    * Constructs a default instance. The default parameters are {k: 2, convergenceTol: 0.01,
    * maxIterations: 100, seed: random}.
+   * @since 1.3.0
    */
   def this() = this(2, 0.01, 100, Utils.random.nextLong())

@@ -72,9 +73,11 @@ class GaussianMixture private (
   // default random starting point
   private var initialModel: Option[GaussianMixtureModel] = None

-  /** Set the initial GMM starting point, bypassing the random initialization.
-   * You must call setK() prior to calling this method, and the condition
-   * (model.k == this.k) must be met; failure will result in an IllegalArgumentException
+  /**
+   * Set the initial GMM starting point, bypassing the random initialization.
+   * You must call setK() prior to calling this method, and the condition
+   * (model.k == this.k) must be met; failure will result in an IllegalArgumentException
+   * @since 1.3.0
    */
   def setInitialModel(model: GaussianMixtureModel): this.type = {
     if (model.k == k) {
@@ -85,30 +88,46 @@ class GaussianMixture private (
     this
   }

-  /** Return the user supplied initial GMM, if supplied */
+  /**
+   * Return the user supplied initial GMM, if supplied
+   * @since 1.3.0
+   */
   def getInitialModel: Option[GaussianMixtureModel] = initialModel

-  /** Set the number of Gaussians in the mixture model. Default: 2 */
+  /**
+   * Set the number of Gaussians in the mixture model. Default: 2
+   * @since 1.3.0
+   */
   def setK(k: Int): this.type = {
     this.k = k
     this
   }

-  /** Return the number of Gaussians in the mixture model */
+  /**
+   * Return the number of Gaussians in the mixture model
+   * @since 1.3.0
+   */
   def getK: Int = k

-  /** Set the maximum number of iterations to run. Default: 100 */
+  /**
+   * Set the maximum number of iterations to run. Default: 100
+   * @since 1.3.0
+   */
   def setMaxIterations(maxIterations: Int): this.type = {
     this.maxIterations = maxIterations
     this
   }

-  /** Return the maximum number of iterations to run */
+  /**
+   * Return the maximum number of iterations to run
+   * @since 1.3.0
+   */
   def getMaxIterations: Int = maxIterations

   /**
    * Set the largest change in log-likelihood at which convergence is
    * considered to have occurred.
+   * @since 1.3.0
    */
   def setConvergenceTol(convergenceTol: Double): this.type
spark git commit: [SPARK-9895] User Guide for RFormula Feature Transformer
Repository: spark
Updated Branches:
  refs/heads/master b0dbaec4f -> 8e0a072f7

[SPARK-9895] User Guide for RFormula Feature Transformer

mengxr

Author: Eric Liang <e...@databricks.com>

Closes #8293 from ericl/docs-2.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8e0a072f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8e0a072f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8e0a072f

Branch: refs/heads/master
Commit: 8e0a072f78b4902d5f7ccc6b15232ed202a117f9
Parents: b0dbaec
Author: Eric Liang <e...@databricks.com>
Authored: Wed Aug 19 15:43:08 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 19 15:43:08 2015 -0700

----------------------------------------------------------------------
 docs/ml-features.md                             | 108 +++
 .../org/apache/spark/ml/feature/RFormula.scala  |   4 +-
 2 files changed, 110 insertions(+), 2 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/8e0a072f/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index d0e8eeb..6309db9 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1477,3 +1477,111 @@ print(output.select("features", "clicked").first())
 </div>
 </div>

+## RFormula
+
+`RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html). It produces a vector column of features and a double column of labels. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
+
+**Examples**
+
+Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`:
+
+~~~
+id | country | hour | clicked
+---|---------|------|---------
+ 7 | "US"    | 18   | 1.0
+ 8 | "CA"    | 12   | 0.0
+ 9 | "NZ"    | 15   | 0.0
+~~~
+
+If we use `RFormula` with a formula string of `clicked ~ country + hour`, which indicates that we want to
+predict `clicked` based on `country` and `hour`, after transformation we should get the following DataFrame:
+
+~~~
+id | country | hour | clicked | features         | label
+---|---------|------|---------|------------------|-------
+ 7 | "US"    | 18   | 1.0     | [0.0, 0.0, 18.0] | 1.0
+ 8 | "CA"    | 12   | 0.0     | [0.0, 1.0, 12.0] | 0.0
+ 9 | "NZ"    | 15   | 0.0     | [1.0, 0.0, 15.0] | 0.0
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`RFormula`](api/scala/index.html#org.apache.spark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight scala %}
+import org.apache.spark.ml.feature.RFormula
+
+val dataset = sqlContext.createDataFrame(Seq(
+  (7, "US", 18, 1.0),
+  (8, "CA", 12, 0.0),
+  (9, "NZ", 15, 0.0)
+)).toDF("id", "country", "hour", "clicked")
+val formula = new RFormula()
+  .setFormula("clicked ~ country + hour")
+  .setFeaturesCol("features")
+  .setLabelCol("label")
+val output = formula.fit(dataset).transform(dataset)
+output.select("features", "label").show()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+[`RFormula`](api/java/org/apache/spark/ml/feature/RFormula.html) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.RFormula;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.*;
+import static org.apache.spark.sql.types.DataTypes.*;
+
+StructType schema = createStructType(new StructField[] {
+  createStructField("id", IntegerType, false),
+  createStructField("country", StringType, false),
+  createStructField("hour", IntegerType, false),
+  createStructField("clicked", DoubleType, false)
+});
+JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(7, "US", 18, 1.0),
+  RowFactory.create(8, "CA", 12, 0.0),
+  RowFactory.create(9, "NZ", 15, 0.0)
+));
+DataFrame dataset = sqlContext.createDataFrame(rdd, schema);
+
+RFormula formula = new RFormula()
+  .setFormula("clicked ~ country + hour")
+  .setFeaturesCol("features")
+  .setLabelCol("label");
+
+DataFrame output = formula.fit(dataset).transform(dataset);
+output.select("features", "label").show();
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+[`RFormula`](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight python %}
+from pyspark.ml.feature import RFormula
+
+dataset = sqlContext.createDataFrame(
+    [(7
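The transformation the guide above describes can be approximated outside Spark. The sketch below is a hypothetical helper, not the RFormula API: it reproduces the example table by one-hot encoding `country` against an assumed level ordering (with `"US"` as the dropped reference level), casting `hour` to a double, and copying `clicked` into the label slot:

```python
def r_formula_transform(rows, factor_levels):
    """Rough sketch of what `clicked ~ country + hour` produces: R-style
    dummy coding of the string column (the reference level gets all zeros),
    the numeric column cast to float, and the response copied to the label.
    Illustrative only -- the real RFormula derives levels from the data.
    """
    out = []
    for _id, country, hour, clicked in rows:
        features = [1.0 if country == lvl else 0.0 for lvl in factor_levels]
        features.append(float(hour))
        out.append((_id, features, clicked))
    return out

rows = [(7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0)]
# With "US" as the dropped reference level and the ordering ["NZ", "CA"],
# this matches the features/label columns in the table above.
for row in r_formula_transform(rows, factor_levels=["NZ", "CA"]):
    print(row)
```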
spark git commit: [SPARK-9895] User Guide for RFormula Feature Transformer
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 5c749c82c -> 56a37b01f

[SPARK-9895] User Guide for RFormula Feature Transformer

mengxr

Author: Eric Liang <e...@databricks.com>

Closes #8293 from ericl/docs-2.

(cherry picked from commit 8e0a072f78b4902d5f7ccc6b15232ed202a117f9)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56a37b01
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56a37b01
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56a37b01

Branch: refs/heads/branch-1.5
Commit: 56a37b01fd07f4f1a8cb4e07b55e1a02cf23a5f7
Parents: 5c749c8
Author: Eric Liang <e...@databricks.com>
Authored: Wed Aug 19 15:43:08 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 19 15:43:15 2015 -0700

----------------------------------------------------------------------
 docs/ml-features.md                             | 108 +++
 .../org/apache/spark/ml/feature/RFormula.scala  |   4 +-
 2 files changed, 110 insertions(+), 2 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/56a37b01/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index d0e8eeb..6309db9 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1477,3 +1477,111 @@ print(output.select("features", "clicked").first())
 </div>
 </div>

+## RFormula
+
+`RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html). It produces a vector column of features and a double column of labels. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
+
+**Examples**
+
+Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`:
+
+~~~
+id | country | hour | clicked
+---|---------|------|---------
+ 7 | "US"    | 18   | 1.0
+ 8 | "CA"    | 12   | 0.0
+ 9 | "NZ"    | 15   | 0.0
+~~~
+
+If we use `RFormula` with a formula string of `clicked ~ country + hour`, which indicates that we want to
+predict `clicked` based on `country` and `hour`, after transformation we should get the following DataFrame:
+
+~~~
+id | country | hour | clicked | features         | label
+---|---------|------|---------|------------------|-------
+ 7 | "US"    | 18   | 1.0     | [0.0, 0.0, 18.0] | 1.0
+ 8 | "CA"    | 12   | 0.0     | [0.0, 1.0, 12.0] | 0.0
+ 9 | "NZ"    | 15   | 0.0     | [1.0, 0.0, 15.0] | 0.0
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`RFormula`](api/scala/index.html#org.apache.spark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight scala %}
+import org.apache.spark.ml.feature.RFormula
+
+val dataset = sqlContext.createDataFrame(Seq(
+  (7, "US", 18, 1.0),
+  (8, "CA", 12, 0.0),
+  (9, "NZ", 15, 0.0)
+)).toDF("id", "country", "hour", "clicked")
+val formula = new RFormula()
+  .setFormula("clicked ~ country + hour")
+  .setFeaturesCol("features")
+  .setLabelCol("label")
+val output = formula.fit(dataset).transform(dataset)
+output.select("features", "label").show()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+[`RFormula`](api/java/org/apache/spark/ml/feature/RFormula.html) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.RFormula;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.*;
+import static org.apache.spark.sql.types.DataTypes.*;
+
+StructType schema = createStructType(new StructField[] {
+  createStructField("id", IntegerType, false),
+  createStructField("country", StringType, false),
+  createStructField("hour", IntegerType, false),
+  createStructField("clicked", DoubleType, false)
+});
+JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(7, "US", 18, 1.0),
+  RowFactory.create(8, "CA", 12, 0.0),
+  RowFactory.create(9, "NZ", 15, 0.0)
+));
+DataFrame dataset = sqlContext.createDataFrame(rdd, schema);
+
+RFormula formula = new RFormula()
+  .setFormula("clicked ~ country + hour")
+  .setFeaturesCol("features")
+  .setLabelCol("label");
+
+DataFrame output = formula.fit(dataset).transform(dataset);
+output.select("features", "label").show();
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+[`RFormula`](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output
spark git commit: [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public
Repository: spark
Updated Branches:
  refs/heads/master 5af3838d2 -> dd0614fd6

[SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public

Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel` currently.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #8263 from yanboliang/mlp-public.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd0614fd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd0614fd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd0614fd

Branch: refs/heads/master
Commit: dd0614fd618ad28cb77aecfbd49bb319b98fdba0
Parents: 5af3838
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Mon Aug 17 23:57:02 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Mon Aug 17 23:57:02 2015 -0700

----------------------------------------------------------------------
 .../spark/ml/classification/MultilayerPerceptronClassifier.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/dd0614fd/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
index c154561..ccca4ec 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
@@ -172,8 +172,8 @@ class MultilayerPerceptronClassifier(override val uid: String)
 @Experimental
 class MultilayerPerceptronClassificationModel private[ml] (
     override val uid: String,
-    layers: Array[Int],
-    weights: Vector)
+    val layers: Array[Int],
+    val weights: Vector)
   extends PredictionModel[Vector, MultilayerPerceptronClassificationModel]
   with Serializable {
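With `layers` public, users can reason about model size. As a back-of-the-envelope illustration (plain Python; it assumes one bias term per non-input unit, which may not match Spark's exact network topology), the number of entries one would expect in a `weights` vector for a given `layers` array:

```python
def mlp_weight_count(layers):
    """Connection weights plus biases in a fully connected feed-forward
    network described by a layers array, e.g. [4, 5, 4, 3]: 4 inputs,
    two hidden layers of 5 and 4 units, and 3 output classes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

print(mlp_weight_count([4, 5, 4, 3]))  # (4+1)*5 + (5+1)*4 + (4+1)*3 = 64
```

Exposing both fields lets callers sanity-check that a loaded model's `weights` length is consistent with its `layers`.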
spark git commit: [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 e5fbe4f24 -> 40b89c38a

[SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public

Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel` currently.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #8263 from yanboliang/mlp-public.

(cherry picked from commit dd0614fd618ad28cb77aecfbd49bb319b98fdba0)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/40b89c38
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/40b89c38
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/40b89c38

Branch: refs/heads/branch-1.5
Commit: 40b89c38ada5edfdd1478dc8f3c983ebcbcc56d5
Parents: e5fbe4f
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Mon Aug 17 23:57:02 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Mon Aug 17 23:57:14 2015 -0700

----------------------------------------------------------------------
 .../spark/ml/classification/MultilayerPerceptronClassifier.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/40b89c38/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
index c154561..ccca4ec 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
@@ -172,8 +172,8 @@ class MultilayerPerceptronClassifier(override val uid: String)
 @Experimental
 class MultilayerPerceptronClassificationModel private[ml] (
     override val uid: String,
-    layers: Array[Int],
-    weights: Vector)
+    val layers: Array[Int],
+    val weights: Vector)
   extends PredictionModel[Vector, MultilayerPerceptronClassificationModel]
   with Serializable {
spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 b86378cf2 -> 7ff0e5d2f

[SPARK-9900] [MLLIB] User guide for Association Rules

Updates FPM user guide to include Association Rules.

Author: Feynman Liang <fli...@databricks.com>

Closes #8207 from feynmanliang/SPARK-9900-arules.

(cherry picked from commit f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ff0e5d2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ff0e5d2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ff0e5d2

Branch: refs/heads/branch-1.5
Commit: 7ff0e5d2fe07d4a9518ade26b09bcc32f418ca1b
Parents: b86378c
Author: Feynman Liang <fli...@databricks.com>
Authored: Tue Aug 18 12:53:57 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Tue Aug 18 12:54:05 2015 -0700

----------------------------------------------------------------------
 docs/mllib-frequent-pattern-mining.md           | 130 +--
 docs/mllib-guide.md                             |   1 +
 .../mllib/fpm/JavaAssociationRulesSuite.java    |   2 +-
 3 files changed, 118 insertions(+), 15 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/7ff0e5d2/docs/mllib-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index 8ea4389..6c06550 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following (hyper-)parameters:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the
-FP-growth algorithm.
-It take a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type.
+[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth)
+implements the FP-growth algorithm. It take an `RDD` of transactions,
+where each transaction is an `Iterable` of items of a generic type.
 Calling `FPGrowth.run` with transactions returns an
 [`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies. The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.
+
 {% highlight scala %}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}

-val transactions: RDD[Array[String]] = ...
+val transactions: RDD[Array[String]] = sc.parallelize(Seq(
+  "r z h k p",
+  "z y x w v u t s",
+  "s x o n r",
+  "x z y m t s q e",
+  "z",
+  "x z y r q t p")
+  .map(_.split(" ")))

 val fpg = new FPGrowth()
   .setMinSupport(0.2)
@@ -60,29 +72,48 @@ val model = fpg.run(transactions)
 model.freqItemsets.collect().foreach { itemset =>
   println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
 }
+
+val minConfidence = 0.8
+model.generateAssociationRules(minConfidence).collect().foreach { rule =>
+  println(
+    rule.antecedent.mkString("[", ",", "]")
+      + " => " + rule.consequent.mkString("[", ",", "]")
+      + ", " + rule.confidence)
+}
 {% endhighlight %}

 </div>
 <div data-lang="java" markdown="1">

-[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
-FP-growth algorithm.
-It take an `RDD` of transactions, where each transaction is an `Array` of items of a generic type.
-Calling `FPGrowth.run` with transactions returns an
+[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html)
+implements the FP-growth algorithm. It take a `JavaRDD` of
+transactions, where each transaction is an `Array` of items of a generic
+type. Calling `FPGrowth.run` with transactions returns an
 [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies. The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.

 {% highlight java %}
+import java.util.Arrays;
 import java.util.List;

-import com.google.common.base.Joiner;
-
 import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.fpm.AssociationRules;
 import org.apache.spark.mllib.fpm.FPGrowth;
 import org.apache.spark.mllib.fpm.FPGrowthModel;

-JavaRDD<List<String>> transactions = ...
+JavaRDD<List<String>> transactions = sc.parallelize(Arrays.asList(
+  Arrays.asList("r z h
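The `generateAssociationRules` call added in the example above derives rule confidence from itemset frequencies: confidence(X => y) = freq(X ∪ {y}) / freq(X). A minimal pure-Python sketch of that relationship (illustrative only, not MLlib's distributed implementation):

```python
def association_rules(freq_itemsets, min_confidence):
    """Generate single-consequent association rules from frequent itemsets:
    for each frequent itemset, split off one item as the consequent and
    keep the rule if freq(itemset) / freq(antecedent) meets the threshold."""
    freq = {frozenset(items): count for items, count in freq_itemsets}
    rules = []
    for itemset, count in freq.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            if antecedent in freq:
                confidence = count / float(freq[antecedent])
                if confidence >= min_confidence:
                    rules.append((sorted(antecedent), item, confidence))
    return rules

# Toy frequencies: {t} and {s} each appear 3 times, and always together.
itemsets = [(["t"], 3), (["s"], 3), (["t", "s"], 3)]
for rule in association_rules(itemsets, min_confidence=0.8):
    print(rule)
```

Both [s] => t and [t] => s come out with confidence 1.0 here, since the items co-occur in every transaction that contains either.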
spark git commit: [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import
Repository: spark Updated Branches: refs/heads/branch-1.5 ec7079f9c - 9bd2e6f7c [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal pmig...@gmail.com Closes #8284 from stared/spark-10085. (cherry picked from commit 8bae9015b7e7b4528ca2bc5180771cb95d2aac13) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9bd2e6f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9bd2e6f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9bd2e6f7 Branch: refs/heads/branch-1.5 Commit: 9bd2e6f7cbff1835f9abefe26dbe445eaa5b004b Parents: ec7079f Author: Piotr Migdal pmig...@gmail.com Authored: Tue Aug 18 12:59:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:59:36 2015 -0700 -- docs/mllib-linear-methods.md | 2 -- 1 file changed, 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9bd2e6f7/docs/mllib-linear-methods.md -- diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index 07655ba..e9b2d27 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -504,7 +504,6 @@ will in the future. {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel from pyspark.mllib.regression import LabeledPoint -from numpy import array # Load and parse the data def parsePoint(line): @@ -676,7 +675,6 @@ Note that the Python API does not yet support model save/load but will in the fu {% highlight python %} from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel -from numpy import array # Load and parse the data def parsePoint(line): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import
Repository: spark Updated Branches: refs/heads/master 747c2ba80 - 8bae9015b [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal pmig...@gmail.com Closes #8284 from stared/spark-10085. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8bae9015 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8bae9015 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8bae9015 Branch: refs/heads/master Commit: 8bae9015b7e7b4528ca2bc5180771cb95d2aac13 Parents: 747c2ba Author: Piotr Migdal pmig...@gmail.com Authored: Tue Aug 18 12:59:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:59:28 2015 -0700 -- docs/mllib-linear-methods.md | 2 -- 1 file changed, 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8bae9015/docs/mllib-linear-methods.md -- diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index 07655ba..e9b2d27 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -504,7 +504,6 @@ will in the future. {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel from pyspark.mllib.regression import LabeledPoint -from numpy import array # Load and parse the data def parsePoint(line): @@ -676,7 +675,6 @@ Note that the Python API does not yet support model save/load but will in the fu {% highlight python %} from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel -from numpy import array # Load and parse the data def parsePoint(line): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide
Repository: spark Updated Branches: refs/heads/master f5ea39129 -> f4fa61eff [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide Add Python examples for mllib IsotonicRegression user guide Author: Yanbo Liang yblia...@gmail.com Closes #8225 from yanboliang/spark-10029. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4fa61ef Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4fa61ef Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4fa61ef Branch: refs/heads/master Commit: f4fa61effe34dae2f0eab0bef57b2dee220cf92f Parents: f5ea391 Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:55:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:55:36 2015 -0700 -- docs/mllib-isotonic-regression.md | 35 ++ 1 file changed, 35 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f4fa61ef/docs/mllib-isotonic-regression.md -- diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md index 5732bc4..6aa881f 100644 --- a/docs/mllib-isotonic-regression.md +++ b/docs/mllib-isotonic-regression.md @@ -160,4 +160,39 @@ model.save(sc.sc(), "myModelPath"); IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(sc.sc(), "myModelPath"); {% endhighlight %} </div> + +<div data-lang="python" markdown="1"> +Data are read from a file where each line has the format "label,feature", +e.g. "4710.28,500.00". The data are split into training and test sets. +A model is created using the training set, and a mean squared error is calculated from the predicted +labels and real labels in the test set. + +{% highlight python %} +import math +from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel + +data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt") + +# Create label, feature, weight tuples from input data with weight set to default value 1.0. +parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) + (1.0,)) + +# Split data into training (60%) and test (40%) sets. +training, test = parsedData.randomSplit([0.6, 0.4], 11) + +# Create isotonic regression model from training data. +# Isotonic parameter defaults to true so it is only shown for demonstration +model = IsotonicRegression.train(training) + +# Create tuples of predicted and real labels. +predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0])) + +# Calculate mean squared error between predicted and real labels. +meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 2)).mean() +print("Mean Squared Error = " + str(meanSquaredError)) + +# Save and load model +model.save(sc, "myModelPath") +sameModel = IsotonicRegressionModel.load(sc, "myModelPath") +{% endhighlight %} +</div> </div>
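`IsotonicRegression.train` above fits a non-decreasing function to the (label, feature, weight) tuples. The underlying idea can be sketched in plain Python with the pool-adjacent-violators algorithm (PAVA) — a sequential simplification for illustration; MLlib's actual implementation is parallelized and keys on the feature values.

```python
def isotonic_fit(y, weights=None):
    """Least-squares non-decreasing fit to y via pool-adjacent-violators."""
    if weights is None:
        weights = [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for value, w in zip(y, weights):
        blocks.append([value, w, 1])
        # Pool while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)
    return fitted

fitted = isotonic_fit([1, 3, 2, 4])  # -> [1, 2.5, 2.5, 4]
```

The violating pair (3, 2) is pooled into its weighted mean 2.5, which no longer violates the block before it, so the result is monotone.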
spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules
Repository: spark Updated Branches: refs/heads/master c1840a862 - f5ea39129 [SPARK-9900] [MLLIB] User guide for Association Rules Updates FPM user guide to include Association Rules. Author: Feynman Liang fli...@databricks.com Closes #8207 from feynmanliang/SPARK-9900-arules. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5ea3912 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5ea3912 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5ea3912 Branch: refs/heads/master Commit: f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b Parents: c1840a8 Author: Feynman Liang fli...@databricks.com Authored: Tue Aug 18 12:53:57 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:53:57 2015 -0700 -- docs/mllib-frequent-pattern-mining.md | 130 +-- docs/mllib-guide.md | 1 + .../mllib/fpm/JavaAssociationRulesSuite.java| 2 +- 3 files changed, 118 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea3912/docs/mllib-frequent-pattern-mining.md -- diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index 8ea4389..6c06550 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following (hyper-)parameters: div class=codetabs div data-lang=scala markdown=1 -[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the -FP-growth algorithm. -It take a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type. +[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) +implements the FP-growth algorithm. It take an `RDD` of transactions, +where each transaction is an `Iterable` of items of a generic type. 
Calling `FPGrowth.run` with transactions returns an [`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel) -that stores the frequent itemsets with their frequencies. +that stores the frequent itemsets with their frequencies. The following +example illustrates how to mine frequent itemsets and association rules +(see [Association +Rules](mllib-frequent-pattern-mining.html#association-rules) for +details) from `transactions`. + {% highlight scala %} import org.apache.spark.rdd.RDD import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel} -val transactions: RDD[Array[String]] = ... +val transactions: RDD[Array[String]] = sc.parallelize(Seq( + "r z h k p", + "z y x w v u t s", + "s x o n r", + "x z y m t s q e", + "z", + "x z y r q t p") + .map(_.split(" "))) val fpg = new FPGrowth() .setMinSupport(0.2) @@ -60,29 +72,48 @@ val model = fpg.run(transactions) model.freqItemsets.collect().foreach { itemset => println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq) } + +val minConfidence = 0.8 +model.generateAssociationRules(minConfidence).collect().foreach { rule => + println( + rule.antecedent.mkString("[", ",", "]") + + " => " + rule.consequent.mkString("[", ",", "]") + + ", " + rule.confidence) +} {% endhighlight %} </div> <div data-lang="java" markdown="1"> -[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the -FP-growth algorithm. -It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type. -Calling `FPGrowth.run` with transactions returns an +[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) +implements the FP-growth algorithm. It takes a `JavaRDD` of +transactions, where each transaction is an `Array` of items of a generic +type. Calling `FPGrowth.run` with transactions returns an +[`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html) -that stores the frequent itemsets with their frequencies.
The following +example illustrates how to mine frequent itemsets and association rules +(see [Association +Rules](mllib-frequent-pattern-mining.html#association-rules) for +details) from `transactions`. {% highlight java %} +import java.util.Arrays; import java.util.List; -import com.google.common.base.Joiner; - import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.fpm.AssociationRules; import org.apache.spark.mllib.fpm.FPGrowth; import org.apache.spark.mllib.fpm.FPGrowthModel; -JavaRDD<List<String>> transactions = ... +JavaRDD<List<String>> transactions = sc.parallelize(Arrays.asList( + Arrays.asList("r z h k p".split(" ")), + Arrays.asList("z y x w v u t s".split(" ")), + Arrays.asList("s x o n r".split(" ")), + Arrays.asList("z".split(" ")), + Arrays.asList("x z y m t s q e".split(" ")), + Arrays.asList("x z y r q t p".split(" "))));
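The support/confidence semantics used in the examples above can be illustrated without Spark. The sketch below is *not* Spark's FP-growth implementation (which builds an FP-tree to avoid candidate generation); it is a brute-force enumeration over the same six example transactions, with the same `minSupport = 0.2` and `minConfidence = 0.8`, and — like Spark's `AssociationRules` — it only emits rules with a single-item consequent.

```python
from itertools import combinations

# The six transactions from the example above (space-separated items).
transactions = [t.split(" ") for t in [
    "r z h k p", "z y x w v u t s", "s x o n r",
    "x z y m t s q e", "z", "x z y r q t p"]]

def frequent_itemsets(transactions, min_support):
    """Brute-force: count every candidate itemset, keep those whose
    support (fraction of transactions containing it) >= min_support."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted(frozenset().union(*sets))
    freq = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            cand = frozenset(cand)
            count = sum(1 for s in sets if cand <= s)
            if count / n >= min_support:
                freq[cand] = count
                found = True
        if not found:
            break  # Apriori property: no frequent k-itemset => none larger.
    return freq

def association_rules(freq, min_confidence):
    """Emit (antecedent, consequent_item, confidence) for rules whose
    confidence = support(itemset) / support(antecedent) >= min_confidence."""
    rules = []
    for itemset, count in freq.items():
        for item in itemset:
            antecedent = itemset - {item}
            if antecedent and antecedent in freq:
                confidence = count / freq[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, item, confidence))
    return rules

freq = frequent_itemsets(transactions, min_support=0.2)
rules = association_rules(freq, min_confidence=0.8)
```

For instance, the rule `{t} => z` comes out with confidence 1.0, because all three transactions containing `t` also contain `z`.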
spark git commit: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree
Repository: spark Updated Branches: refs/heads/branch-1.5 8b0df5a5e - 56f4da263 [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree Added since tags to mllib.tree Author: Bryan Cutler bjcut...@us.ibm.com Closes #7380 from BryanCutler/sinceTag-mllibTree-8924. (cherry picked from commit 1dbffba37a84c62202befd3911d25888f958191d) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56f4da26 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56f4da26 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56f4da26 Branch: refs/heads/branch-1.5 Commit: 56f4da2633aab6d1f25c03b1cf567c2c68374fb5 Parents: 8b0df5a Author: Bryan Cutler bjcut...@us.ibm.com Authored: Tue Aug 18 14:58:30 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 14:58:37 2015 -0700 -- .../apache/spark/mllib/tree/DecisionTree.scala | 13 +++ .../spark/mllib/tree/GradientBoostedTrees.scala | 10 ++ .../apache/spark/mllib/tree/RandomForest.scala | 10 ++ .../spark/mllib/tree/configuration/Algo.scala | 1 + .../tree/configuration/BoostingStrategy.scala | 6 .../mllib/tree/configuration/FeatureType.scala | 1 + .../tree/configuration/QuantileStrategy.scala | 1 + .../mllib/tree/configuration/Strategy.scala | 20 ++- .../spark/mllib/tree/impurity/Entropy.scala | 4 +++ .../apache/spark/mllib/tree/impurity/Gini.scala | 4 +++ .../spark/mllib/tree/impurity/Impurity.scala| 3 ++ .../spark/mllib/tree/impurity/Variance.scala| 4 +++ .../spark/mllib/tree/loss/AbsoluteError.scala | 2 ++ .../apache/spark/mllib/tree/loss/LogLoss.scala | 2 ++ .../org/apache/spark/mllib/tree/loss/Loss.scala | 3 ++ .../apache/spark/mllib/tree/loss/Losses.scala | 6 .../spark/mllib/tree/loss/SquaredError.scala| 2 ++ .../mllib/tree/model/DecisionTreeModel.scala| 22 .../mllib/tree/model/InformationGainStats.scala | 1 + .../apache/spark/mllib/tree/model/Node.scala| 3 ++ 
.../apache/spark/mllib/tree/model/Predict.scala | 1 + .../apache/spark/mllib/tree/model/Split.scala | 1 + .../mllib/tree/model/treeEnsembleModels.scala | 37 .../org/apache/spark/mllib/tree/package.scala | 1 + 24 files changed, 157 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/56f4da26/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala index cecd1fe..e5200b8 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala @@ -43,6 +43,7 @@ import org.apache.spark.util.random.XORShiftRandom * @param strategy The configuration parameters for the tree algorithm which specify the type * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. + * @since 1.0.0 */ @Experimental class DecisionTree (private val strategy: Strategy) extends Serializable with Logging { @@ -53,6 +54,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * Method to train a decision tree model over an RDD * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] * @return DecisionTreeModel that can be used for prediction + * @since 1.2.0 */ def run(input: RDD[LabeledPoint]): DecisionTreeModel = { // Note: random seed will not be used since numTrees = 1. @@ -62,6 +64,9 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo } } +/** + * @since 1.0.0 + */ object DecisionTree extends Serializable with Logging { /** @@ -79,6 +84,7 @@ object DecisionTree extends Serializable with Logging { * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. 
* @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = { new DecisionTree(strategy).run(input) @@ -100,6 +106,7 @@ object DecisionTree extends Serializable with Logging { * @param maxDepth Maximum depth of the tree. * E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. * @return DecisionTreeModel that can be used for prediction + * @since
spark git commit: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree
Repository: spark Updated Branches: refs/heads/master 492ac1fac - 1dbffba37 [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree Added since tags to mllib.tree Author: Bryan Cutler bjcut...@us.ibm.com Closes #7380 from BryanCutler/sinceTag-mllibTree-8924. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1dbffba3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1dbffba3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1dbffba3 Branch: refs/heads/master Commit: 1dbffba37a84c62202befd3911d25888f958191d Parents: 492ac1f Author: Bryan Cutler bjcut...@us.ibm.com Authored: Tue Aug 18 14:58:30 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 14:58:30 2015 -0700 -- .../apache/spark/mllib/tree/DecisionTree.scala | 13 +++ .../spark/mllib/tree/GradientBoostedTrees.scala | 10 ++ .../apache/spark/mllib/tree/RandomForest.scala | 10 ++ .../spark/mllib/tree/configuration/Algo.scala | 1 + .../tree/configuration/BoostingStrategy.scala | 6 .../mllib/tree/configuration/FeatureType.scala | 1 + .../tree/configuration/QuantileStrategy.scala | 1 + .../mllib/tree/configuration/Strategy.scala | 20 ++- .../spark/mllib/tree/impurity/Entropy.scala | 4 +++ .../apache/spark/mllib/tree/impurity/Gini.scala | 4 +++ .../spark/mllib/tree/impurity/Impurity.scala| 3 ++ .../spark/mllib/tree/impurity/Variance.scala| 4 +++ .../spark/mllib/tree/loss/AbsoluteError.scala | 2 ++ .../apache/spark/mllib/tree/loss/LogLoss.scala | 2 ++ .../org/apache/spark/mllib/tree/loss/Loss.scala | 3 ++ .../apache/spark/mllib/tree/loss/Losses.scala | 6 .../spark/mllib/tree/loss/SquaredError.scala| 2 ++ .../mllib/tree/model/DecisionTreeModel.scala| 22 .../mllib/tree/model/InformationGainStats.scala | 1 + .../apache/spark/mllib/tree/model/Node.scala| 3 ++ .../apache/spark/mllib/tree/model/Predict.scala | 1 + .../apache/spark/mllib/tree/model/Split.scala | 1 + .../mllib/tree/model/treeEnsembleModels.scala 
| 37 .../org/apache/spark/mllib/tree/package.scala | 1 + 24 files changed, 157 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1dbffba3/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala index cecd1fe..e5200b8 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala @@ -43,6 +43,7 @@ import org.apache.spark.util.random.XORShiftRandom * @param strategy The configuration parameters for the tree algorithm which specify the type * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. + * @since 1.0.0 */ @Experimental class DecisionTree (private val strategy: Strategy) extends Serializable with Logging { @@ -53,6 +54,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * Method to train a decision tree model over an RDD * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] * @return DecisionTreeModel that can be used for prediction + * @since 1.2.0 */ def run(input: RDD[LabeledPoint]): DecisionTreeModel = { // Note: random seed will not be used since numTrees = 1. @@ -62,6 +64,9 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo } } +/** + * @since 1.0.0 + */ object DecisionTree extends Serializable with Logging { /** @@ -79,6 +84,7 @@ object DecisionTree extends Serializable with Logging { * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. 
* @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = { new DecisionTree(strategy).run(input) @@ -100,6 +106,7 @@ object DecisionTree extends Serializable with Logging { * @param maxDepth Maximum depth of the tree. * E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. * @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train( input: RDD[LabeledPoint], @@ -127,6 +134,7 @@ object DecisionTree extends Serializable
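The `Entropy` and `Gini` classes annotated in this commit are the impurity measures a decision tree minimizes when choosing splits. As a reminder of what they compute, here is a plain-Python restatement of the two formulas over per-class counts (a sketch only; MLlib's `Entropy` uses log base 2, which this follows):

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum_i p_i^2 over class probabilities p_i."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy impurity: -sum_i p_i * log2(p_i), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

impurity = gini([5, 5])  # -> 0.5: a perfectly mixed binary node
```

A pure node (all samples in one class) scores 0 under both measures; splits are chosen to maximize the impurity decrease from parent to children.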
spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide
Repository: spark Updated Branches: refs/heads/master f4fa61eff -> 747c2ba80 [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide Add Python example for mllib LDAModel user guide Author: Yanbo Liang yblia...@gmail.com Closes #8227 from yanboliang/spark-10032. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/747c2ba8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/747c2ba8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/747c2ba8 Branch: refs/heads/master Commit: 747c2ba8006d5b86f3be8dfa9ace639042a35628 Parents: f4fa61e Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:56:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:56:36 2015 -0700 -- docs/mllib-clustering.md | 28 1 file changed, 28 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/747c2ba8/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index bb875ae..fd9ab25 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -564,6 +564,34 @@ public class JavaLDAExample { {% endhighlight %} </div> +<div data-lang="python" markdown="1"> +{% highlight python %} +from pyspark.mllib.clustering import LDA, LDAModel +from pyspark.mllib.linalg import Vectors + +# Load and parse the data +data = sc.textFile("data/mllib/sample_lda_data.txt") +parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')])) +# Index documents with unique IDs +corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache() + +# Cluster the documents into three topics using LDA +ldaModel = LDA.train(corpus, k=3) + +# Output topics. Each is a distribution over words (matching word count vectors) +print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):") +topics = ldaModel.topicsMatrix() +for topic in range(3): +print("Topic " + str(topic) + ":") +for word in range(0, ldaModel.vocabSize()): +print(" " + str(topics[word][topic])) + +# Save and load model +ldaModel.save(sc, "myModelPath") +sameModel = LDAModel.load(sc, "myModelPath") +{% endhighlight %} +</div> + </div> ## Streaming k-means
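`topicsMatrix()` in the example above returns a vocabSize × k matrix whose columns are unnormalized topic weights. To read each topic as a probability distribution over the vocabulary, each column can be normalized — sketched here in plain Python on a hypothetical 4-word vocabulary with 2 topics (the matrix values below are made up, not the output of the example):

```python
def topic_distributions(topics_matrix):
    """Normalize each column of a vocabSize x k matrix to sum to 1."""
    vocab_size = len(topics_matrix)
    k = len(topics_matrix[0])
    dists = []
    for topic in range(k):
        column = [topics_matrix[word][topic] for word in range(vocab_size)]
        total = sum(column)
        dists.append([weight / total for weight in column])
    return dists

# Hypothetical topic matrix: rows are words, columns are topics.
topics = [[10.0, 0.0],
          [5.0, 1.0],
          [0.0, 6.0],
          [5.0, 3.0]]
dists = topic_distributions(topics)  # dists[0] == [0.5, 0.25, 0.0, 0.25]
```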
spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide
Repository: spark Updated Branches: refs/heads/branch-1.5 80debff12 -> ec7079f9c [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide Add Python example for mllib LDAModel user guide Author: Yanbo Liang yblia...@gmail.com Closes #8227 from yanboliang/spark-10032. (cherry picked from commit 747c2ba8006d5b86f3be8dfa9ace639042a35628) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec7079f9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec7079f9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec7079f9 Branch: refs/heads/branch-1.5 Commit: ec7079f9c94cb98efdac6f92b7c85efb0e67492e Parents: 80debff Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:56:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:56:43 2015 -0700 -- docs/mllib-clustering.md | 28 1 file changed, 28 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ec7079f9/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index bb875ae..fd9ab25 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -564,6 +564,34 @@ public class JavaLDAExample { {% endhighlight %} </div> +<div data-lang="python" markdown="1"> +{% highlight python %} +from pyspark.mllib.clustering import LDA, LDAModel +from pyspark.mllib.linalg import Vectors + +# Load and parse the data +data = sc.textFile("data/mllib/sample_lda_data.txt") +parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')])) +# Index documents with unique IDs +corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache() + +# Cluster the documents into three topics using LDA +ldaModel = LDA.train(corpus, k=3) + +# Output topics. Each is a distribution over words (matching word count vectors) +print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):") +topics = ldaModel.topicsMatrix() +for topic in range(3): +print("Topic " + str(topic) + ":") +for word in range(0, ldaModel.vocabSize()): +print(" " + str(topics[word][topic])) + +# Save and load model +ldaModel.save(sc, "myModelPath") +sameModel = LDAModel.load(sc, "myModelPath") +{% endhighlight %} +</div> + </div> ## Streaming k-means
spark git commit: [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide
Repository: spark Updated Branches: refs/heads/branch-1.5 7ff0e5d2f -> 80debff12 [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide Add Python examples for mllib IsotonicRegression user guide Author: Yanbo Liang yblia...@gmail.com Closes #8225 from yanboliang/spark-10029. (cherry picked from commit f4fa61effe34dae2f0eab0bef57b2dee220cf92f) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80debff1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80debff1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80debff1 Branch: refs/heads/branch-1.5 Commit: 80debff123e0b5dcc4e6f5899753a736de2c8e75 Parents: 7ff0e5d Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:55:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:55:42 2015 -0700 -- docs/mllib-isotonic-regression.md | 35 ++ 1 file changed, 35 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80debff1/docs/mllib-isotonic-regression.md -- diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md index 5732bc4..6aa881f 100644 --- a/docs/mllib-isotonic-regression.md +++ b/docs/mllib-isotonic-regression.md @@ -160,4 +160,39 @@ model.save(sc.sc(), "myModelPath"); IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(sc.sc(), "myModelPath"); {% endhighlight %} </div> + +<div data-lang="python" markdown="1"> +Data are read from a file where each line has the format "label,feature", +e.g. "4710.28,500.00". The data are split into training and test sets. +A model is created using the training set, and a mean squared error is calculated from the predicted +labels and real labels in the test set. + +{% highlight python %} +import math +from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel + +data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt") + +# Create label, feature, weight tuples from input data with weight set to default value 1.0. +parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) + (1.0,)) + +# Split data into training (60%) and test (40%) sets. +training, test = parsedData.randomSplit([0.6, 0.4], 11) + +# Create isotonic regression model from training data. +# Isotonic parameter defaults to true so it is only shown for demonstration +model = IsotonicRegression.train(training) + +# Create tuples of predicted and real labels. +predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0])) + +# Calculate mean squared error between predicted and real labels. +meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 2)).mean() +print("Mean Squared Error = " + str(meanSquaredError)) + +# Save and load model +model.save(sc, "myModelPath") +sameModel = IsotonicRegressionModel.load(sc, "myModelPath") +{% endhighlight %} +</div> </div>
spark git commit: [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg
Repository: spark Updated Branches: refs/heads/master fdaf17f63 - 088b11ec5 [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.Samavihome Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.local Closes #7729 from sabhyankar/branch_8920. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/088b11ec Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/088b11ec Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/088b11ec Branch: refs/heads/master Commit: 088b11ec5949e135cb3db2a1ce136837e046c288 Parents: fdaf17f Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.Samavihome Authored: Mon Aug 17 16:00:23 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 16:00:23 2015 -0700 -- .../apache/spark/mllib/linalg/Matrices.scala| 63 .../linalg/SingularValueDecomposition.scala | 1 + .../org/apache/spark/mllib/linalg/Vectors.scala | 60 +++ .../mllib/linalg/distributed/BlockMatrix.scala | 43 +++-- .../linalg/distributed/CoordinateMatrix.scala | 28 +++-- .../linalg/distributed/DistributedMatrix.scala | 1 + .../linalg/distributed/IndexedRowMatrix.scala | 24 +++- .../mllib/linalg/distributed/RowMatrix.scala| 24 +++- 8 files changed, 227 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/088b11ec/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala index 1139ce3..dfa8910 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala @@ -227,6 +227,7 @@ private[spark] class MatrixUDT extends UserDefinedType[Matrix] { * @param values matrix entries in column major if not transposed or in row major otherwise * @param isTransposed whether the matrix is 
transposed. If true, `values` stores the matrix in * row major. + * @since 1.0.0 */ @SQLUserDefinedType(udt = classOf[MatrixUDT]) class DenseMatrix( @@ -252,6 +253,7 @@ class DenseMatrix( * @param numRows number of rows * @param numCols number of columns * @param values matrix entries in column major + * @since 1.3.0 */ def this(numRows: Int, numCols: Int, values: Array[Double]) = this(numRows, numCols, values, false) @@ -276,6 +278,9 @@ class DenseMatrix( private[mllib] def apply(i: Int): Double = values(i) + /** + * @since 1.3.0 + */ override def apply(i: Int, j: Int): Double = values(index(i, j)) private[mllib] def index(i: Int, j: Int): Int = { @@ -286,6 +291,9 @@ class DenseMatrix( values(index(i, j)) = v } + /** + * @since 1.4.0 + */ override def copy: DenseMatrix = new DenseMatrix(numRows, numCols, values.clone()) private[spark] def map(f: Double = Double) = new DenseMatrix(numRows, numCols, values.map(f), @@ -301,6 +309,9 @@ class DenseMatrix( this } + /** + * @since 1.3.0 + */ override def transpose: DenseMatrix = new DenseMatrix(numCols, numRows, values, !isTransposed) private[spark] override def foreachActive(f: (Int, Int, Double) = Unit): Unit = { @@ -331,13 +342,20 @@ class DenseMatrix( } } + /** + * @since 1.5.0 + */ override def numNonzeros: Int = values.count(_ != 0) + /** + * @since 1.5.0 + */ override def numActives: Int = values.length /** * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed * set to false. + * @since 1.3.0 */ def toSparse: SparseMatrix = { val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble @@ -365,6 +383,7 @@ class DenseMatrix( /** * Factory methods for [[org.apache.spark.mllib.linalg.DenseMatrix]]. 
+ * @since 1.3.0 */ object DenseMatrix { @@ -373,6 +392,7 @@ object DenseMatrix { * @param numRows number of rows of the matrix * @param numCols number of columns of the matrix * @return `DenseMatrix` with size `numRows` x `numCols` and values of zeros + * @since 1.3.0 */ def zeros(numRows: Int, numCols: Int): DenseMatrix = { require(numRows.toLong * numCols <= Int.MaxValue, @@ -385,6 +405,7 @@ object DenseMatrix { * @param numRows number of rows of the matrix * @param numCols number of columns of the matrix * @return `DenseMatrix` with size `numRows` x `numCols` and values of ones + * @since 1.3.0 */ def ones(numRows: Int, numCols: Int): DenseMatrix = { require(numRows.toLong * numCols <= Int.MaxValue, @@ -396,6 +417,7 @@ object
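The `@since` tags being added above are Scaladoc markers recording the release in which each public API first appeared. PySpark carries the same idea as a `since` decorator that appends a `versionadded` note to a function's docstring; a rough plain-Python sketch of that pattern (illustrative only, not the actual pyspark implementation):

```python
def since(version):
    # Sketch of a version-tagging decorator in the spirit of PySpark's
    # `since`: append a ".. versionadded::" note to the docstring so the
    # generated API docs show when the member was introduced.
    def decorator(f):
        f.__doc__ = (f.__doc__ or "") + "\n\n.. versionadded:: " + version
        return f
    return decorator

@since("1.3.0")
def zeros(num_rows, num_cols):
    """Return a num_rows x num_cols matrix of zeros."""
    return [[0.0] * num_cols for _ in range(num_rows)]
```

The decorator leaves the function's behavior untouched; only the docstring gains the version note.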
spark git commit: [SPARK-7707] User guide and example code for KernelDensity
Repository: spark Updated Branches: refs/heads/branch-1.5 18b3d11f7 -> 5de0ffbd0 [SPARK-7707] User guide and example code for KernelDensity Author: Sandy Ryza sa...@cloudera.com Closes #8230 from sryza/sandy-spark-7707. (cherry picked from commit f9d1a92aa1bac4494022d78559b871149579e6e8) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5de0ffbd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5de0ffbd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5de0ffbd Branch: refs/heads/branch-1.5 Commit: 5de0ffbd0e0aef170171cec8808eb4ec1ba79b0f Parents: 18b3d11 Author: Sandy Ryza sa...@cloudera.com Authored: Mon Aug 17 17:57:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:58:06 2015 -0700 -- docs/mllib-statistics.md | 77 +++ 1 file changed, 77 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5de0ffbd/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index be04d0b..80a9d06 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 100L, 10) v = u.map(lambda x: 1.0 + 2.0 * x) {% endhighlight %} /div +/div + +## Kernel density estimation + +[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique +useful for visualizing empirical probability distributions without requiring assumptions about the +particular distribution that the observed samples are drawn from. It computes an estimate of the +probability density function of a random variable, evaluated at a given set of points. It achieves +this estimate by expressing the PDF of the empirical distribution at a particular point as the +mean of PDFs of normal distributions centered around each of the samples.
+ +div class=codetabs + +div data-lang=scala markdown=1 +[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight scala %} +import org.apache.spark.mllib.stat.KernelDensity +import org.apache.spark.rdd.RDD + +val data: RDD[Double] = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +val kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0) + +// Find density estimates for the given values +val densities = kd.estimate(Array(-1.0, 2.0, 5.0)) +{% endhighlight %} +/div + +div data-lang=java markdown=1 +[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight java %} +import org.apache.spark.mllib.stat.KernelDensity; +import org.apache.spark.rdd.RDD; + +RDD<Double> data = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +KernelDensity kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0); + +// Find density estimates for the given values +double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0}); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight python %} +from pyspark.mllib.stat import KernelDensity + +data = ...
# an RDD of sample data + +# Construct the density estimator with the sample data and a standard deviation for the Gaussian +# kernels +kd = KernelDensity() +kd.setSample(data) +kd.setBandwidth(3.0) + +# Find density estimates for the given values +densities = kd.estimate([-1.0, 2.0, 5.0]) +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
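The description in the guide translates directly into a few lines of plain Python: the density estimate at each query point is just the mean of Gaussian PDFs centered on the samples, with the bandwidth playing the role of their standard deviation. A minimal sketch under that reading (not the Spark implementation, which distributes the computation over an RDD; function names here are illustrative):

```python
import math

def normal_pdf(x, mean, std):
    # PDF of a normal distribution with the given mean and standard deviation
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def kde_estimate(samples, points, bandwidth):
    # Density at each query point = mean of the Gaussian PDFs centered on
    # each sample, using the bandwidth as the common standard deviation
    return [sum(normal_pdf(p, s, bandwidth) for s in samples) / len(samples)
            for p in points]

samples = [1.0, 2.0, 3.0]
densities = kde_estimate(samples, [-1.0, 2.0, 5.0], bandwidth=3.0)
```

With samples clustered around 2.0, the estimate at 2.0 exceeds the estimates at -1.0 and 5.0, which are equal by symmetry.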
spark git commit: [SPARK-7808] [ML] add package doc for ml.feature
Repository: spark Updated Branches: refs/heads/branch-1.5 bfb4c8425 - 35542504c [SPARK-7808] [ML] add package doc for ml.feature This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng m...@databricks.com Closes #8260 from mengxr/SPARK-7808. (cherry picked from commit e290029a356222bddf4da1be0525a221a5a1630b) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/35542504 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/35542504 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/35542504 Branch: refs/heads/branch-1.5 Commit: 35542504c51c5754db7812cf7bec674a957e66ad Parents: bfb4c84 Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 17 19:40:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 19:40:58 2015 -0700 -- .../org/apache/spark/ml/feature/package.scala | 89 1 file changed, 89 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/35542504/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala new file mode 100644 index 000..4571ab2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel, VectorAssembler} +import org.apache.spark.sql.DataFrame + +/** + * == Feature transformers == + * + * The `ml.feature` package provides common feature transformers that help convert raw data or + * features into more suitable forms for model fitting. + * Most feature transformers are implemented as [[Transformer]]s, which transform one [[DataFrame]] + * into another, e.g., [[HashingTF]]. + * Some feature transformers are implemented as [[Estimator]]s, because the transformation requires + * some aggregated information of the dataset, e.g., document frequencies in [[IDF]]. + * For those feature transformers, calling [[Estimator!.fit]] is required to obtain the model first, + * e.g., [[IDFModel]], in order to apply transformation. + * The transformation is usually done by appending new columns to the input [[DataFrame]], so all + * input columns are carried over. + * + * We try to make each transformer minimal, so it becomes flexible to assemble feature + * transformation pipelines. + * [[Pipeline]] can be used to chain feature transformers, and [[VectorAssembler]] can be used to + * combine multiple feature transformations, for example: + * + * {{{ + * import org.apache.spark.ml.feature._ + * import org.apache.spark.ml.Pipeline + * + * // a DataFrame with three columns: id (integer), text (string), and rating (double). 
+ * val df = sqlContext.createDataFrame(Seq( + * (0, "Hi I heard about Spark", 3.0), + * (1, "I wish Java could use case classes", 4.0), + * (2, "Logistic regression models are neat", 4.0) + * )).toDF("id", "text", "rating") + * + * // define feature transformers + * val tok = new RegexTokenizer() + * .setInputCol("text") + * .setOutputCol("words") + * val sw = new StopWordsRemover() + * .setInputCol("words") + * .setOutputCol("filtered_words") + * val tf = new HashingTF() + * .setInputCol("filtered_words") + * .setOutputCol("tf") + * .setNumFeatures(1) + * val idf = new IDF() + * .setInputCol("tf") + * .setOutputCol("tf_idf") + * val assembler = new VectorAssembler() + * .setInputCols(Array("tf_idf", "rating")) + * .setOutputCol("features") + * + * // assemble and fit the feature transformation pipeline + * val pipeline = new Pipeline() + * .setStages(Array(tok, sw, tf, idf, assembler)) + * val model = pipeline.fit(df) + * + * // save
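The Transformer-versus-Estimator distinction described in the package doc can be illustrated without Spark: a stateless transformer maps rows directly, while an estimator such as IDF must first aggregate statistics (document frequencies) over the whole dataset before it can transform anything. A hedged plain-Python sketch (class names are hypothetical, and the hashing step of HashingTF is replaced by plain term counts):

```python
import math

class TermCounter:
    """Transformer-style: stateless, maps each tokenized document to term counts."""
    def transform(self, docs):
        return [{t: doc.count(t) for t in set(doc)} for doc in docs]

class Idf:
    """Estimator-style: fit() scans the whole dataset for document
    frequencies, then returns a model that applies the weighting."""
    def fit(self, term_counts):
        n = len(term_counts)
        df = {}
        for tc in term_counts:
            for t in tc:
                df[t] = df.get(t, 0) + 1
        # Smoothed IDF, as in Spark's IDF: log((n + 1) / (df + 1))
        return IdfModel({t: math.log((n + 1.0) / (d + 1.0)) for t, d in df.items()})

class IdfModel:
    def __init__(self, idf):
        self.idf = idf
    def transform(self, term_counts):
        return [{t: c * self.idf[t] for t, c in tc.items()} for tc in term_counts]

docs = [["spark", "is", "fast"], ["spark", "mllib"]]
tf = TermCounter().transform(docs)      # no fitting needed
tf_idf = Idf().fit(tf).transform(tf)    # fit first, then transform
```

A term that occurs in every document ("spark") gets IDF weight zero, while a rarer term ("mllib") keeps a positive weight — exactly the aggregated information a stateless transformer could not compute.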
spark git commit: [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test
Repository: spark Updated Branches: refs/heads/branch-1.5 5de0ffbd0 - 9740d43d3 [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test added doc examples for python. Author: jose.cambronero jose.cambron...@cloudera.com Closes #8154 from josepablocam/spark_9902. (cherry picked from commit c90c605dc6a876aef3cc204ac15cd65bab9743ad) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9740d43d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9740d43d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9740d43d Branch: refs/heads/branch-1.5 Commit: 9740d43d3b5e1ca64f39515612e937f640eb436e Parents: 5de0ffb Author: jose.cambronero jose.cambron...@cloudera.com Authored: Mon Aug 17 19:09:45 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 19:09:51 2015 -0700 -- docs/mllib-statistics.md | 51 +++ 1 file changed, 47 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9740d43d/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index 80a9d06..6acfc71 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -438,22 +438,65 @@ run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstra and interpret the hypothesis tests. {% highlight scala %} -import org.apache.spark.SparkContext -import org.apache.spark.mllib.stat.Statistics._ +import org.apache.spark.mllib.stat.Statistics val data: RDD[Double] = ... 
// an RDD of sample data // run a KS test for the sample versus a standard normal distribution val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1) println(testResult) // summary of the test including the p-value, test statistic, - // and null hypothesis - // if our p-value indicates significance, we can reject the null hypothesis +// and null hypothesis +// if our p-value indicates significance, we can reject the null hypothesis // perform a KS test using a cumulative distribution function of our making val myCDF: Double => Double = ... val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF) {% endhighlight %} /div + +div data-lang=java markdown=1 +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to +run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run +and interpret the hypothesis tests. + +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaDoubleRDD; +import org.apache.spark.api.java.JavaSparkContext; + +import org.apache.spark.mllib.stat.Statistics; +import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult; + +JavaSparkContext jsc = ... +JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...)); +KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0); +// summary of the test including the p-value, test statistic, +// and null hypothesis +// if our p-value indicates significance, we can reject the null hypothesis +System.out.println(testResult); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to +run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run +and interpret the hypothesis tests. + +{% highlight python %} +from pyspark.mllib.stat import Statistics + +parallelData = sc.parallelize([1.0, 2.0, ...
]) + +# run a KS test for the sample versus a standard normal distribution +testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1) +print(testResult) # summary of the test including the p-value, test statistic, + # and null hypothesis + # if our p-value indicates significance, we can reject the null hypothesis +# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with +# a lambda to calculate the CDF is not made available in the Python API +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
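The statistic behind the test in these examples is simple to state: the one-sample KS statistic is the largest absolute gap between the empirical CDF of the sample and the hypothesized CDF. A minimal plain-Python sketch against a standard normal (illustrative only; the real implementation also derives a p-value, which this sketch omits):

```python
import math

def std_normal_cdf(x):
    # CDF of the standard normal distribution, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(samples, cdf):
    # Largest gap between the empirical CDF and the hypothesized CDF.
    # The empirical CDF jumps at each sorted sample, so the supremum is
    # attained just before or just after one of the jumps.
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = cdf(x)
        d = max(d, abs((i + 1) / n - c), abs(i / n - c))
    return d

data = [0.2, -0.5, 1.0, 0.3, -1.2]
d_stat = ks_statistic(data, std_normal_cdf)
```

Data far from the hypothesized distribution drives the statistic toward 1; data drawn from it keeps the statistic small, which is what the p-value then quantifies.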
spark git commit: [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test
Repository: spark Updated Branches: refs/heads/master f9d1a92aa - c90c605dc [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test added doc examples for python. Author: jose.cambronero jose.cambron...@cloudera.com Closes #8154 from josepablocam/spark_9902. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c90c605d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c90c605d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c90c605d Branch: refs/heads/master Commit: c90c605dc6a876aef3cc204ac15cd65bab9743ad Parents: f9d1a92 Author: jose.cambronero jose.cambron...@cloudera.com Authored: Mon Aug 17 19:09:45 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 19:09:45 2015 -0700 -- docs/mllib-statistics.md | 51 +++ 1 file changed, 47 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c90c605d/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index 80a9d06..6acfc71 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -438,22 +438,65 @@ run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstra and interpret the hypothesis tests. {% highlight scala %} -import org.apache.spark.SparkContext -import org.apache.spark.mllib.stat.Statistics._ +import org.apache.spark.mllib.stat.Statistics val data: RDD[Double] = ... 
// an RDD of sample data // run a KS test for the sample versus a standard normal distribution val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1) println(testResult) // summary of the test including the p-value, test statistic, - // and null hypothesis - // if our p-value indicates significance, we can reject the null hypothesis +// and null hypothesis +// if our p-value indicates significance, we can reject the null hypothesis // perform a KS test using a cumulative distribution function of our making val myCDF: Double => Double = ... val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF) {% endhighlight %} /div + +div data-lang=java markdown=1 +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to +run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run +and interpret the hypothesis tests. + +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaDoubleRDD; +import org.apache.spark.api.java.JavaSparkContext; + +import org.apache.spark.mllib.stat.Statistics; +import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult; + +JavaSparkContext jsc = ... +JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...)); +KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0); +// summary of the test including the p-value, test statistic, +// and null hypothesis +// if our p-value indicates significance, we can reject the null hypothesis +System.out.println(testResult); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to +run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run +and interpret the hypothesis tests. + +{% highlight python %} +from pyspark.mllib.stat import Statistics + +parallelData = sc.parallelize([1.0, 2.0, ...
]) + +# run a KS test for the sample versus a standard normal distribution +testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1) +print(testResult) # summary of the test including the p-value, test statistic, + # and null hypothesis + # if our p-value indicates significance, we can reject the null hypothesis +# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with +# a lambda to calculate the CDF is not made available in the Python API +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing
Repository: spark Updated Branches: refs/heads/master 772e7c18f - fdaf17f63 [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing mengxr jkbradley Author: Feynman Liang fli...@databricks.com Closes #8255 from feynmanliang/SPARK-10068. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fdaf17f6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fdaf17f6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fdaf17f6 Branch: refs/heads/master Commit: fdaf17f63f751f02623414fbc7d0a2f545364050 Parents: 772e7c1 Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 15:42:14 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 15:42:14 2015 -0700 -- docs/mllib-guide.md | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fdaf17f6/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index eea864e..e8000ff 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -23,19 +23,19 @@ This lists functionality included in `spark.mllib`, the main MLlib API. 
* [Data types](mllib-data-types.html) * [Basic statistics](mllib-statistics.html) - * summary statistics - * correlations - * stratified sampling - * hypothesis testing - * random data generation + * [summary statistics](mllib-statistics.html#summary-statistics) + * [correlations](mllib-statistics.html#correlations) + * [stratified sampling](mllib-statistics.html#stratified-sampling) + * [hypothesis testing](mllib-statistics.html#hypothesis-testing) + * [random data generation](mllib-statistics.html#random-data-generation) * [Classification and regression](mllib-classification-regression.html) * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html) * [naive Bayes](mllib-naive-bayes.html) * [decision trees](mllib-decision-tree.html) - * [ensembles of trees](mllib-ensembles.html) (Random Forests and Gradient-Boosted Trees) + * [ensembles of trees (Random Forests and Gradient-Boosted Trees)](mllib-ensembles.html) * [isotonic regression](mllib-isotonic-regression.html) * [Collaborative filtering](mllib-collaborative-filtering.html) - * alternating least squares (ALS) + * [alternating least squares (ALS)](mllib-collaborative-filtering.html#collaborative-filtering) * [Clustering](mllib-clustering.html) * [k-means](mllib-clustering.html#k-means) * [Gaussian mixture](mllib-clustering.html#gaussian-mixture) @@ -43,19 +43,19 @@ This lists functionality included in `spark.mllib`, the main MLlib API. 
* [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda) * [streaming k-means](mllib-clustering.html#streaming-k-means) * [Dimensionality reduction](mllib-dimensionality-reduction.html) - * singular value decomposition (SVD) - * principal component analysis (PCA) + * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd) + * [principal component analysis (PCA)](mllib-dimensionality-reduction.html#principal-component-analysis-pca) * [Feature extraction and transformation](mllib-feature-extraction.html) * [Frequent pattern mining](mllib-frequent-pattern-mining.html) - * FP-growth + * [FP-growth](mllib-frequent-pattern-mining.html#fp-growth) * [Evaluation Metrics](mllib-evaluation-metrics.html) * [Optimization (developer)](mllib-optimization.html) - * stochastic gradient descent - * limited-memory BFGS (L-BFGS) + * [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd) + * [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs) * [PMML model export](mllib-pmml-model-export.html) MLlib is under active development. -The APIs marked `Experimental`/`DeveloperApi` may change in future releases, +The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. # spark.ml: high-level APIs for ML pipelines - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing
Repository: spark Updated Branches: refs/heads/branch-1.5 f77eaaf34 - bb3bb2a48 [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing mengxr jkbradley Author: Feynman Liang fli...@databricks.com Closes #8255 from feynmanliang/SPARK-10068. (cherry picked from commit fdaf17f63f751f02623414fbc7d0a2f545364050) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb3bb2a4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb3bb2a4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb3bb2a4 Branch: refs/heads/branch-1.5 Commit: bb3bb2a48ee32a5de4637a73dd11930c72f9c77e Parents: f77eaaf Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 15:42:14 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 15:42:21 2015 -0700 -- docs/mllib-guide.md | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bb3bb2a4/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index eea864e..e8000ff 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -23,19 +23,19 @@ This lists functionality included in `spark.mllib`, the main MLlib API. 
* [Data types](mllib-data-types.html) * [Basic statistics](mllib-statistics.html) - * summary statistics - * correlations - * stratified sampling - * hypothesis testing - * random data generation + * [summary statistics](mllib-statistics.html#summary-statistics) + * [correlations](mllib-statistics.html#correlations) + * [stratified sampling](mllib-statistics.html#stratified-sampling) + * [hypothesis testing](mllib-statistics.html#hypothesis-testing) + * [random data generation](mllib-statistics.html#random-data-generation) * [Classification and regression](mllib-classification-regression.html) * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html) * [naive Bayes](mllib-naive-bayes.html) * [decision trees](mllib-decision-tree.html) - * [ensembles of trees](mllib-ensembles.html) (Random Forests and Gradient-Boosted Trees) + * [ensembles of trees (Random Forests and Gradient-Boosted Trees)](mllib-ensembles.html) * [isotonic regression](mllib-isotonic-regression.html) * [Collaborative filtering](mllib-collaborative-filtering.html) - * alternating least squares (ALS) + * [alternating least squares (ALS)](mllib-collaborative-filtering.html#collaborative-filtering) * [Clustering](mllib-clustering.html) * [k-means](mllib-clustering.html#k-means) * [Gaussian mixture](mllib-clustering.html#gaussian-mixture) @@ -43,19 +43,19 @@ This lists functionality included in `spark.mllib`, the main MLlib API. 
* [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda) * [streaming k-means](mllib-clustering.html#streaming-k-means) * [Dimensionality reduction](mllib-dimensionality-reduction.html) - * singular value decomposition (SVD) - * principal component analysis (PCA) + * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd) + * [principal component analysis (PCA)](mllib-dimensionality-reduction.html#principal-component-analysis-pca) * [Feature extraction and transformation](mllib-feature-extraction.html) * [Frequent pattern mining](mllib-frequent-pattern-mining.html) - * FP-growth + * [FP-growth](mllib-frequent-pattern-mining.html#fp-growth) * [Evaluation Metrics](mllib-evaluation-metrics.html) * [Optimization (developer)](mllib-optimization.html) - * stochastic gradient descent - * limited-memory BFGS (L-BFGS) + * [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd) + * [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs) * [PMML model export](mllib-pmml-model-export.html) MLlib is under active development. -The APIs marked `Experimental`/`DeveloperApi` may change in future releases, +The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. # spark.ml: high-level APIs for ML pipelines - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-7707] User guide and example code for KernelDensity
Repository: spark Updated Branches: refs/heads/master 0b6b01761 -> f9d1a92aa [SPARK-7707] User guide and example code for KernelDensity Author: Sandy Ryza sa...@cloudera.com Closes #8230 from sryza/sandy-spark-7707. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f9d1a92a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f9d1a92a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f9d1a92a Branch: refs/heads/master Commit: f9d1a92aa1bac4494022d78559b871149579e6e8 Parents: 0b6b017 Author: Sandy Ryza sa...@cloudera.com Authored: Mon Aug 17 17:57:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:57:51 2015 -0700 -- docs/mllib-statistics.md | 77 +++ 1 file changed, 77 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f9d1a92a/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index be04d0b..80a9d06 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 100L, 10) v = u.map(lambda x: 1.0 + 2.0 * x) {% endhighlight %} /div +/div + +## Kernel density estimation + +[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique +useful for visualizing empirical probability distributions without requiring assumptions about the +particular distribution that the observed samples are drawn from. It computes an estimate of the +probability density function of a random variable, evaluated at a given set of points. It achieves +this estimate by expressing the PDF of the empirical distribution at a particular point as the +mean of PDFs of normal distributions centered around each of the samples. + +div class=codetabs + +div data-lang=scala markdown=1 +[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples.
The following example demonstrates how +to do so. + +{% highlight scala %} +import org.apache.spark.mllib.stat.KernelDensity +import org.apache.spark.rdd.RDD + +val data: RDD[Double] = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +val kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0) + +// Find density estimates for the given values +val densities = kd.estimate(Array(-1.0, 2.0, 5.0)) +{% endhighlight %} +/div + +div data-lang=java markdown=1 +[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight java %} +import org.apache.spark.mllib.stat.KernelDensity; +import org.apache.spark.rdd.RDD; + +RDD<Double> data = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +KernelDensity kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0); + +// Find density estimates for the given values +double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0}); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight python %} +from pyspark.mllib.stat import KernelDensity + +data = ...
# an RDD of sample data + +# Construct the density estimator with the sample data and a standard deviation for the Gaussian +# kernels +kd = KernelDensity() +kd.setSample(data) +kd.setBandwidth(3.0) + +# Find density estimates for the given values +densities = kd.estimate([-1.0, 2.0, 5.0]) +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-7707] User guide and example code for KernelDensity
Repository: spark Updated Branches: refs/heads/branch-1.4 4fc3b8cd2 -> f7f2ac69d [SPARK-7707] User guide and example code for KernelDensity Author: Sandy Ryza sa...@cloudera.com Closes #8230 from sryza/sandy-spark-7707. (cherry picked from commit f9d1a92aa1bac4494022d78559b871149579e6e8) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f7f2ac69 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f7f2ac69 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f7f2ac69 Branch: refs/heads/branch-1.4 Commit: f7f2ac69d7298a7eb4a89e94d1efddd97e036a2e Parents: 4fc3b8c Author: Sandy Ryza sa...@cloudera.com Authored: Mon Aug 17 17:57:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:58:43 2015 -0700 -- docs/mllib-statistics.md | 77 +++ 1 file changed, 77 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f7f2ac69/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index 887eae7..6b1b860 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -493,5 +493,82 @@ u = RandomRDDs.uniformRDD(sc, 100L, 10) v = u.map(lambda x: 1.0 + 2.0 * x) {% endhighlight %} /div +/div + +## Kernel density estimation + +[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique +useful for visualizing empirical probability distributions without requiring assumptions about the +particular distribution that the observed samples are drawn from. It computes an estimate of the +probability density function of a random variable, evaluated at a given set of points. It achieves +this estimate by expressing the PDF of the empirical distribution at a particular point as the +mean of PDFs of normal distributions centered around each of the samples.
+ +div class=codetabs + +div data-lang=scala markdown=1 +[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight scala %} +import org.apache.spark.mllib.stat.KernelDensity +import org.apache.spark.rdd.RDD + +val data: RDD[Double] = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +val kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0) + +// Find density estimates for the given values +val densities = kd.estimate(Array(-1.0, 2.0, 5.0)) +{% endhighlight %} +/div + +div data-lang=java markdown=1 +[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight java %} +import org.apache.spark.mllib.stat.KernelDensity; +import org.apache.spark.rdd.RDD; + +RDD<Double> data = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +KernelDensity kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0); + +// Find density estimates for the given values +double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0}); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight python %} +from pyspark.mllib.stat import KernelDensity + +data = ...
# an RDD of sample data + +# Construct the density estimator with the sample data and a standard deviation for the Gaussian +# kernels +kd = KernelDensity() +kd.setSample(data) +kd.setBandwidth(3.0) + +# Find density estimates for the given values +densities = kd.estimate([-1.0, 2.0, 5.0]) +{% endhighlight %} +/div /div
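The guide text above describes the estimate as the mean of Gaussian PDFs centered at each observed sample, with the bandwidth as their standard deviation. A minimal, self-contained NumPy sketch of that computation follows; the helper name `kernel_density_estimate` is ours for illustration and is not part of the MLlib API.

```python
import numpy as np

def kernel_density_estimate(sample, bandwidth, points):
    """Estimate the PDF at each of `points` as the mean of Gaussian
    PDFs centered at each observed sample (std. dev. = bandwidth)."""
    sample = np.asarray(sample, dtype=float)
    points = np.asarray(points, dtype=float)
    # diffs[i, j] = distance from evaluation point i to sample j
    diffs = points[:, None] - sample[None, :]
    norm = 1.0 / (bandwidth * np.sqrt(2.0 * np.pi))
    pdfs = norm * np.exp(-0.5 * (diffs / bandwidth) ** 2)
    return pdfs.mean(axis=1)

# With a single sample at 1.0 and bandwidth 1.0, the estimate at 1.0 is
# the peak of a standard normal density, 1/sqrt(2*pi) (about 0.3989).
print(kernel_density_estimate([1.0], 1.0, [1.0])[0])
```

This mirrors what `KernelDensity.estimate` returns for the same sample, bandwidth, and evaluation points, computed locally instead of over an RDD.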
spark git commit: [SPARK-7808] [ML] add package doc for ml.feature
Repository: spark Updated Branches: refs/heads/master ee093c8b9 - e290029a3 [SPARK-7808] [ML] add package doc for ml.feature This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng m...@databricks.com Closes #8260 from mengxr/SPARK-7808. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e290029a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e290029a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e290029a Branch: refs/heads/master Commit: e290029a356222bddf4da1be0525a221a5a1630b Parents: ee093c8 Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 17 19:40:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 19:40:51 2015 -0700 -- .../org/apache/spark/ml/feature/package.scala | 89 1 file changed, 89 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e290029a/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala new file mode 100644 index 000..4571ab2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel, VectorAssembler} +import org.apache.spark.sql.DataFrame + +/** + * == Feature transformers == + * + * The `ml.feature` package provides common feature transformers that help convert raw data or + * features into more suitable forms for model fitting. + * Most feature transformers are implemented as [[Transformer]]s, which transform one [[DataFrame]] + * into another, e.g., [[HashingTF]]. + * Some feature transformers are implemented as [[Estimator]]s, because the transformation requires + * some aggregated information of the dataset, e.g., document frequencies in [[IDF]]. + * For those feature transformers, calling [[Estimator!.fit]] is required to obtain the model first, + * e.g., [[IDFModel]], in order to apply transformation. + * The transformation is usually done by appending new columns to the input [[DataFrame]], so all + * input columns are carried over. + * + * We try to make each transformer minimal, so it becomes flexible to assemble feature + * transformation pipelines. + * [[Pipeline]] can be used to chain feature transformers, and [[VectorAssembler]] can be used to + * combine multiple feature transformations, for example: + * + * {{{ + * import org.apache.spark.ml.feature._ + * import org.apache.spark.ml.Pipeline + * + * // a DataFrame with three columns: id (integer), text (string), and rating (double). 
+ * val df = sqlContext.createDataFrame(Seq( + * (0, Hi I heard about Spark, 3.0), + * (1, I wish Java could use case classes, 4.0), + * (2, Logistic regression models are neat, 4.0) + * )).toDF(id, text, rating) + * + * // define feature transformers + * val tok = new RegexTokenizer() + * .setInputCol(text) + * .setOutputCol(words) + * val sw = new StopWordsRemover() + * .setInputCol(words) + * .setOutputCol(filtered_words) + * val tf = new HashingTF() + * .setInputCol(filtered_words) + * .setOutputCol(tf) + * .setNumFeatures(1) + * val idf = new IDF() + * .setInputCol(tf) + * .setOutputCol(tf_idf) + * val assembler = new VectorAssembler() + * .setInputCols(Array(tf_idf, rating)) + * .setOutputCol(features) + * + * // assemble and fit the feature transformation pipeline + * val pipeline = new Pipeline() + * .setStages(Array(tok, sw, tf, idf, assembler)) + * val model = pipeline.fit(df) + * + * // save transformed features with raw data + * model.transform(df) + * .select(id, text, rating, features) + * .write.format
spark git commit: [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg
Repository: spark Updated Branches: refs/heads/branch-1.5 bb3bb2a48 - 0f1417b6f [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.Samavihome Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.local Closes #7729 from sabhyankar/branch_8920. (cherry picked from commit 088b11ec5949e135cb3db2a1ce136837e046c288) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f1417b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f1417b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f1417b6 Branch: refs/heads/branch-1.5 Commit: 0f1417b6f31e53dd78aae2a0a661d9ba32dce5b7 Parents: bb3bb2a Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.Samavihome Authored: Mon Aug 17 16:00:23 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 16:00:31 2015 -0700 -- .../apache/spark/mllib/linalg/Matrices.scala| 63 .../linalg/SingularValueDecomposition.scala | 1 + .../org/apache/spark/mllib/linalg/Vectors.scala | 60 +++ .../mllib/linalg/distributed/BlockMatrix.scala | 43 +++-- .../linalg/distributed/CoordinateMatrix.scala | 28 +++-- .../linalg/distributed/DistributedMatrix.scala | 1 + .../linalg/distributed/IndexedRowMatrix.scala | 24 +++- .../mllib/linalg/distributed/RowMatrix.scala| 24 +++- 8 files changed, 227 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f1417b6/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala index 1139ce3..dfa8910 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala @@ -227,6 +227,7 @@ private[spark] class MatrixUDT extends UserDefinedType[Matrix] { * @param values 
matrix entries in column major if not transposed or in row major otherwise * @param isTransposed whether the matrix is transposed. If true, `values` stores the matrix in * row major. + * @since 1.0.0 */ @SQLUserDefinedType(udt = classOf[MatrixUDT]) class DenseMatrix( @@ -252,6 +253,7 @@ class DenseMatrix( * @param numRows number of rows * @param numCols number of columns * @param values matrix entries in column major + * @since 1.3.0 */ def this(numRows: Int, numCols: Int, values: Array[Double]) = this(numRows, numCols, values, false) @@ -276,6 +278,9 @@ class DenseMatrix( private[mllib] def apply(i: Int): Double = values(i) + /** + * @since 1.3.0 + */ override def apply(i: Int, j: Int): Double = values(index(i, j)) private[mllib] def index(i: Int, j: Int): Int = { @@ -286,6 +291,9 @@ class DenseMatrix( values(index(i, j)) = v } + /** + * @since 1.4.0 + */ override def copy: DenseMatrix = new DenseMatrix(numRows, numCols, values.clone()) private[spark] def map(f: Double = Double) = new DenseMatrix(numRows, numCols, values.map(f), @@ -301,6 +309,9 @@ class DenseMatrix( this } + /** + * @since 1.3.0 + */ override def transpose: DenseMatrix = new DenseMatrix(numCols, numRows, values, !isTransposed) private[spark] override def foreachActive(f: (Int, Int, Double) = Unit): Unit = { @@ -331,13 +342,20 @@ class DenseMatrix( } } + /** + * @since 1.5.0 + */ override def numNonzeros: Int = values.count(_ != 0) + /** + * @since 1.5.0 + */ override def numActives: Int = values.length /** * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed * set to false. + * @since 1.3.0 */ def toSparse: SparseMatrix = { val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble @@ -365,6 +383,7 @@ class DenseMatrix( /** * Factory methods for [[org.apache.spark.mllib.linalg.DenseMatrix]]. 
+ * @since 1.3.0 */ object DenseMatrix { @@ -373,6 +392,7 @@ object DenseMatrix { * @param numRows number of rows of the matrix * @param numCols number of columns of the matrix * @return `DenseMatrix` with size `numRows` x `numCols` and values of zeros + * @since 1.3.0 */ def zeros(numRows: Int, numCols: Int): DenseMatrix = { require(numRows.toLong * numCols = Int.MaxValue, @@ -385,6 +405,7 @@ object DenseMatrix { * @param numRows number of rows of the matrix * @param numCols number of columns of the matrix * @return `DenseMatrix` with size `numRows` x `numCols` and values of ones + * @since 1.3.0 */ def ones
spark git commit: [SPARK-9898] [MLLIB] Prefix Span user guide
Repository: spark Updated Branches: refs/heads/branch-1.5 f5ed9ede9 - 18b3d11f7 [SPARK-9898] [MLLIB] Prefix Span user guide Adds user guide for `PrefixSpan`, including Scala and Java example code. mengxr zhangjiajin Author: Feynman Liang fli...@databricks.com Closes #8253 from feynmanliang/SPARK-9898. (cherry picked from commit 0b6b01761370629ce387c143a25d41f3a334ff28) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18b3d11f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18b3d11f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18b3d11f Branch: refs/heads/branch-1.5 Commit: 18b3d11f787c48b429ffdef0075d398d7a0ab1a1 Parents: f5ed9ed Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 17:53:24 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:53:31 2015 -0700 -- docs/mllib-frequent-pattern-mining.md | 96 ++ docs/mllib-guide.md | 1 + 2 files changed, 97 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/18b3d11f/docs/mllib-frequent-pattern-mining.md -- diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index bcc066a..8ea4389 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -96,3 +96,99 @@ for (FPGrowth.FreqItemsetString itemset: model.freqItemsets().toJavaRDD().coll /div /div + +## PrefixSpan + +PrefixSpan is a sequential pattern mining algorithm described in +[Pei et al., Mining Sequential Patterns by Pattern-Growth: The +PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer +the reader to the referenced paper for formalizing the sequential +pattern mining problem. + +MLlib's PrefixSpan implementation takes the following parameters: + +* `minSupport`: the minimum support required to be considered a frequent + sequential pattern. 
+* `maxPatternLength`: the maximum length of a frequent sequential + pattern. Any frequent pattern exceeding this length will not be + included in the results. +* `maxLocalProjDBSize`: the maximum number of items allowed in a + prefix-projected database before local iterative processing of the + projected database begins. This parameter should be tuned with respect + to the size of your executors. + +**Examples** + +The following example illustrates PrefixSpan running on the sequences +(using the same notation as Pei et al.): + +~~~ + <(12)3> + <1(32)(12)> + <(12)5> + <6> +~~~ + +div class=codetabs +div data-lang=scala markdown=1 + +[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the +PrefixSpan algorithm. +Calling `PrefixSpan.run` returns a +[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) +that stores the frequent sequences with their frequencies. + +{% highlight scala %} +import org.apache.spark.mllib.fpm.PrefixSpan + +val sequences = sc.parallelize(Seq( +Array(Array(1, 2), Array(3)), +Array(Array(1), Array(3, 2), Array(1, 2)), +Array(Array(1, 2), Array(5)), +Array(Array(6)) + ), 2).cache() +val prefixSpan = new PrefixSpan() + .setMinSupport(0.5) + .setMaxPatternLength(5) +val model = prefixSpan.run(sequences) +model.freqSequences.collect().foreach { freqSequence => +println( + freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + freqSequence.freq) +} +{% endhighlight %} + +/div + +div data-lang=java markdown=1 + +[`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the +PrefixSpan algorithm. +Calling `PrefixSpan.run` returns a +[`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html) +that stores the frequent sequences with their frequencies.
+ +{% highlight java %} +import java.util.Arrays; +import java.util.List; + +import org.apache.spark.mllib.fpm.PrefixSpan; +import org.apache.spark.mllib.fpm.PrefixSpanModel; + +JavaRDD<List<List<Integer>>> sequences = sc.parallelize(Arrays.asList( + Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)), + Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)), + Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)), + Arrays.asList(Arrays.asList(6)) +), 2); +PrefixSpan prefixSpan = new PrefixSpan() + .setMinSupport(0.5) + .setMaxPatternLength(5); +PrefixSpanModel<Integer> model = prefixSpan.run(sequences); +for (PrefixSpan.FreqSequence<Integer> freqSeq: model.freqSequences().toJavaRDD().collect()) { + System.out.println(freqSeq.javaSequence() + ", " + freqSeq.freq()); +} +{% endhighlight %} + +/div +/div + http://git-wip-us.apache.org/repos/asf/spark/blob/18b3d11f/docs/mllib-guide.md
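To make the `minSupport` parameter concrete, the following self-contained Python sketch counts the support of a candidate sequential pattern over the four example sequences from the guide text. This is only an illustration of what "frequent" means (a pattern is a list of itemsets that must match ever-later itemsets of a sequence, in order); it is not the PrefixSpan algorithm itself, which avoids enumerating candidates by recursively projecting the database on frequent prefixes.

```python
def contains(sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs in `sequence`:
    each pattern itemset must be a subset of some itemset of the
    sequence, with matches appearing in increasing positions."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and set(pattern[i]) <= set(itemset):
            i += 1  # greedy matching is safe for subsequence containment
    return i == len(pattern)

# The example database <(12)3>, <1(32)(12)>, <(12)5>, <6>
sequences = [
    [{1, 2}, {3}],
    [{1}, {3, 2}, {1, 2}],
    [{1, 2}, {5}],
    [{6}],
]

def support(pattern):
    """Fraction of sequences containing the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

print(support([{1, 2}]))    # itemset (12) occurs in 3 of 4 sequences -> 0.75
print(support([{1}, {3}]))  # 1 followed by 3 occurs in 2 of 4 -> 0.5
```

With `minSupport = 0.5` as in the examples above, both of these patterns would be reported as frequent sequential patterns.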
spark git commit: [SPARK-9898] [MLLIB] Prefix Span user guide
Repository: spark Updated Branches: refs/heads/master 18523c130 - 0b6b01761 [SPARK-9898] [MLLIB] Prefix Span user guide Adds user guide for `PrefixSpan`, including Scala and Java example code. mengxr zhangjiajin Author: Feynman Liang fli...@databricks.com Closes #8253 from feynmanliang/SPARK-9898. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b6b0176 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b6b0176 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b6b0176 Branch: refs/heads/master Commit: 0b6b01761370629ce387c143a25d41f3a334ff28 Parents: 18523c1 Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 17:53:24 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:53:24 2015 -0700 -- docs/mllib-frequent-pattern-mining.md | 96 ++ docs/mllib-guide.md | 1 + 2 files changed, 97 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0b6b0176/docs/mllib-frequent-pattern-mining.md -- diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index bcc066a..8ea4389 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -96,3 +96,99 @@ for (FPGrowth.FreqItemsetString itemset: model.freqItemsets().toJavaRDD().coll /div /div + +## PrefixSpan + +PrefixSpan is a sequential pattern mining algorithm described in +[Pei et al., Mining Sequential Patterns by Pattern-Growth: The +PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer +the reader to the referenced paper for formalizing the sequential +pattern mining problem. + +MLlib's PrefixSpan implementation takes the following parameters: + +* `minSupport`: the minimum support required to be considered a frequent + sequential pattern. +* `maxPatternLength`: the maximum length of a frequent sequential + pattern. Any frequent pattern exceeding this length will not be + included in the results. 
+* `maxLocalProjDBSize`: the maximum number of items allowed in a + prefix-projected database before local iterative processing of the + projected database begins. This parameter should be tuned with respect + to the size of your executors. + +**Examples** + +The following example illustrates PrefixSpan running on the sequences +(using the same notation as Pei et al.): + +~~~ + <(12)3> + <1(32)(12)> + <(12)5> + <6> +~~~ + +div class=codetabs +div data-lang=scala markdown=1 + +[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the +PrefixSpan algorithm. +Calling `PrefixSpan.run` returns a +[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) +that stores the frequent sequences with their frequencies. + +{% highlight scala %} +import org.apache.spark.mllib.fpm.PrefixSpan + +val sequences = sc.parallelize(Seq( +Array(Array(1, 2), Array(3)), +Array(Array(1), Array(3, 2), Array(1, 2)), +Array(Array(1, 2), Array(5)), +Array(Array(6)) + ), 2).cache() +val prefixSpan = new PrefixSpan() + .setMinSupport(0.5) + .setMaxPatternLength(5) +val model = prefixSpan.run(sequences) +model.freqSequences.collect().foreach { freqSequence => +println( + freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + freqSequence.freq) +} +{% endhighlight %} + +/div + +div data-lang=java markdown=1 + +[`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the +PrefixSpan algorithm. +Calling `PrefixSpan.run` returns a +[`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html) +that stores the frequent sequences with their frequencies.
+ +{% highlight java %} +import java.util.Arrays; +import java.util.List; + +import org.apache.spark.mllib.fpm.PrefixSpan; +import org.apache.spark.mllib.fpm.PrefixSpanModel; + +JavaRDD<List<List<Integer>>> sequences = sc.parallelize(Arrays.asList( + Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)), + Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)), + Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)), + Arrays.asList(Arrays.asList(6)) +), 2); +PrefixSpan prefixSpan = new PrefixSpan() + .setMinSupport(0.5) + .setMaxPatternLength(5); +PrefixSpanModel<Integer> model = prefixSpan.run(sequences); +for (PrefixSpan.FreqSequence<Integer> freqSeq: model.freqSequences().toJavaRDD().collect()) { + System.out.println(freqSeq.javaSequence() + ", " + freqSeq.freq()); +} +{% endhighlight %} + +/div +/div + http://git-wip-us.apache.org/repos/asf/spark/blob/0b6b0176/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index e8000ff
spark git commit: [SPARK-9959] [MLLIB] Association Rules Java Compatibility
Repository: spark Updated Branches: refs/heads/master 3ff81ad2d - f7efda397 [SPARK-9959] [MLLIB] Association Rules Java Compatibility mengxr Author: Feynman Liang fli...@databricks.com Closes #8206 from feynmanliang/SPARK-9959-arules-java. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f7efda39 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f7efda39 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f7efda39 Branch: refs/heads/master Commit: f7efda3975d46a8ce4fd720b3730127ea482560b Parents: 3ff81ad Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 09:58:34 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 09:58:34 2015 -0700 -- .../spark/mllib/fpm/AssociationRules.scala | 30 ++-- 1 file changed, 28 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f7efda39/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala index 72d0ea0..7f4de77 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala @@ -16,6 +16,7 @@ */ package org.apache.spark.mllib.fpm +import scala.collection.JavaConverters._ import scala.reflect.ClassTag import org.apache.spark.Logging @@ -95,8 +96,10 @@ object AssociationRules { * :: Experimental :: * * An association rule between sets of items. - * @param antecedent hypotheses of the rule - * @param consequent conclusion of the rule + * @param antecedent hypotheses of the rule. Java users should call [[Rule#javaAntecedent]] + * instead. + * @param consequent conclusion of the rule. Java users should call [[Rule#javaConsequent]] + * instead. 
* @tparam Item item type * * @since 1.5.0 */ @@ -108,6 +111,11 @@ object AssociationRules { freqUnion: Double, freqAntecedent: Double) extends Serializable { +/** + * Returns the confidence of the rule. + * + * @since 1.5.0 + */ def confidence: Double = freqUnion.toDouble / freqAntecedent require(antecedent.toSet.intersect(consequent.toSet).isEmpty, { @@ -115,5 +123,23 @@ object AssociationRules { sA valid association rule must have disjoint antecedent and + sconsequent but ${sharedItems} is present in both. }) + +/** + * Returns antecedent in a Java List. + * + * @since 1.5.0 + */ +def javaAntecedent: java.util.List[Item] = { + antecedent.toList.asJava +} + +/** + * Returns consequent in a Java List. + * + * @since 1.5.0 + */ +def javaConsequent: java.util.List[Item] = { + consequent.toList.asJava +} } }
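The `confidence` definition visible in the diff (`freqUnion.toDouble / freqAntecedent`) is easy to check standalone. The helper below is an illustration of that formula only, not part of the Spark API; the counts are hypothetical.

```python
def rule_confidence(freq_union, freq_antecedent):
    """Confidence of the rule antecedent => consequent: the fraction of
    transactions containing the antecedent that also contain the union
    of antecedent and consequent."""
    return freq_union / freq_antecedent

# If the antecedent {a} appears in 4 transactions and {a, b} in 3,
# then confidence({a} => {b}) = 3/4 = 0.75
print(rule_confidence(3, 4))
```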
spark git commit: [SPARK-9828] [PYSPARK] Mutable values should not be default arguments
Repository: spark Updated Branches: refs/heads/master ece00566e - ffa05c84f [SPARK-9828] [PYSPARK] Mutable values should not be default arguments Author: MechCoder manojkumarsivaraj...@gmail.com Closes #8110 from MechCoder/spark-9828. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ffa05c84 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ffa05c84 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ffa05c84 Branch: refs/heads/master Commit: ffa05c84fe75663fc33f3d954d1cb1e084ab3280 Parents: ece0056 Author: MechCoder manojkumarsivaraj...@gmail.com Authored: Fri Aug 14 12:46:05 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 14 12:46:05 2015 -0700 -- python/pyspark/ml/evaluation.py | 4 +++- python/pyspark/ml/param/__init__.py | 26 +- python/pyspark/ml/pipeline.py | 4 ++-- python/pyspark/ml/tuning.py | 8 ++-- python/pyspark/rdd.py | 5 - python/pyspark/sql/readwriter.py| 8 ++-- python/pyspark/statcounter.py | 4 +++- python/pyspark/streaming/kafka.py | 12 +--- 8 files changed, 50 insertions(+), 21 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ffa05c84/python/pyspark/ml/evaluation.py -- diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 2734092..e23ce05 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -46,7 +46,7 @@ class Evaluator(Params): raise NotImplementedError() -def evaluate(self, dataset, params={}): +def evaluate(self, dataset, params=None): Evaluates the output with optional parameters. 
@@ -56,6 +56,8 @@ class Evaluator(Params): params :return: metric +if params is None: +params = dict() if isinstance(params, dict): if params: return self.copy(params)._evaluate(dataset) http://git-wip-us.apache.org/repos/asf/spark/blob/ffa05c84/python/pyspark/ml/param/__init__.py -- diff --git a/python/pyspark/ml/param/__init__.py b/python/pyspark/ml/param/__init__.py index 7845536..eeeac49 100644 --- a/python/pyspark/ml/param/__init__.py +++ b/python/pyspark/ml/param/__init__.py @@ -60,14 +60,16 @@ class Params(Identifiable): __metaclass__ = ABCMeta -#: internal param map for user-supplied values param map -_paramMap = {} +def __init__(self): +super(Params, self).__init__() +#: internal param map for user-supplied values param map +self._paramMap = {} -#: internal param map for default values -_defaultParamMap = {} +#: internal param map for default values +self._defaultParamMap = {} -#: value returned by :py:func:`params` -_params = None +#: value returned by :py:func:`params` +self._params = None @property def params(self): @@ -155,7 +157,7 @@ class Params(Identifiable): else: return self._defaultParamMap[param] -def extractParamMap(self, extra={}): +def extractParamMap(self, extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into @@ -165,12 +167,14 @@ class Params(Identifiable): :param extra: extra param values :return: merged param map +if extra is None: +extra = dict() paramMap = self._defaultParamMap.copy() paramMap.update(self._paramMap) paramMap.update(extra) return paramMap -def copy(self, extra={}): +def copy(self, extra=None): Creates a copy of this instance with the same uid and some extra params. 
The default implementation creates a @@ -181,6 +185,8 @@ class Params(Identifiable): :param extra: Extra parameters to copy to the new instance :return: Copy of this instance +if extra is None: +extra = dict() that = copy.copy(self) that._paramMap = self.extractParamMap(extra) return that @@ -233,7 +239,7 @@ class Params(Identifiable): self._defaultParamMap[getattr(self, param)] = value return self -def _copyValues(self, to, extra={}): +def _copyValues(self, to, extra=None): Copies param values from this instance to another instance for params shared by them. @@ -241,6 +247,8 @@ class Params(Identifiable): :param extra: extra params to be copied :return: the target
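The patch above replaces every `params={}` default with `params=None` plus an in-body `params = dict()`. The reason is a standard Python pitfall: a default argument expression is evaluated once, at function definition time, so a mutable default is shared across all calls that omit the argument. A self-contained sketch (the `evaluate_bad`/`evaluate_good` names are ours, loosely mirroring the `evaluate(self, dataset, params=None)` signature in the diff):

```python
def evaluate_bad(dataset, params={}):
    # The SAME dict object is reused on every call that omits `params`,
    # so state silently leaks between calls.
    params.setdefault("calls", 0)
    params["calls"] += 1
    return params["calls"]

print(evaluate_bad("d1"))  # 1
print(evaluate_bad("d2"))  # 2 -- surprising: the counter survived the first call

def evaluate_good(dataset, params=None):
    # A fresh dict is created on every call that omits `params`.
    if params is None:
        params = {}
    params.setdefault("calls", 0)
    params["calls"] += 1
    return params["calls"]

print(evaluate_good("d1"))  # 1
print(evaluate_good("d2"))  # 1
```

This is why the fix touches default values only and leaves the rest of each function's behavior unchanged.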
spark git commit: [SPARK-9828] [PYSPARK] Mutable values should not be default arguments
Repository: spark Updated Branches: refs/heads/branch-1.4 db71ea482 - 969e8b31b [SPARK-9828] [PYSPARK] Mutable values should not be default arguments Author: MechCoder manojkumarsivaraj...@gmail.com Closes #8110 from MechCoder/spark-9828. (cherry picked from commit ffa05c84fe75663fc33f3d954d1cb1e084ab3280) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/969e8b31 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/969e8b31 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/969e8b31 Branch: refs/heads/branch-1.4 Commit: 969e8b31b48fe1b26fcc667b46ba97a538b1e382 Parents: db71ea4 Author: MechCoder manojkumarsivaraj...@gmail.com Authored: Fri Aug 14 12:46:05 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 14 12:50:46 2015 -0700 -- python/pyspark/ml/evaluation.py | 4 +++- python/pyspark/ml/param/__init__.py | 26 +- python/pyspark/ml/pipeline.py | 4 ++-- python/pyspark/ml/tuning.py | 8 ++-- python/pyspark/rdd.py | 5 - python/pyspark/sql/readwriter.py| 8 ++-- python/pyspark/statcounter.py | 4 +++- python/pyspark/streaming/kafka.py | 12 +--- 8 files changed, 50 insertions(+), 21 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/969e8b31/python/pyspark/ml/evaluation.py -- diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 595593a..7af447c 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -45,7 +45,7 @@ class Evaluator(Params): raise NotImplementedError() -def evaluate(self, dataset, params={}): +def evaluate(self, dataset, params=None): Evaluates the output with optional parameters. 
@@ -55,6 +55,8 @@ class Evaluator(Params): params :return: metric +if params is None: +params = dict() if isinstance(params, dict): if params: return self.copy(params)._evaluate(dataset) http://git-wip-us.apache.org/repos/asf/spark/blob/969e8b31/python/pyspark/ml/param/__init__.py -- diff --git a/python/pyspark/ml/param/__init__.py b/python/pyspark/ml/param/__init__.py index 7845536..eeeac49 100644 --- a/python/pyspark/ml/param/__init__.py +++ b/python/pyspark/ml/param/__init__.py @@ -60,14 +60,16 @@ class Params(Identifiable): __metaclass__ = ABCMeta -#: internal param map for user-supplied values param map -_paramMap = {} +def __init__(self): +super(Params, self).__init__() +#: internal param map for user-supplied values param map +self._paramMap = {} -#: internal param map for default values -_defaultParamMap = {} +#: internal param map for default values +self._defaultParamMap = {} -#: value returned by :py:func:`params` -_params = None +#: value returned by :py:func:`params` +self._params = None @property def params(self): @@ -155,7 +157,7 @@ class Params(Identifiable): else: return self._defaultParamMap[param] -def extractParamMap(self, extra={}): +def extractParamMap(self, extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into @@ -165,12 +167,14 @@ class Params(Identifiable): :param extra: extra param values :return: merged param map +if extra is None: +extra = dict() paramMap = self._defaultParamMap.copy() paramMap.update(self._paramMap) paramMap.update(extra) return paramMap -def copy(self, extra={}): +def copy(self, extra=None): Creates a copy of this instance with the same uid and some extra params. 
The default implementation creates a @@ -181,6 +185,8 @@ class Params(Identifiable): :param extra: Extra parameters to copy to the new instance :return: Copy of this instance +if extra is None: +extra = dict() that = copy.copy(self) that._paramMap = self.extractParamMap(extra) return that @@ -233,7 +239,7 @@ class Params(Identifiable): self._defaultParamMap[getattr(self, param)] = value return self -def _copyValues(self, to, extra={}): +def _copyValues(self, to, extra=None): Copies param values from this instance to another instance for params shared by them
spark git commit: [SPARK-9981] [ML] Made labels public for StringIndexerModel
Repository: spark Updated Branches: refs/heads/master 11ed2b180 - 2a6590e51 [SPARK-9981] [ML] Made labels public for StringIndexerModel Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #8211 from jkbradley/stridx-labels. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2a6590e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2a6590e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2a6590e5 Branch: refs/heads/master Commit: 2a6590e510aba3bfc6603d280023128b3f5ac702 Parents: 11ed2b1 Author: Joseph K. Bradley jos...@databricks.com Authored: Fri Aug 14 14:05:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 14 14:05:03 2015 -0700 -- .../apache/spark/ml/feature/StringIndexer.scala | 5 - .../spark/ml/feature/StringIndexerSuite.scala | 18 ++ 2 files changed, 22 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2a6590e5/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index 6347578..24250e4 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala @@ -97,14 +97,17 @@ class StringIndexer(override val uid: String) extends Estimator[StringIndexerMod /** * :: Experimental :: * Model fitted by [[StringIndexer]]. + * * NOTE: During transformation, if the input column does not exist, * [[StringIndexerModel.transform]] would return the input dataset unmodified. 
* This is a temporary fix for the case when target labels do not exist during prediction. + * + * @param labels Ordered list of labels, corresponding to indices to be assigned */ @Experimental class StringIndexerModel ( override val uid: String, -labels: Array[String]) extends Model[StringIndexerModel] with StringIndexerBase { +val labels: Array[String]) extends Model[StringIndexerModel] with StringIndexerBase { def this(labels: Array[String]) = this(Identifiable.randomUID(strIdx), labels) http://git-wip-us.apache.org/repos/asf/spark/blob/2a6590e5/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala index 0b4c8ba..05e05bd 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala @@ -147,4 +147,22 @@ class StringIndexerSuite extends SparkFunSuite with MLlibTestSparkContext { assert(actual === expected) } } + + test(StringIndexer, IndexToString are inverses) { +val data = sc.parallelize(Seq((0, a), (1, b), (2, c), (3, a), (4, a), (5, c)), 2) +val df = sqlContext.createDataFrame(data).toDF(id, label) +val indexer = new StringIndexer() + .setInputCol(label) + .setOutputCol(labelIndex) + .fit(df) +val transformed = indexer.transform(df) +val idx2str = new IndexToString() + .setInputCol(labelIndex) + .setOutputCol(sameLabel) + .setLabels(indexer.labels) +idx2str.transform(transformed).select(label, sameLabel).collect().foreach { + case Row(a: String, b: String) = +assert(a === b) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
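The new test asserts that `StringIndexer` and `IndexToString` are inverses once the model's `labels` are public. A minimal plain-Python sketch of the same round trip, assuming labels are ordered by descending frequency (which is what `StringIndexer` does); the helper names are illustrative, not Spark APIs:

```python
from collections import Counter

def fit_string_indexer(values):
    """Order labels by descending frequency, like StringIndexer.fit."""
    return [label for label, _ in Counter(values).most_common()]

def transform(labels, values):
    """Map each string to its label index, like StringIndexerModel.transform."""
    index = {label: i for i, label in enumerate(labels)}
    return [index[v] for v in values]

def index_to_string(labels, indices):
    """The inverse mapping that IndexToString performs via setLabels."""
    return [labels[i] for i in indices]

data = ["a", "b", "c", "a", "a", "c"]      # same strings as the Scala test
labels = fit_string_indexer(data)           # ['a', 'c', 'b'] by frequency
indexed = transform(labels, data)           # [0, 2, 1, 0, 0, 1]
assert index_to_string(labels, indexed) == data   # round trip, as the test asserts
```

Exposing `val labels` is what makes `.setLabels(indexer.labels)` possible from user code, replacing the removed `inverse()` method.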
spark git commit: [SPARK-9981] [ML] Made labels public for StringIndexerModel
Repository: spark Updated Branches: refs/heads/branch-1.5 59cdcc079 - 0f4ccdc4c [SPARK-9981] [ML] Made labels public for StringIndexerModel Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #8211 from jkbradley/stridx-labels. (cherry picked from commit 2a6590e510aba3bfc6603d280023128b3f5ac702) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f4ccdc4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f4ccdc4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f4ccdc4 Branch: refs/heads/branch-1.5 Commit: 0f4ccdc4cfa02ad78f2c4949ddb3822d07d65104 Parents: 59cdcc0 Author: Joseph K. Bradley jos...@databricks.com Authored: Fri Aug 14 14:05:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 14 14:11:26 2015 -0700 -- .../apache/spark/ml/feature/StringIndexer.scala | 5 - .../spark/ml/feature/StringIndexerSuite.scala | 18 ++ 2 files changed, 22 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f4ccdc4/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index f5dfba1..76f017d 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala @@ -93,14 +93,17 @@ class StringIndexer(override val uid: String) extends Estimator[StringIndexerMod /** * :: Experimental :: * Model fitted by [[StringIndexer]]. 
+ * * NOTE: During transformation, if the input column does not exist, * [[StringIndexerModel.transform]] would return the input dataset unmodified. * This is a temporary fix for the case when target labels do not exist during prediction. + * + * @param labels Ordered list of labels, corresponding to indices to be assigned */ @Experimental class StringIndexerModel ( override val uid: String, -labels: Array[String]) extends Model[StringIndexerModel] with StringIndexerBase { +val labels: Array[String]) extends Model[StringIndexerModel] with StringIndexerBase { def this(labels: Array[String]) = this(Identifiable.randomUID(strIdx), labels) http://git-wip-us.apache.org/repos/asf/spark/blob/0f4ccdc4/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala index d960861..5fe66a3 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala @@ -116,4 +116,22 @@ class StringIndexerSuite extends SparkFunSuite with MLlibTestSparkContext { assert(actual === expected) } } + + test(StringIndexer, IndexToString are inverses) { +val data = sc.parallelize(Seq((0, a), (1, b), (2, c), (3, a), (4, a), (5, c)), 2) +val df = sqlContext.createDataFrame(data).toDF(id, label) +val indexer = new StringIndexer() + .setInputCol(label) + .setOutputCol(labelIndex) + .fit(df) +val transformed = indexer.transform(df) +val idx2str = new IndexToString() + .setInputCol(labelIndex) + .setOutputCol(sameLabel) + .setLabels(indexer.labels) +idx2str.transform(transformed).select(label, sameLabel).collect().foreach { + case Row(a: String, b: String) = +assert(a === b) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol
Repository: spark Updated Branches: refs/heads/branch-1.5 d213aa77c - ae18342a5 [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues. This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters. jkbradley yu-iskw Author: Xiangrui Meng m...@databricks.com Closes #8148 from mengxr/SPARK-9918 and squashes the following commits: 149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol 3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API (cherry picked from commit 68f99571492f67596b3656e9f076deeb96616f4a) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ae18342a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ae18342a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ae18342a Branch: refs/heads/branch-1.5 Commit: ae18342a5d54a4f13d88579aac45ca4544268112 Parents: d213aa7 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 23:04:59 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 23:05:06 2015 -0700 -- .../org/apache/spark/ml/clustering/KMeans.scala | 51 .../spark/ml/clustering/KMeansSuite.scala | 12 +--- python/pyspark/ml/clustering.py | 63 3 files changed, 26 insertions(+), 100 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ae18342a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala -- diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala index dc192ad..47a18cd 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala @@ -18,8 +18,8 @@ package org.apache.spark.ml.clustering import org.apache.spark.annotation.Experimental -import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, ParamMap} -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, HasPredictionCol, HasSeed} +import org.apache.spark.ml.param.{Param, Params, IntParam, ParamMap} +import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.mllib.clustering.{KMeans = MLlibKMeans, KMeansModel = MLlibKMeansModel} @@ -27,14 +27,13 @@ import org.apache.spark.mllib.linalg.{Vector, VectorUDT} import org.apache.spark.sql.functions.{col, udf} import org.apache.spark.sql.types.{IntegerType, StructType} import org.apache.spark.sql.{DataFrame, Row} -import org.apache.spark.util.Utils /** * Common params for KMeans and KMeansModel */ -private[clustering] trait KMeansParams -extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol { +private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol + with HasSeed with HasPredictionCol with HasTol { /** * Set the number of clusters to create (k). Must be 1. Default: 2. @@ -46,31 +45,6 @@ private[clustering] trait KMeansParams def getK: Int = $(k) /** - * Param the number of runs of the algorithm to execute in parallel. We initialize the algorithm - * this many times with random starting conditions (configured by the initialization mode), then - * return the best clustering found over any run. Must be = 1. Default: 1. 
- * @group param - */ - final val runs = new IntParam(this, runs, -number of runs of the algorithm to execute in parallel, (value: Int) = value = 1) - - /** @group getParam */ - def getRuns: Int = $(runs) - - /** - * Param the distance threshold within which we've consider centers to have converged. - * If all centers move less than this Euclidean distance, we stop iterating one run. - * Must be = 0.0. Default: 1e-4 - * @group param - */ - final val epsilon = new DoubleParam(this, epsilon, -distance threshold within which we've consider centers to have converge, -(value: Double) = value = 0.0) - - /** @group getParam */ - def getEpsilon: Double = $(epsilon) - - /** * Param
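The removed `epsilon` param (renamed to the shared `tol`) is the distance threshold within which centers are considered converged: iteration stops when every center moves less than `tol` in Euclidean distance. A rough sketch of that stopping test (illustrative only, not Spark's implementation):

```python
import math

def converged(old_centers, new_centers, tol=1e-4):
    """Return True when every center moved less than `tol` (Euclidean)."""
    for old, new in zip(old_centers, new_centers):
        dist = math.sqrt(sum((o - n) ** 2 for o, n in zip(old, new)))
        if dist >= tol:
            return False   # at least one center is still moving
    return True

assert converged([(0.0, 0.0)], [(0.0, 5e-5)])        # tiny move: stop
assert not converged([(0.0, 0.0)], [(0.1, 0.0)])     # still moving: iterate
```

Renaming to `tol` lets k-means reuse the shared `HasTol` trait instead of defining its own `DoubleParam`, which is the consistency goal stated in the commit message.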
spark git commit: [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol
Repository: spark Updated Branches: refs/heads/master d0b18919d - 68f995714 [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues. This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters. jkbradley yu-iskw Author: Xiangrui Meng m...@databricks.com Closes #8148 from mengxr/SPARK-9918 and squashes the following commits: 149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol 3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/68f99571 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/68f99571 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/68f99571 Branch: refs/heads/master Commit: 68f99571492f67596b3656e9f076deeb96616f4a Parents: d0b1891 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 23:04:59 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 23:04:59 2015 -0700 -- .../org/apache/spark/ml/clustering/KMeans.scala | 51 .../spark/ml/clustering/KMeansSuite.scala | 12 +--- python/pyspark/ml/clustering.py | 63 3 files changed, 26 insertions(+), 100 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/68f99571/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
index dc192ad..47a18cd 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala @@ -18,8 +18,8 @@ package org.apache.spark.ml.clustering import org.apache.spark.annotation.Experimental -import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, ParamMap} -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, HasPredictionCol, HasSeed} +import org.apache.spark.ml.param.{Param, Params, IntParam, ParamMap} +import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.mllib.clustering.{KMeans = MLlibKMeans, KMeansModel = MLlibKMeansModel} @@ -27,14 +27,13 @@ import org.apache.spark.mllib.linalg.{Vector, VectorUDT} import org.apache.spark.sql.functions.{col, udf} import org.apache.spark.sql.types.{IntegerType, StructType} import org.apache.spark.sql.{DataFrame, Row} -import org.apache.spark.util.Utils /** * Common params for KMeans and KMeansModel */ -private[clustering] trait KMeansParams -extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol { +private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol + with HasSeed with HasPredictionCol with HasTol { /** * Set the number of clusters to create (k). Must be 1. Default: 2. @@ -46,31 +45,6 @@ private[clustering] trait KMeansParams def getK: Int = $(k) /** - * Param the number of runs of the algorithm to execute in parallel. We initialize the algorithm - * this many times with random starting conditions (configured by the initialization mode), then - * return the best clustering found over any run. Must be = 1. Default: 1. 
- * @group param - */ - final val runs = new IntParam(this, runs, -number of runs of the algorithm to execute in parallel, (value: Int) = value = 1) - - /** @group getParam */ - def getRuns: Int = $(runs) - - /** - * Param the distance threshold within which we've consider centers to have converged. - * If all centers move less than this Euclidean distance, we stop iterating one run. - * Must be = 0.0. Default: 1e-4 - * @group param - */ - final val epsilon = new DoubleParam(this, epsilon, -distance threshold within which we've consider centers to have converge, -(value: Double) = value = 0.0) - - /** @group getParam */ - def getEpsilon: Double = $(epsilon) - - /** * Param for the initialization algorithm. This can be either random to choose random points as * initial cluster centers, or k-means|| to use
spark git commit: [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel
Repository: spark Updated Branches: refs/heads/branch-1.5 49085b56c - 2b1353249 [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on. Author: Yanbo Liang yblia...@gmail.com Closes #8164 from yanboliang/mlp-name. (cherry picked from commit 4b70798c96b0a784b85fda461426ec60f609be12) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2b135324 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2b135324 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2b135324 Branch: refs/heads/branch-1.5 Commit: 2b13532497b23eb6e02e4b0ef7503e73242f932d Parents: 49085b5 Author: Yanbo Liang yblia...@gmail.com Authored: Thu Aug 13 09:31:14 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 09:31:24 2015 -0700 -- .../MultilayerPerceptronClassifier.scala| 16 1 file changed, 8 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2b135324/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index 8cd2103..c154561 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -131,7 +131,7 @@ private object LabelConverter { */ @Experimental class MultilayerPerceptronClassifier(override val uid: String) - extends Predictor[Vector, MultilayerPerceptronClassifier, 
MultilayerPerceptronClassifierModel] + extends Predictor[Vector, MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel] with MultilayerPerceptronParams { def this() = this(Identifiable.randomUID(mlpc)) @@ -146,7 +146,7 @@ class MultilayerPerceptronClassifier(override val uid: String) * @param dataset Training dataset * @return Fitted model */ - override protected def train(dataset: DataFrame): MultilayerPerceptronClassifierModel = { + override protected def train(dataset: DataFrame): MultilayerPerceptronClassificationModel = { val myLayers = $(layers) val labels = myLayers.last val lpData = extractLabeledPoints(dataset) @@ -156,13 +156,13 @@ class MultilayerPerceptronClassifier(override val uid: String) FeedForwardTrainer.LBFGSOptimizer.setConvergenceTol($(tol)).setNumIterations($(maxIter)) FeedForwardTrainer.setStackSize($(blockSize)) val mlpModel = FeedForwardTrainer.train(data) -new MultilayerPerceptronClassifierModel(uid, myLayers, mlpModel.weights()) +new MultilayerPerceptronClassificationModel(uid, myLayers, mlpModel.weights()) } } /** * :: Experimental :: - * Classifier model based on the Multilayer Perceptron. + * Classification model based on the Multilayer Perceptron. * Each layer has sigmoid activation function, output layer has softmax. 
* @param uid uid * @param layers array of layer sizes including input and output layers @@ -170,11 +170,11 @@ class MultilayerPerceptronClassifier(override val uid: String) * @return prediction model */ @Experimental -class MultilayerPerceptronClassifierModel private[ml] ( +class MultilayerPerceptronClassificationModel private[ml] ( override val uid: String, layers: Array[Int], weights: Vector) - extends PredictionModel[Vector, MultilayerPerceptronClassifierModel] + extends PredictionModel[Vector, MultilayerPerceptronClassificationModel] with Serializable { private val mlpModel = FeedForwardTopology.multiLayerPerceptron(layers, true).getInstance(weights) @@ -187,7 +187,7 @@ class MultilayerPerceptronClassifierModel private[ml] ( LabelConverter.decodeLabel(mlpModel.predict(features)) } - override def copy(extra: ParamMap): MultilayerPerceptronClassifierModel = { -copyValues(new MultilayerPerceptronClassifierModel(uid, layers, weights), extra) + override def copy(extra: ParamMap): MultilayerPerceptronClassificationModel = { +copyValues(new MultilayerPerceptronClassificationModel(uid, layers, weights), extra) } }
spark git commit: [MINOR] [DOC] fix mllib pydoc warnings
Repository: spark Updated Branches: refs/heads/branch-1.5 2b1353249 - 883c7d35f [MINOR] [DOC] fix mllib pydoc warnings Switch to correct Sphinx syntax. MechCoder Author: Xiangrui Meng m...@databricks.com Closes #8169 from mengxr/mllib-pydoc-fix. (cherry picked from commit 65fec798ce52ca6b8b0fe14b78a16712778ad04c) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/883c7d35 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/883c7d35 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/883c7d35 Branch: refs/heads/branch-1.5 Commit: 883c7d35f978a7d8651aaf8e93bd0c9ba09a441d Parents: 2b13532 Author: Xiangrui Meng m...@databricks.com Authored: Thu Aug 13 10:16:40 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 10:16:53 2015 -0700 -- python/pyspark/mllib/regression.py | 14 ++ python/pyspark/mllib/util.py | 1 + 2 files changed, 11 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/883c7d35/python/pyspark/mllib/regression.py -- diff --git a/python/pyspark/mllib/regression.py b/python/pyspark/mllib/regression.py index 5b7afc1..41946e3 100644 --- a/python/pyspark/mllib/regression.py +++ b/python/pyspark/mllib/regression.py @@ -207,8 +207,10 @@ class LinearRegressionWithSGD(object): Train a linear regression model using Stochastic Gradient Descent (SGD). This solves the least squares regression formulation -f(weights) = 1/n ||A weights-y||^2^ -(which is the mean squared error). + +f(weights) = 1/(2n) ||A weights - y||^2, + +which is the mean squared error. Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. @@ -334,7 +336,9 @@ class LassoWithSGD(object): Stochastic Gradient Descent. 
This solves the l1-regularized least squares regression formulation -f(weights) = 1/2n ||A weights-y||^2^ + regParam ||weights||_1 + +f(weights) = 1/(2n) ||A weights - y||^2 + regParam ||weights||_1. + Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. @@ -451,7 +455,9 @@ class RidgeRegressionWithSGD(object): Stochastic Gradient Descent. This solves the l2-regularized least squares regression formulation -f(weights) = 1/2n ||A weights-y||^2^ + regParam/2 ||weights||^2^ + +f(weights) = 1/(2n) ||A weights - y||^2 + regParam/2 ||weights||^2. + Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. http://git-wip-us.apache.org/repos/asf/spark/blob/883c7d35/python/pyspark/mllib/util.py -- diff --git a/python/pyspark/mllib/util.py b/python/pyspark/mllib/util.py index 916de2d..10a1e4b 100644 --- a/python/pyspark/mllib/util.py +++ b/python/pyspark/mllib/util.py @@ -300,6 +300,7 @@ class LinearDataGenerator(object): :param: seed Random Seed :param: eps Used to scale the noise. If eps is set high, the amount of gaussian noise added is more. + Returns a list of LabeledPoints of length nPoints weights = [float(weight) for weight in weights] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
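The corrected docstrings state three objectives: plain least squares f(w) = 1/(2n) ||Aw - y||^2, the lasso adds regParam ||w||_1, and ridge adds regParam/2 ||w||^2. A small numpy sketch of the three objectives as written in the fixed docs (formulas only; Spark's SGD solvers minimize these, which this sketch does not do):

```python
import numpy as np

def lsq(w, A, y):
    """f(w) = 1/(2n) ||A w - y||^2  -- LinearRegressionWithSGD objective."""
    n = len(y)
    return np.sum((A @ w - y) ** 2) / (2 * n)

def lasso(w, A, y, reg):
    """lsq + regParam ||w||_1  -- LassoWithSGD objective."""
    return lsq(w, A, y) + reg * np.sum(np.abs(w))

def ridge(w, A, y, reg):
    """lsq + regParam/2 ||w||^2  -- RidgeRegressionWithSGD objective."""
    return lsq(w, A, y) + reg / 2 * np.sum(w ** 2)

A = np.eye(2)
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])
# residual A w - y = (0, -1), so lsq = 1 / (2 * 2) = 0.25
assert lsq(w, A, y) == 0.25
assert lasso(w, A, y, reg=0.25) == 0.25 + 0.25 * 2    # 0.75
assert ridge(w, A, y, reg=0.25) == 0.25 + 0.125 * 2   # 0.5
```

Note the original docstrings dropped the factor of 2 (`1/2n` rendered ambiguously and `^2^` was broken Sphinx markup); the patch writes `1/(2n)` explicitly so pydoc matches the implementation.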
spark git commit: [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel
Repository: spark Updated Branches: refs/heads/master 7a539ef3b - 4b70798c9 [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on. Author: Yanbo Liang yblia...@gmail.com Closes #8164 from yanboliang/mlp-name. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4b70798c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4b70798c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4b70798c Branch: refs/heads/master Commit: 4b70798c96b0a784b85fda461426ec60f609be12 Parents: 7a539ef Author: Yanbo Liang yblia...@gmail.com Authored: Thu Aug 13 09:31:14 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 09:31:14 2015 -0700 -- .../MultilayerPerceptronClassifier.scala| 16 1 file changed, 8 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4b70798c/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index 8cd2103..c154561 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -131,7 +131,7 @@ private object LabelConverter { */ @Experimental class MultilayerPerceptronClassifier(override val uid: String) - extends Predictor[Vector, MultilayerPerceptronClassifier, MultilayerPerceptronClassifierModel] + extends Predictor[Vector, MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel] with 
MultilayerPerceptronParams { def this() = this(Identifiable.randomUID(mlpc)) @@ -146,7 +146,7 @@ class MultilayerPerceptronClassifier(override val uid: String) * @param dataset Training dataset * @return Fitted model */ - override protected def train(dataset: DataFrame): MultilayerPerceptronClassifierModel = { + override protected def train(dataset: DataFrame): MultilayerPerceptronClassificationModel = { val myLayers = $(layers) val labels = myLayers.last val lpData = extractLabeledPoints(dataset) @@ -156,13 +156,13 @@ class MultilayerPerceptronClassifier(override val uid: String) FeedForwardTrainer.LBFGSOptimizer.setConvergenceTol($(tol)).setNumIterations($(maxIter)) FeedForwardTrainer.setStackSize($(blockSize)) val mlpModel = FeedForwardTrainer.train(data) -new MultilayerPerceptronClassifierModel(uid, myLayers, mlpModel.weights()) +new MultilayerPerceptronClassificationModel(uid, myLayers, mlpModel.weights()) } } /** * :: Experimental :: - * Classifier model based on the Multilayer Perceptron. + * Classification model based on the Multilayer Perceptron. * Each layer has sigmoid activation function, output layer has softmax. 
* @param uid uid * @param layers array of layer sizes including input and output layers @@ -170,11 +170,11 @@ class MultilayerPerceptronClassifier(override val uid: String) * @return prediction model */ @Experimental -class MultilayerPerceptronClassifierModel private[ml] ( +class MultilayerPerceptronClassificationModel private[ml] ( override val uid: String, layers: Array[Int], weights: Vector) - extends PredictionModel[Vector, MultilayerPerceptronClassifierModel] + extends PredictionModel[Vector, MultilayerPerceptronClassificationModel] with Serializable { private val mlpModel = FeedForwardTopology.multiLayerPerceptron(layers, true).getInstance(weights) @@ -187,7 +187,7 @@ class MultilayerPerceptronClassifierModel private[ml] ( LabelConverter.decodeLabel(mlpModel.predict(features)) } - override def copy(extra: ParamMap): MultilayerPerceptronClassifierModel = { -copyValues(new MultilayerPerceptronClassifierModel(uid, layers, weights), extra) + override def copy(extra: ParamMap): MultilayerPerceptronClassificationModel = { +copyValues(new MultilayerPerceptronClassificationModel(uid, layers, weights), extra) } }
spark git commit: [MINOR] [DOC] fix mllib pydoc warnings
Repository: spark Updated Branches: refs/heads/master 4b70798c9 - 65fec798c [MINOR] [DOC] fix mllib pydoc warnings Switch to correct Sphinx syntax. MechCoder Author: Xiangrui Meng m...@databricks.com Closes #8169 from mengxr/mllib-pydoc-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65fec798 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65fec798 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65fec798 Branch: refs/heads/master Commit: 65fec798ce52ca6b8b0fe14b78a16712778ad04c Parents: 4b70798 Author: Xiangrui Meng m...@databricks.com Authored: Thu Aug 13 10:16:40 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 10:16:40 2015 -0700 -- python/pyspark/mllib/regression.py | 14 ++ python/pyspark/mllib/util.py | 1 + 2 files changed, 11 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/65fec798/python/pyspark/mllib/regression.py -- diff --git a/python/pyspark/mllib/regression.py b/python/pyspark/mllib/regression.py index 5b7afc1..41946e3 100644 --- a/python/pyspark/mllib/regression.py +++ b/python/pyspark/mllib/regression.py @@ -207,8 +207,10 @@ class LinearRegressionWithSGD(object): Train a linear regression model using Stochastic Gradient Descent (SGD). This solves the least squares regression formulation -f(weights) = 1/n ||A weights-y||^2^ -(which is the mean squared error). + +f(weights) = 1/(2n) ||A weights - y||^2, + +which is the mean squared error. Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. @@ -334,7 +336,9 @@ class LassoWithSGD(object): Stochastic Gradient Descent. This solves the l1-regularized least squares regression formulation -f(weights) = 1/2n ||A weights-y||^2^ + regParam ||weights||_1 + +f(weights) = 1/(2n) ||A weights - y||^2 + regParam ||weights||_1. 
+ Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. @@ -451,7 +455,9 @@ class RidgeRegressionWithSGD(object): Stochastic Gradient Descent. This solves the l2-regularized least squares regression formulation -f(weights) = 1/2n ||A weights-y||^2^ + regParam/2 ||weights||^2^ + +f(weights) = 1/(2n) ||A weights - y||^2 + regParam/2 ||weights||^2. + Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. http://git-wip-us.apache.org/repos/asf/spark/blob/65fec798/python/pyspark/mllib/util.py -- diff --git a/python/pyspark/mllib/util.py b/python/pyspark/mllib/util.py index 916de2d..10a1e4b 100644 --- a/python/pyspark/mllib/util.py +++ b/python/pyspark/mllib/util.py @@ -300,6 +300,7 @@ class LinearDataGenerator(object): :param: seed Random Seed :param: eps Used to scale the noise. If eps is set high, the amount of gaussian noise added is more. + Returns a list of LabeledPoints of length nPoints weights = [float(weight) for weight in weights] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString
Repository: spark Updated Branches: refs/heads/master c2520f501 - 6c5858bc6 [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better. ~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~ I also removed `invert`. jkbradley holdenk Author: Xiangrui Meng m...@databricks.com Closes #8152 from mengxr/SPARK-9922. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6c5858bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6c5858bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6c5858bc Branch: refs/heads/master Commit: 6c5858bc65c8a8602422b46bfa9cf0a1fb296b88 Parents: c2520f5 Author: Xiangrui Meng m...@databricks.com Authored: Thu Aug 13 16:52:17 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 16:52:17 2015 -0700 -- .../apache/spark/ml/feature/StringIndexer.scala | 34 + .../spark/ml/feature/StringIndexerSuite.scala | 50 ++-- 2 files changed, 48 insertions(+), 36 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6c5858bc/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index 9e4b0f0..9f6e7b6 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala @@ -24,7 +24,7 @@ import org.apache.spark.ml.attribute.{Attribute, NominalAttribute} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.Transformer -import org.apache.spark.ml.util.{Identifiable, MetadataUtils} +import 
org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.{DoubleType, NumericType, StringType, StructType} @@ -59,6 +59,8 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha * If the input column is numeric, we cast it to string and index the string values. * The indices are in [0, numLabels), ordered by label frequencies. * So the most frequent label gets index 0. + * + * @see [[IndexToString]] for the inverse transformation */ @Experimental class StringIndexer(override val uid: String) extends Estimator[StringIndexerModel] @@ -170,34 +172,24 @@ class StringIndexerModel private[ml] ( val copied = new StringIndexerModel(uid, labels) copyValues(copied, extra).setParent(parent) } - - /** - * Return a model to perform the inverse transformation. - * Note: By default we keep the original columns during this transformation, so the inverse - * should only be used on new columns such as predicted labels. - */ - def invert(inputCol: String, outputCol: String): StringIndexerInverse = { -new StringIndexerInverse() - .setInputCol(inputCol) - .setOutputCol(outputCol) - .setLabels(labels) - } } /** * :: Experimental :: - * Transform a provided column back to the original input types using either the metadata - * on the input column, or if provided using the labels supplied by the user. - * Note: By default we keep the original columns during this transformation, - * so the inverse should only be used on new columns such as predicted labels. + * A [[Transformer]] that maps a column of string indices back to a new column of corresponding + * string values using either the ML attributes of the input column, or if provided using the labels + * supplied by the user. + * All original columns are kept during transformation. 
+ * + * @see [[StringIndexer]] for converting strings into indices */ @Experimental -class StringIndexerInverse private[ml] ( +class IndexToString private[ml] ( override val uid: String) extends Transformer with HasInputCol with HasOutputCol { def this() = -this(Identifiable.randomUID("strIdxInv")) +this(Identifiable.randomUID("idxToStr")) /** @group setParam */ def setInputCol(value: String): this.type = set(inputCol, value) @@ -257,7 +249,7 @@ class StringIndexerInverse private[ml] ( } val indexer = udf { index: Double => val idx = index.toInt - if (0 <= idx && idx < values.size) { + if (0 <= idx && idx < values.length) { values(idx) } else { throw new SparkException(s"Unseen index: $index ??") @@ -268,7 +260,7 @@ class StringIndexerInverse private
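The udf patched above maps a double-valued index back to its label only when it falls in [0, labels.length), and fails loudly on an unseen index. A plain-Python sketch of that bounds-checked mapping (illustrative, not the Spark implementation):

```python
def index_to_string(index, labels):
    # Mirror of the udf in IndexToString: cast the double index to int,
    # return the label if it is in range, otherwise fail on the unseen index.
    idx = int(index)
    if 0 <= idx < len(labels):
        return labels[idx]
    raise ValueError(f"Unseen index: {index}")

labels = ["a", "b", "c"]
decoded = [index_to_string(i, labels) for i in (0.0, 2.0, 1.0)]  # ['a', 'c', 'b']
```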
spark git commit: [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString
Repository: spark Updated Branches: refs/heads/branch-1.5 2c7f8da58 - 2b6b1d12f [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better. ~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~ I also removed `invert`. jkbradley holdenk Author: Xiangrui Meng m...@databricks.com Closes #8152 from mengxr/SPARK-9922. (cherry picked from commit 6c5858bc65c8a8602422b46bfa9cf0a1fb296b88) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2b6b1d12 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2b6b1d12 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2b6b1d12 Branch: refs/heads/branch-1.5 Commit: 2b6b1d12fb6bd0bd86988babc4c807856011f246 Parents: 2c7f8da Author: Xiangrui Meng m...@databricks.com Authored: Thu Aug 13 16:52:17 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 16:54:06 2015 -0700 -- .../apache/spark/ml/feature/StringIndexer.scala | 34 ++ .../spark/ml/feature/StringIndexerSuite.scala | 47 ++-- 2 files changed, 47 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2b6b1d12/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index 569c834..b87e154 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala @@ -24,7 +24,7 @@ import org.apache.spark.ml.attribute.{Attribute, NominalAttribute} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ 
import org.apache.spark.ml.Transformer -import org.apache.spark.ml.util.{Identifiable, MetadataUtils} +import org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.{DoubleType, NumericType, StringType, StructType} @@ -58,6 +58,8 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha * If the input column is numeric, we cast it to string and index the string values. * The indices are in [0, numLabels), ordered by label frequencies. * So the most frequent label gets index 0. + * + * @see [[IndexToString]] for the inverse transformation */ @Experimental class StringIndexer(override val uid: String) extends Estimator[StringIndexerModel] @@ -152,34 +154,24 @@ class StringIndexerModel private[ml] ( val copied = new StringIndexerModel(uid, labels) copyValues(copied, extra).setParent(parent) } - - /** - * Return a model to perform the inverse transformation. - * Note: By default we keep the original columns during this transformation, so the inverse - * should only be used on new columns such as predicted labels. - */ - def invert(inputCol: String, outputCol: String): StringIndexerInverse = { -new StringIndexerInverse() - .setInputCol(inputCol) - .setOutputCol(outputCol) - .setLabels(labels) - } } /** * :: Experimental :: - * Transform a provided column back to the original input types using either the metadata - * on the input column, or if provided using the labels supplied by the user. - * Note: By default we keep the original columns during this transformation, - * so the inverse should only be used on new columns such as predicted labels. + * A [[Transformer]] that maps a column of string indices back to a new column of corresponding + * string values using either the ML attributes of the input column, or if provided using the labels + * supplied by the user. + * All original columns are kept during transformation. 
+ * + * @see [[StringIndexer]] for converting strings into indices */ @Experimental -class StringIndexerInverse private[ml] ( +class IndexToString private[ml] ( override val uid: String) extends Transformer with HasInputCol with HasOutputCol { def this() = -this(Identifiable.randomUID("strIdxInv")) +this(Identifiable.randomUID("idxToStr")) /** @group setParam */ def setInputCol(value: String): this.type = set(inputCol, value) @@ -239,7 +231,7 @@ class StringIndexerInverse private[ml] ( } val indexer = udf { index: Double => val idx = index.toInt - if (0 <= idx && idx < values.size) { + if (0 <= idx && idx < values.length) { values(idx
spark git commit: [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params
Repository: spark Updated Branches: refs/heads/master caa14d9dc - 6e409bc13 [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params As per the TODO move weightCol to Shared Params. Author: Holden Karau hol...@pigscanfly.ca Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6e409bc1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6e409bc1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6e409bc1 Branch: refs/heads/master Commit: 6e409bc1357f49de2efdfc4226d074b943fb1153 Parents: caa14d9 Author: Holden Karau hol...@pigscanfly.ca Authored: Wed Aug 12 16:54:45 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 16:54:45 2015 -0700 -- .../spark/ml/param/shared/SharedParamsCodeGen.scala | 4 +++- .../apache/spark/ml/param/shared/sharedParams.scala | 15 +++ .../spark/ml/regression/IsotonicRegression.scala| 16 ++-- 3 files changed, 20 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6e409bc1/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala index 9e12f18..8c16c61 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala @@ -70,7 +70,9 @@ private[shared] object SharedParamsCodeGen { For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty., isValid = ParamValidators.inRange(0, 1)), ParamDesc[Double](tol, the convergence tolerance for iterative algorithms), - ParamDesc[Double](stepSize, Step size to be used for each iteration of optimization.)) + ParamDesc[Double](stepSize, Step size to be used for each iteration of optimization.), + ParamDesc[String](weightCol, weight column name. If this is not set or empty, we treat + +all instance weights as 1.0.)) val code = genSharedParams(params) val file = src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala http://git-wip-us.apache.org/repos/asf/spark/blob/6e409bc1/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala index a17d4ea..c267689 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala @@ -342,4 +342,19 @@ private[ml] trait HasStepSize extends Params { /** @group getParam */ final def getStepSize: Double = $(stepSize) } + +/** + * Trait for shared param weightCol. + */ +private[ml] trait HasWeightCol extends Params { + + /** + * Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.. + * @group param + */ + final val weightCol: Param[String] = new Param[String](this, weightCol, weight column name. If this is not set or empty, we treat all instance weights as 1.0.) 
+ + /** @group getParam */ + final def getWeightCol: String = $(weightCol) +} // scalastyle:on http://git-wip-us.apache.org/repos/asf/spark/blob/6e409bc1/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala index f570590..0f33bae 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala @@ -21,7 +21,7 @@ import org.apache.spark.Logging import org.apache.spark.annotation.Experimental import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasPredictionCol} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol} import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors} import org.apache.spark.mllib.regression.{IsotonicRegression = MLlibIsotonicRegression, IsotonicRegressionModel = MLlibIsotonicRegressionModel
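The new shared weightCol param fixes a convention: when the column is not set or empty, every instance weight is treated as 1.0. A minimal Python sketch of that convention, using hypothetical row dicts rather than Spark DataFrames:

```python
def resolve_weights(rows, weight_col=None):
    # If weightCol is not set or empty, treat all instance weights as 1.0
    # (the behavior documented by the new HasWeightCol shared param).
    if not weight_col:
        return [1.0] * len(rows)
    return [float(row[weight_col]) for row in rows]

rows = [{"label": 1.0, "w": 2.0}, {"label": 0.0, "w": 0.5}]
unweighted = resolve_weights(rows)         # [1.0, 1.0]
weighted = resolve_weights(rows, "w")      # [2.0, 0.5]
```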
spark git commit: [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params
Repository: spark Updated Branches: refs/heads/branch-1.5 6aca0cf34 - 2f8793b5f [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params As per the TODO move weightCol to Shared Params. Author: Holden Karau hol...@pigscanfly.ca Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams. (cherry picked from commit 6e409bc1357f49de2efdfc4226d074b943fb1153) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2f8793b5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2f8793b5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2f8793b5 Branch: refs/heads/branch-1.5 Commit: 2f8793b5f47ec7c17b27715bc9b1026266061cea Parents: 6aca0cf Author: Holden Karau hol...@pigscanfly.ca Authored: Wed Aug 12 16:54:45 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 16:54:52 2015 -0700 -- .../spark/ml/param/shared/SharedParamsCodeGen.scala | 4 +++- .../apache/spark/ml/param/shared/sharedParams.scala | 15 +++ .../spark/ml/regression/IsotonicRegression.scala| 16 ++-- 3 files changed, 20 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2f8793b5/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala index 5cb7235..3899df6 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala @@ -66,7 +66,9 @@ private[shared] object SharedParamsCodeGen { For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty., isValid = ParamValidators.inRange(0, 1)), ParamDesc[Double](tol, the convergence tolerance for iterative algorithms), - ParamDesc[Double](stepSize, Step size to be used for each iteration of optimization.)) + ParamDesc[Double](stepSize, Step size to be used for each iteration of optimization.), + ParamDesc[String](weightCol, weight column name. If this is not set or empty, we treat + +all instance weights as 1.0.)) val code = genSharedParams(params) val file = src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala http://git-wip-us.apache.org/repos/asf/spark/blob/2f8793b5/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala index d4c89e6..e8e58aa 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala @@ -327,4 +327,19 @@ private[ml] trait HasStepSize extends Params { /** @group getParam */ final def getStepSize: Double = $(stepSize) } + +/** + * Trait for shared param weightCol. + */ +private[ml] trait HasWeightCol extends Params { + + /** + * Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.. + * @group param + */ + final val weightCol: Param[String] = new Param[String](this, weightCol, weight column name. If this is not set or empty, we treat all instance weights as 1.0.) 
+ + /** @group getParam */ + final def getWeightCol: String = $(weightCol) +} // scalastyle:on http://git-wip-us.apache.org/repos/asf/spark/blob/2f8793b5/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala index f570590..0f33bae 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala @@ -21,7 +21,7 @@ import org.apache.spark.Logging import org.apache.spark.annotation.Experimental import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasPredictionCol} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol} import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors} import
spark git commit: [SPARK-9913] [MLLIB] LDAUtils should be private
Repository: spark Updated Branches: refs/heads/branch-1.5 08f767a1e - 6aca0cf34 [SPARK-9913] [MLLIB] LDAUtils should be private feynmanliang Author: Xiangrui Meng m...@databricks.com Closes #8142 from mengxr/SPARK-9913. (cherry picked from commit caa14d9dc9e2eb1102052b22445b63b0e004e3c7) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6aca0cf3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6aca0cf3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6aca0cf3 Branch: refs/heads/branch-1.5 Commit: 6aca0cf348ca0731ef72155f5a5d7739b796bb3b Parents: 08f767a Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 16:53:47 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 16:53:56 2015 -0700 -- .../main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6aca0cf3/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala index f7e5ce1..a9ba7b6 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala @@ -22,7 +22,7 @@ import breeze.numerics._ /** * Utility methods for LDA. */ -object LDAUtils { +private[clustering] object LDAUtils { /** * Log Sum Exp with overflow protection using the identity: * For any a: \log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9913] [MLLIB] LDAUtils should be private
Repository: spark Updated Branches: refs/heads/master 7035d880a - caa14d9dc [SPARK-9913] [MLLIB] LDAUtils should be private feynmanliang Author: Xiangrui Meng m...@databricks.com Closes #8142 from mengxr/SPARK-9913. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/caa14d9d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/caa14d9d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/caa14d9d Branch: refs/heads/master Commit: caa14d9dc9e2eb1102052b22445b63b0e004e3c7 Parents: 7035d88 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 16:53:47 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 16:53:47 2015 -0700 -- .../main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/caa14d9d/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala index f7e5ce1..a9ba7b6 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala @@ -22,7 +22,7 @@ import breeze.numerics._ /** * Utility methods for LDA. */ -object LDAUtils { +private[clustering] object LDAUtils { /** * Log Sum Exp with overflow protection using the identity: * For any a: \log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
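The LDAUtils comment above states the overflow-protection identity log Σ exp(x_n) = a + log Σ exp(x_n − a); choosing a = max(x) keeps every exponent at or below zero. An illustrative NumPy version (not the Breeze-based Spark code):

```python
import numpy as np

def log_sum_exp(x):
    # Overflow-protected log-sum-exp: shift by a = max(x) so every
    # exponent is <= 0, then add the shift back outside the log.
    x = np.asarray(x, dtype=float)
    a = x.max()
    return float(a + np.log(np.exp(x - a).sum()))

naive = float(np.log(np.exp([1.0, 2.0, 3.0]).sum()))
stable = log_sum_exp([1.0, 2.0, 3.0])   # matches the naive result
huge = log_sum_exp([1000.0, 1000.0])    # naive np.exp(1000.0) would overflow
```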
spark git commit: [SPARK-9915] [ML] stopWords should use StringArrayParam
Repository: spark Updated Branches: refs/heads/master e6aef5576 - fc1c7fd66 [SPARK-9915] [ML] stopWords should use StringArrayParam hhbyyh Author: Xiangrui Meng m...@databricks.com Closes #8141 from mengxr/SPARK-9915. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc1c7fd6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc1c7fd6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc1c7fd6 Branch: refs/heads/master Commit: fc1c7fd66e64ccea53b31cd2fbb98bc6d307329c Parents: e6aef55 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 17:06:12 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 17:06:12 2015 -0700 -- .../scala/org/apache/spark/ml/feature/StopWordsRemover.scala | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fc1c7fd6/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala index 3cc4142..5d77ea0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala @@ -19,12 +19,12 @@ package org.apache.spark.ml.feature import org.apache.spark.annotation.Experimental import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{BooleanParam, ParamMap, StringArrayParam} import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} -import org.apache.spark.ml.param.{ParamMap, BooleanParam, Param} import org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.types.{StringType, StructField, ArrayType, StructType} import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType} 
/** * stop words list @@ -100,7 +100,7 @@ class StopWordsRemover(override val uid: String) * the stop words set to be filtered out * @group param */ - val stopWords: Param[Array[String]] = new Param(this, "stopWords", "stop words") + val stopWords: StringArrayParam = new StringArrayParam(this, "stopWords", "stop words") /** @group setParam */ def setStopWords(value: Array[String]): this.type = set(stopWords, value) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
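StopWordsRemover filters a token array against a stop-words set, with a case-sensitivity switch. A rough pure-Python sketch of that behavior (an approximation, not the Spark API):

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    # Keep every token not in the stop-word set; when matching is
    # case-insensitive, compare lower-cased forms (a simplification of
    # the transformer's caseSensitive=false path).
    if case_sensitive:
        stops = set(stop_words)
        return [t for t in tokens if t not in stops]
    stops = {s.lower() for s in stop_words}
    return [t for t in tokens if t.lower() not in stops]

filtered = remove_stop_words(["The", "red", "balloon"], ["the", "a"])  # ['red', 'balloon']
```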
spark git commit: [SPARK-8967] [DOC] add Since annotation
Repository: spark Updated Branches: refs/heads/branch-1.5 bdf8dc15d - 6a7582ea2 [SPARK-8967] [DOC] add Since annotation Add `Since` as a Scala annotation. The benefit is that we can use it without having explicit JavaDoc. This is useful for inherited methods. The limitation is that is doesn't show up in the generated Java API documentation. This might be fixed by modifying genjavadoc. I think we could leave it as a TODO. This is how the generated Scala doc looks: `since` JavaDoc tag: ![screen shot 2015-08-11 at 10 00 37 pm](https://cloud.githubusercontent.com/assets/829644/9230761/fa72865c-40d8-11e5-807e-0f3c815c5acd.png) `Since` annotation: ![screen shot 2015-08-11 at 10 00 28 pm](https://cloud.githubusercontent.com/assets/829644/9230764/0041d7f4-40d9-11e5-8124-c3f3e5d5b31f.png) rxin Author: Xiangrui Meng m...@databricks.com Closes #8131 from mengxr/SPARK-8967. (cherry picked from commit 6f60298b1d7aa97268a42eca1e3b4851a7e88cb5) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a7582ea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a7582ea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a7582ea Branch: refs/heads/branch-1.5 Commit: 6a7582ea2d232982c3480e7d4ee357ea45d0b303 Parents: bdf8dc1 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 14:28:23 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 14:28:34 2015 -0700 -- .../org/apache/spark/annotation/Since.scala | 28 1 file changed, 28 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6a7582ea/core/src/main/scala/org/apache/spark/annotation/Since.scala -- diff --git a/core/src/main/scala/org/apache/spark/annotation/Since.scala b/core/src/main/scala/org/apache/spark/annotation/Since.scala new file mode 100644 index 000..fa59393 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/annotation/Since.scala @@ -0,0 +1,28 @@ 
+/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.annotation + +import scala.annotation.StaticAnnotation + +/** + * A Scala annotation that specifies the Spark version when a definition was added. + * Different from the `@since` tag in JavaDoc, this annotation does not require explicit JavaDoc and + * hence works for overridden methods that inherit API documentation directly from parents. + * The limitation is that it does not show up in the generated Java API documentation. + */ +private[spark] class Since(version: String) extends StaticAnnotation - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
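Scala's @Since is a StaticAnnotation, so it attaches a version to a definition without requiring explicit JavaDoc. Python has no direct equivalent, but a hypothetical decorator conveys the idea of tagging API entry points with the version that introduced them:

```python
def since(version):
    # Hypothetical analogue of Spark's @Since annotation: attach the
    # introducing version to a function as plain metadata.
    def wrap(func):
        func.__since__ = version
        return func
    return wrap

@since("1.5.0")
def transform(df):
    """Example API entry point (hypothetical)."""
    return df
```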
spark git commit: [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType
Repository: spark Updated Branches: refs/heads/master 6e409bc13 - e6aef5576 [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType hhbyyh Author: Xiangrui Meng m...@databricks.com Closes #8140 from mengxr/SPARK-9912. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e6aef557 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e6aef557 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e6aef557 Branch: refs/heads/master Commit: e6aef55766d0e2a48e0f9cb6eda0e31a71b962f3 Parents: 6e409bc Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 17:04:31 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 17:04:31 2015 -0700 -- .../org/apache/spark/mllib/linalg/SingularValueDecomposition.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e6aef557/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala index b416d50..cff5dbe 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala @@ -31,5 +31,5 @@ case class SingularValueDecomposition[UType, VType](U: UType, s: Vector, V: VTyp * Represents QR factors. */ @Experimental -case class QRDecomposition[UType, VType](Q: UType, R: VType) +case class QRDecomposition[QType, RType](Q: QType, R: RType) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
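The rename only fixes the type-parameter names: a QR factorization yields Q (orthonormal columns) and R (upper triangular), so QType/RType describe the factors where UType/VType (copied from the SVD case class) did not. A NumPy sketch of the factorization the case class carries:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
Q, R = np.linalg.qr(A)  # Q: 3x2 with orthonormal columns, R: 2x2 upper triangular

reconstructed_ok = bool(np.allclose(Q @ R, A))
orthonormal_ok = bool(np.allclose(Q.T @ Q, np.eye(2)))
upper_triangular_ok = bool(np.allclose(R, np.triu(R)))
```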
spark git commit: [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType
Repository: spark Updated Branches: refs/heads/branch-1.5 2f8793b5f - 31b7fdc06 [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType hhbyyh Author: Xiangrui Meng m...@databricks.com Closes #8140 from mengxr/SPARK-9912. (cherry picked from commit e6aef55766d0e2a48e0f9cb6eda0e31a71b962f3) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/31b7fdc0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/31b7fdc0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/31b7fdc0 Branch: refs/heads/branch-1.5 Commit: 31b7fdc06fc21fa38ac4530f9c70dd27b3b71578 Parents: 2f8793b Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 17:04:31 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 17:04:37 2015 -0700 -- .../org/apache/spark/mllib/linalg/SingularValueDecomposition.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/31b7fdc0/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala index b416d50..cff5dbe 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala @@ -31,5 +31,5 @@ case class SingularValueDecomposition[UType, VType](U: UType, s: Vector, V: VTyp * Represents QR factors. */ @Experimental -case class QRDecomposition[UType, VType](Q: UType, R: VType) +case class QRDecomposition[QType, RType](Q: QType, R: RType) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9915] [ML] stopWords should use StringArrayParam
Repository: spark
Updated Branches: refs/heads/branch-1.5 31b7fdc06 -> ed73f5439

[SPARK-9915] [ML] stopWords should use StringArrayParam

hhbyyh

Author: Xiangrui Meng <m...@databricks.com>

Closes #8141 from mengxr/SPARK-9915.

(cherry picked from commit fc1c7fd66e64ccea53b31cd2fbb98bc6d307329c)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ed73f543
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ed73f543
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ed73f543

Branch: refs/heads/branch-1.5
Commit: ed73f5439bbe3a09adf9a770c34b5d87b35499c8
Parents: 31b7fdc
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 17:06:12 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 17:06:19 2015 -0700

--
 .../scala/org/apache/spark/ml/feature/StopWordsRemover.scala | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ed73f543/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
index 3cc4142..5d77ea0 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
@@ -19,12 +19,12 @@ package org.apache.spark.ml.feature

 import org.apache.spark.annotation.Experimental
 import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{BooleanParam, ParamMap, StringArrayParam}
 import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
-import org.apache.spark.ml.param.{ParamMap, BooleanParam, Param}
 import org.apache.spark.ml.util.Identifiable
 import org.apache.spark.sql.DataFrame
-import org.apache.spark.sql.types.{StringType, StructField, ArrayType, StructType}
 import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

 /**
  * stop words list
@@ -100,7 +100,7 @@ class StopWordsRemover(override val uid: String)
    * the stop words set to be filtered out
    * @group param
    */
-  val stopWords: Param[Array[String]] = new Param(this, "stopWords", "stop words")
+  val stopWords: StringArrayParam = new StringArrayParam(this, "stopWords", "stop words")

   /** @group setParam */
   def setStopWords(value: Array[String]): this.type = set(stopWords, value)
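The `stopWords` parameter changed here is just a configurable string array that the transformer filters tokens against. The core operation can be sketched in a few lines of plain Python (the tiny default set and the `case_sensitive` flag are illustrative assumptions, not MLlib's actual defaults):

```python
# Illustrative only -- a tiny stand-in for the transformer's default stop-word list.
DEFAULT_STOP_WORDS = {"i", "a", "the", "and", "about"}

def remove_stop_words(tokens, stop_words=None, case_sensitive=False):
    """Drop tokens that appear in the stop-word set, mirroring the role of
    the stopWords string-array parameter in StopWordsRemover."""
    words = set(stop_words) if stop_words is not None else DEFAULT_STOP_WORDS
    if not case_sensitive:
        lowered = {w.lower() for w in words}
        return [t for t in tokens if t.lower() not in lowered]
    return [t for t in tokens if t not in words]
```

Exposing the word list as an array-typed parameter (rather than a generic `Param`) is what lets Java callers pass a `String[]` directly, which is the point of the patch.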
spark git commit: [SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes
Repository: spark
Updated Branches: refs/heads/branch-1.5 a06860c2f -> af470a757

[SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes

There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary.

feynmanliang

Author: Xiangrui Meng <m...@databricks.com>

Closes #8136 from mengxr/SPARK-9903.

(cherry picked from commit d7053bea985679c514b3add029631ea23e1730ce)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/af470a75
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/af470a75
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/af470a75

Branch: refs/heads/branch-1.5
Commit: af470a757c7aed81d626634590a0fb395f0241f5
Parents: a06860c
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 20:44:40 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 20:44:49 2015 -0700

--
 .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 37 +++-
 1 file changed, 21 insertions(+), 16 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/af470a75/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index ad6715b5..dc4ae1d 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -282,25 +282,30 @@ object PrefixSpan extends Logging {
       largePrefixes = newLargePrefixes
     }

-    // Switch to local processing.
-    val bcSmallPrefixes = sc.broadcast(smallPrefixes)
-    val distributedFreqPattern = postfixes.flatMap { postfix =>
-      bcSmallPrefixes.value.values.map { prefix =>
-        (prefix.id, postfix.project(prefix).compressed)
-      }.filter(_._2.nonEmpty)
-    }.groupByKey().flatMap { case (id, projPostfixes) =>
-      val prefix = bcSmallPrefixes.value(id)
-      val localPrefixSpan = new LocalPrefixSpan(minCount, maxPatternLength - prefix.length)
-      // TODO: We collect projected postfixes into memory. We should also compare the performance
-      // TODO: of keeping them on shuffle files.
-      localPrefixSpan.run(projPostfixes.toArray).map { case (pattern, count) =>
-        (prefix.items ++ pattern, count)
+    var freqPatterns = sc.parallelize(localFreqPatterns, 1)
+
+    val numSmallPrefixes = smallPrefixes.size
+    logInfo(s"number of small prefixes for local processing: $numSmallPrefixes")
+    if (numSmallPrefixes > 0) {
+      // Switch to local processing.
+      val bcSmallPrefixes = sc.broadcast(smallPrefixes)
+      val distributedFreqPattern = postfixes.flatMap { postfix =>
+        bcSmallPrefixes.value.values.map { prefix =>
+          (prefix.id, postfix.project(prefix).compressed)
+        }.filter(_._2.nonEmpty)
+      }.groupByKey().flatMap { case (id, projPostfixes) =>
+        val prefix = bcSmallPrefixes.value(id)
+        val localPrefixSpan = new LocalPrefixSpan(minCount, maxPatternLength - prefix.length)
+        // TODO: We collect projected postfixes into memory. We should also compare the performance
+        // TODO: of keeping them on shuffle files.
+        localPrefixSpan.run(projPostfixes.toArray).map { case (pattern, count) =>
+          (prefix.items ++ pattern, count)
+        }
       }
+      // Union local frequent patterns and distributed ones.
+      freqPatterns = freqPatterns ++ distributedFreqPattern
     }
-    // Union local frequent patterns and distributed ones.
-    val freqPatterns = (sc.parallelize(localFreqPatterns, 1) ++ distributedFreqPattern)
-      .persist(StorageLevel.MEMORY_AND_DISK)
     freqPatterns
   }
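For readers unfamiliar with the algorithm being patched: PrefixSpan mines frequent sequential patterns by growing prefixes and recursing on the "projected" postfixes of each prefix. The following is a deliberately tiny single-machine sketch over sequences of single items (nothing like MLlib's distributed implementation, which splits work into large prefixes handled distributively and small prefixes handled locally -- the step this patch skips when no small prefixes remain):

```python
from collections import Counter

def prefix_span(sequences, min_support, max_length):
    """Tiny PrefixSpan-style miner: returns {pattern_tuple: support} for all
    patterns up to max_length occurring in >= min_support sequences."""
    results = {}

    def grow(prefix, projected):
        if len(prefix) == max_length:
            # Analogous to the patched code path: if every surviving prefix has
            # already grown to the maximum pattern length, there is no further
            # (local) work to do.
            return
        counts = Counter()
        for postfix in projected:
            counts.update(set(postfix))  # count each item once per sequence
        for item, support in counts.items():
            if support >= min_support:
                pattern = prefix + (item,)
                results[pattern] = support
                # project each postfix past the first occurrence of `item`
                grow(pattern, [p[p.index(item) + 1:] for p in projected if item in p])

    grow((), sequences)
    return results
```

The guard `if (numSmallPrefixes > 0)` added by the patch plays the same role as the early `return` above: when the candidate set for further (local) processing is empty, the whole projection/group-by stage is skipped instead of being run on nothing.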
spark git commit: [SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler
Repository: spark
Updated Branches: refs/heads/branch-1.5 8229437c3 -> 16f4bf4ca

[SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler

hhbyyh

Author: Xiangrui Meng <m...@databricks.com>

Closes #8145 from mengxr/SPARK-9917.

(cherry picked from commit 5fc058a1fc5d83ad53feec936475484aef3800b3)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16f4bf4c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16f4bf4c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16f4bf4c

Branch: refs/heads/branch-1.5
Commit: 16f4bf4caa9c6a1403252485470466266d6b1b65
Parents: 8229437
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 21:33:38 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 21:33:46 2015 -0700

--
 .../scala/org/apache/spark/ml/feature/MinMaxScaler.scala | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/16f4bf4c/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
index b30adf3..9a473dd 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
@@ -41,6 +41,9 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
   val min: DoubleParam = new DoubleParam(this, "min", "lower bound of the output feature range")

+  /** @group getParam */
+  def getMin: Double = $(min)
+
   /**
    * upper bound after transformation, shared by all features
    * Default: 1.0
@@ -49,6 +52,9 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
   val max: DoubleParam = new DoubleParam(this, "max", "upper bound of the output feature range")

+  /** @group getParam */
+  def getMax: Double = $(max)
+
   /** Validates and transforms the input schema. */
   protected def validateAndTransformSchema(schema: StructType): StructType = {
     val inputType = schema($(inputCol)).dataType
@@ -115,6 +121,9 @@ class MinMaxScaler(override val uid: String)
  * :: Experimental ::
  * Model fitted by [[MinMaxScaler]].
  *
+ * @param originalMin min value for each original column during fitting
+ * @param originalMax max value for each original column during fitting
+ *
  * TODO: The transformer does not yet set the metadata in the output column (SPARK-8529).
  */
 @Experimental
@@ -136,7 +145,6 @@ class MinMaxScalerModel private[ml] (
   /** @group setParam */
   def setMax(value: Double): this.type = set(max, value)

-
   override def transform(dataset: DataFrame): DataFrame = {
     val originalRange = (originalMax.toBreeze - originalMin.toBreeze).toArray
     val minArray = originalMin.toArray
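The `min`/`max` parameters documented above bound the output range, while `originalMin`/`originalMax` are the per-column statistics captured during fitting. The rescaling the fitted model applies can be sketched for a single column in plain Python (the midpoint behavior for a constant column is an assumption for illustration):

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Rescale a numeric column to [new_min, new_max] using its fitted
    original min/max (the role of originalMin/originalMax in the model)."""
    original_min, original_max = min(column), max(column)
    original_range = original_max - original_min
    if original_range == 0:
        # Constant column: map everything to the midpoint of the target range
        # (illustrative handling of the degenerate case).
        return [0.5 * (new_min + new_max)] * len(column)
    scale = (new_max - new_min) / original_range
    return [(x - original_min) * scale + new_min for x in column]
```

This mirrors the `transform` body in the diff, where `originalRange = originalMax - originalMin` per column and each value is shifted and scaled into the `[getMin, getMax]` interval.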
spark git commit: [SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler
Repository: spark
Updated Branches: refs/heads/master a8ab2634c -> 5fc058a1f

[SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler

hhbyyh

Author: Xiangrui Meng <m...@databricks.com>

Closes #8145 from mengxr/SPARK-9917.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5fc058a1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5fc058a1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5fc058a1

Branch: refs/heads/master
Commit: 5fc058a1fc5d83ad53feec936475484aef3800b3
Parents: a8ab263
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 21:33:38 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 21:33:38 2015 -0700

--
 .../scala/org/apache/spark/ml/feature/MinMaxScaler.scala | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/5fc058a1/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
index b30adf3..9a473dd 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
@@ -41,6 +41,9 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
   val min: DoubleParam = new DoubleParam(this, "min", "lower bound of the output feature range")

+  /** @group getParam */
+  def getMin: Double = $(min)
+
   /**
    * upper bound after transformation, shared by all features
    * Default: 1.0
@@ -49,6 +52,9 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
   val max: DoubleParam = new DoubleParam(this, "max", "upper bound of the output feature range")

+  /** @group getParam */
+  def getMax: Double = $(max)
+
   /** Validates and transforms the input schema. */
   protected def validateAndTransformSchema(schema: StructType): StructType = {
     val inputType = schema($(inputCol)).dataType
@@ -115,6 +121,9 @@ class MinMaxScaler(override val uid: String)
  * :: Experimental ::
  * Model fitted by [[MinMaxScaler]].
  *
+ * @param originalMin min value for each original column during fitting
+ * @param originalMax max value for each original column during fitting
+ *
  * TODO: The transformer does not yet set the metadata in the output column (SPARK-8529).
  */
 @Experimental
@@ -136,7 +145,6 @@ class MinMaxScalerModel private[ml] (
   /** @group setParam */
   def setMax(value: Double): this.type = set(max, value)

-
   override def transform(dataset: DataFrame): DataFrame = {
     val originalRange = (originalMax.toBreeze - originalMin.toBreeze).toArray
     val minArray = originalMin.toArray
spark git commit: [SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation
Repository: spark
Updated Branches: refs/heads/master 5fc058a1f -> df5438921

[SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation

Author: shikai.tang <tar.sk...@gmail.com>

Closes #7429 from mosessky/master.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/df543892
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/df543892
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/df543892

Branch: refs/heads/master
Commit: df543892122342b97e5137b266959ba97589b3ef
Parents: 5fc058a
Author: shikai.tang <tar.sk...@gmail.com>
Authored: Wed Aug 12 21:53:15 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 21:53:15 2015 -0700

--
 .../BinaryClassificationMetrics.scala           | 32 +---
 .../mllib/evaluation/MulticlassMetrics.scala    |  9 ++
 .../mllib/evaluation/MultilabelMetrics.scala    |  4 +++
 .../spark/mllib/evaluation/RankingMetrics.scala |  4 +++
 .../mllib/evaluation/RegressionMetrics.scala    |  6
 5 files changed, 50 insertions(+), 5 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/df543892/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
index c1d1a22..486741e 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
@@ -41,6 +41,7 @@ import org.apache.spark.sql.DataFrame
  *    of bins may not exactly equal numBins. The last bin in each partition may
  *    be smaller as a result, meaning there may be an extra sample at
  *    partition boundaries.
+ * @since 1.3.0
  */
 @Experimental
 class BinaryClassificationMetrics(
@@ -51,6 +52,7 @@ class BinaryClassificationMetrics(

   /**
    * Defaults `numBins` to 0.
+   * @since 1.0.0
    */
   def this(scoreAndLabels: RDD[(Double, Double)]) = this(scoreAndLabels, 0)

@@ -61,12 +63,18 @@ class BinaryClassificationMetrics(
   private[mllib] def this(scoreAndLabels: DataFrame) =
     this(scoreAndLabels.map(r => (r.getDouble(0), r.getDouble(1))))

-  /** Unpersist intermediate RDDs used in the computation. */
+  /**
+   * Unpersist intermediate RDDs used in the computation.
+   * @since 1.0.0
+   */
   def unpersist() {
     cumulativeCounts.unpersist()
   }

-  /** Returns thresholds in descending order. */
+  /**
+   * Returns thresholds in descending order.
+   * @since 1.0.0
+   */
   def thresholds(): RDD[Double] = cumulativeCounts.map(_._1)

   /**
@@ -74,6 +82,7 @@ class BinaryClassificationMetrics(
    * which is an RDD of (false positive rate, true positive rate)
    * with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
    * @see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
+   * @since 1.0.0
    */
   def roc(): RDD[(Double, Double)] = {
     val rocCurve = createCurve(FalsePositiveRate, Recall)
@@ -85,6 +94,7 @@ class BinaryClassificationMetrics(

   /**
    * Computes the area under the receiver operating characteristic (ROC) curve.
+   * @since 1.0.0
    */
   def areaUnderROC(): Double = AreaUnderCurve.of(roc())

@@ -92,6 +102,7 @@ class BinaryClassificationMetrics(
    * Returns the precision-recall curve, which is an RDD of (recall, precision),
    * NOT (precision, recall), with (0.0, 1.0) prepended to it.
    * @see http://en.wikipedia.org/wiki/Precision_and_recall
+   * @since 1.0.0
    */
   def pr(): RDD[(Double, Double)] = {
     val prCurve = createCurve(Recall, Precision)
@@ -102,6 +113,7 @@ class BinaryClassificationMetrics(

   /**
    * Computes the area under the precision-recall curve.
+   * @since 1.0.0
    */
   def areaUnderPR(): Double = AreaUnderCurve.of(pr())

@@ -110,16 +122,26 @@ class BinaryClassificationMetrics(
    * @param beta the beta factor in F-Measure computation.
    * @return an RDD of (threshold, F-Measure) pairs.
    * @see http://en.wikipedia.org/wiki/F1_score
+   * @since 1.0.0
    */
   def fMeasureByThreshold(beta: Double): RDD[(Double, Double)] = createCurve(FMeasure(beta))

-  /** Returns the (threshold, F-Measure) curve with beta = 1.0. */
+  /**
+   * Returns the (threshold, F-Measure) curve with beta = 1.0.
+   * @since 1.0.0
+   */
   def fMeasureByThreshold(): RDD[(Double, Double)] = fMeasureByThreshold(1.0)

-  /** Returns the (threshold, precision) curve. */
+  /**
+   * Returns the (threshold, precision) curve.
+   * @since 1.0.0
+   */
   def precisionByThreshold(): RDD[(Double
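The methods annotated above compute threshold-indexed curves from (score, label) pairs. A minimal single-machine sketch of `fMeasureByThreshold` (not MLlib's binned, RDD-based implementation -- in particular it does not deduplicate equal scores, and the helper name is hypothetical):

```python
def f_measure_by_threshold(score_and_labels, beta=1.0):
    """Return (threshold, F-beta) pairs in descending threshold order from
    (score, label) pairs with labels in {0.0, 1.0}.

    F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), where at each
    threshold P and R are computed over examples scored at or above it."""
    beta2 = beta * beta
    total_pos = sum(label for _, label in score_and_labels)
    curve = []
    tp = fp = 0
    for score, label in sorted(score_and_labels, reverse=True):
        if label == 1.0:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        denom = beta2 * precision + recall
        f = (1 + beta2) * precision * recall / denom if denom else 0.0
        curve.append((score, f))
    return curve
```

`precisionByThreshold` and `recallByThreshold` follow the same sweep, emitting `precision` or `recall` instead of `f` at each threshold.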
spark git commit: [SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation
Repository: spark
Updated Branches: refs/heads/branch-1.5 8f055e595 -> 690284037

[SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation

Author: shikai.tang <tar.sk...@gmail.com>

Closes #7429 from mosessky/master.

(cherry picked from commit df543892122342b97e5137b266959ba97589b3ef)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69028403
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69028403
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69028403

Branch: refs/heads/branch-1.5
Commit: 690284037ecd880d48d5e835b150a2f31feb7c73
Parents: 8f055e5
Author: shikai.tang <tar.sk...@gmail.com>
Authored: Wed Aug 12 21:53:15 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 21:53:24 2015 -0700

--
 .../BinaryClassificationMetrics.scala           | 32 +---
 .../mllib/evaluation/MulticlassMetrics.scala    |  9 ++
 .../mllib/evaluation/MultilabelMetrics.scala    |  4 +++
 .../spark/mllib/evaluation/RankingMetrics.scala |  4 +++
 .../mllib/evaluation/RegressionMetrics.scala    |  6
 5 files changed, 50 insertions(+), 5 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/69028403/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
index c1d1a22..486741e 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
@@ -41,6 +41,7 @@ import org.apache.spark.sql.DataFrame
  *    of bins may not exactly equal numBins. The last bin in each partition may
  *    be smaller as a result, meaning there may be an extra sample at
  *    partition boundaries.
+ * @since 1.3.0
  */
 @Experimental
 class BinaryClassificationMetrics(
@@ -51,6 +52,7 @@ class BinaryClassificationMetrics(

   /**
    * Defaults `numBins` to 0.
+   * @since 1.0.0
    */
   def this(scoreAndLabels: RDD[(Double, Double)]) = this(scoreAndLabels, 0)

@@ -61,12 +63,18 @@ class BinaryClassificationMetrics(
   private[mllib] def this(scoreAndLabels: DataFrame) =
     this(scoreAndLabels.map(r => (r.getDouble(0), r.getDouble(1))))

-  /** Unpersist intermediate RDDs used in the computation. */
+  /**
+   * Unpersist intermediate RDDs used in the computation.
+   * @since 1.0.0
+   */
   def unpersist() {
     cumulativeCounts.unpersist()
   }

-  /** Returns thresholds in descending order. */
+  /**
+   * Returns thresholds in descending order.
+   * @since 1.0.0
+   */
   def thresholds(): RDD[Double] = cumulativeCounts.map(_._1)

   /**
@@ -74,6 +82,7 @@ class BinaryClassificationMetrics(
    * which is an RDD of (false positive rate, true positive rate)
    * with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
    * @see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
+   * @since 1.0.0
    */
   def roc(): RDD[(Double, Double)] = {
     val rocCurve = createCurve(FalsePositiveRate, Recall)
@@ -85,6 +94,7 @@ class BinaryClassificationMetrics(

   /**
    * Computes the area under the receiver operating characteristic (ROC) curve.
+   * @since 1.0.0
    */
   def areaUnderROC(): Double = AreaUnderCurve.of(roc())

@@ -92,6 +102,7 @@ class BinaryClassificationMetrics(
    * Returns the precision-recall curve, which is an RDD of (recall, precision),
    * NOT (precision, recall), with (0.0, 1.0) prepended to it.
    * @see http://en.wikipedia.org/wiki/Precision_and_recall
+   * @since 1.0.0
    */
   def pr(): RDD[(Double, Double)] = {
     val prCurve = createCurve(Recall, Precision)
@@ -102,6 +113,7 @@ class BinaryClassificationMetrics(

   /**
    * Computes the area under the precision-recall curve.
+   * @since 1.0.0
    */
   def areaUnderPR(): Double = AreaUnderCurve.of(pr())

@@ -110,16 +122,26 @@ class BinaryClassificationMetrics(
    * @param beta the beta factor in F-Measure computation.
    * @return an RDD of (threshold, F-Measure) pairs.
    * @see http://en.wikipedia.org/wiki/F1_score
+   * @since 1.0.0
    */
   def fMeasureByThreshold(beta: Double): RDD[(Double, Double)] = createCurve(FMeasure(beta))

-  /** Returns the (threshold, F-Measure) curve with beta = 1.0. */
+  /**
+   * Returns the (threshold, F-Measure) curve with beta = 1.0.
+   * @since 1.0.0
+   */
   def fMeasureByThreshold(): RDD[(Double, Double)] = fMeasureByThreshold(1.0)

-  /** Returns the (threshold, precision) curve
spark git commit: [SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes
Repository: spark
Updated Branches: refs/heads/master d2d5e7fe2 -> d7053bea9

[SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes

There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary.

feynmanliang

Author: Xiangrui Meng <m...@databricks.com>

Closes #8136 from mengxr/SPARK-9903.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7053bea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7053bea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7053bea

Branch: refs/heads/master
Commit: d7053bea985679c514b3add029631ea23e1730ce
Parents: d2d5e7f
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 20:44:40 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 20:44:40 2015 -0700

--
 .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 37 +++-
 1 file changed, 21 insertions(+), 16 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d7053bea/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index ad6715b5..dc4ae1d 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -282,25 +282,30 @@ object PrefixSpan extends Logging {
       largePrefixes = newLargePrefixes
     }

-    // Switch to local processing.
-    val bcSmallPrefixes = sc.broadcast(smallPrefixes)
-    val distributedFreqPattern = postfixes.flatMap { postfix =>
-      bcSmallPrefixes.value.values.map { prefix =>
-        (prefix.id, postfix.project(prefix).compressed)
-      }.filter(_._2.nonEmpty)
-    }.groupByKey().flatMap { case (id, projPostfixes) =>
-      val prefix = bcSmallPrefixes.value(id)
-      val localPrefixSpan = new LocalPrefixSpan(minCount, maxPatternLength - prefix.length)
-      // TODO: We collect projected postfixes into memory. We should also compare the performance
-      // TODO: of keeping them on shuffle files.
-      localPrefixSpan.run(projPostfixes.toArray).map { case (pattern, count) =>
-        (prefix.items ++ pattern, count)
+    var freqPatterns = sc.parallelize(localFreqPatterns, 1)
+
+    val numSmallPrefixes = smallPrefixes.size
+    logInfo(s"number of small prefixes for local processing: $numSmallPrefixes")
+    if (numSmallPrefixes > 0) {
+      // Switch to local processing.
+      val bcSmallPrefixes = sc.broadcast(smallPrefixes)
+      val distributedFreqPattern = postfixes.flatMap { postfix =>
+        bcSmallPrefixes.value.values.map { prefix =>
+          (prefix.id, postfix.project(prefix).compressed)
+        }.filter(_._2.nonEmpty)
+      }.groupByKey().flatMap { case (id, projPostfixes) =>
+        val prefix = bcSmallPrefixes.value(id)
+        val localPrefixSpan = new LocalPrefixSpan(minCount, maxPatternLength - prefix.length)
+        // TODO: We collect projected postfixes into memory. We should also compare the performance
+        // TODO: of keeping them on shuffle files.
+        localPrefixSpan.run(projPostfixes.toArray).map { case (pattern, count) =>
+          (prefix.items ++ pattern, count)
+        }
       }
+      // Union local frequent patterns and distributed ones.
+      freqPatterns = freqPatterns ++ distributedFreqPattern
     }
-    // Union local frequent patterns and distributed ones.
-    val freqPatterns = (sc.parallelize(localFreqPatterns, 1) ++ distributedFreqPattern)
-      .persist(StorageLevel.MEMORY_AND_DISK)
     freqPatterns
   }
spark git commit: [SPARK-9914] [ML] define setters explicitly for Java and use setParam group in RFormula
Repository: spark
Updated Branches: refs/heads/master df5438921 -> d7eb371eb

[SPARK-9914] [ML] define setters explicitly for Java and use setParam group in RFormula

The problem with defining setters in the base class is that it doesn't return the correct type in Java.

ericl

Author: Xiangrui Meng <m...@databricks.com>

Closes #8143 from mengxr/SPARK-9914 and squashes the following commits:

d36c887 [Xiangrui Meng] remove setters from model
a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7eb371e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7eb371e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7eb371e

Branch: refs/heads/master
Commit: d7eb371eb6369a34e58a09179efe058c4101de9e
Parents: df54389
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 22:30:33 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 22:30:33 2015 -0700

--
 .../scala/org/apache/spark/ml/feature/RFormula.scala | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d7eb371e/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
index d5360c9..a752dac 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
@@ -33,11 +33,6 @@ import org.apache.spark.sql.types._
  * Base trait for [[RFormula]] and [[RFormulaModel]].
  */
 private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {

-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-
-  /** @group getParam */
-  def setLabelCol(value: String): this.type = set(labelCol, value)

   protected def hasLabelCol(schema: StructType): Boolean = {
     schema.map(_.name).contains($(labelCol))
@@ -71,6 +66,12 @@ class RFormula(override val uid: String) extends Estimator[RFormulaModel] with R
   /** @group getParam */
   def getFormula: String = $(formula)

+  /** @group setParam */
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  def setLabelCol(value: String): this.type = set(labelCol, value)
+
   /** Whether the formula specifies fitting an intercept. */
   private[ml] def hasIntercept: Boolean = {
     require(isDefined(formula), "Formula must be defined first.")
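The motivation here is chainable setters: Scala's `this.type` guarantees each setter returns the concrete class, but when the setters live in a base trait, Java callers see the base type and lose the ability to keep chaining. A hypothetical Python sketch of the fluent-setter pattern itself (in Python, returning `self` preserves the concrete type dynamically, which is exactly what the Java view of the Scala base trait could not do statically):

```python
class PipelineStage:
    """Minimal stand-in for the shared base that holds a parameter map."""
    def __init__(self):
        self.params = {}

class RFormulaSketch(PipelineStage):
    """Hypothetical fluent builder: every setter returns self so calls chain
    and the caller keeps the concrete type throughout."""
    def set_formula(self, value):
        self.params["formula"] = value
        return self

    def set_features_col(self, value):
        self.params["featuresCol"] = value
        return self

    def set_label_col(self, value):
        self.params["labelCol"] = value
        return self
```

Defining the setters on the concrete class (as the patch does) is the statically-typed analogue of this: the chain `new RFormula().setFormula(...).setLabelCol(...)` then type-checks as `RFormula` from Java as well.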
spark git commit: [SPARK-7583] [MLLIB] User guide update for RegexTokenizer
Repository: spark
Updated Branches: refs/heads/branch-1.5 bc4ac65d4 -> 2d86faddd

[SPARK-7583] [MLLIB] User guide update for RegexTokenizer

jira: https://issues.apache.org/jira/browse/SPARK-7583

User guide update for RegexTokenizer

Author: Yuhao Yang <hhb...@gmail.com>

Closes #7828 from hhbyyh/regexTokenizerDoc.

(cherry picked from commit 66d87c1d76bea2b81993156ac1fa7dad6c312ebf)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2d86fadd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2d86fadd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2d86fadd

Branch: refs/heads/branch-1.5
Commit: 2d86faddd87b6e61565cbdf18dadaf4aeb2b223e
Parents: bc4ac65
Author: Yuhao Yang <hhb...@gmail.com>
Authored: Wed Aug 12 09:35:32 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 09:35:41 2015 -0700

--
 docs/ml-features.md | 41 ++---
 1 file changed, 30 insertions(+), 11 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/2d86fadd/docs/ml-features.md

diff --git a/docs/ml-features.md b/docs/ml-features.md
index fa0ad1f..cec2cbe 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -217,21 +217,32 @@
 [Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer) class provides this functionality. The example below shows how to split sentences into sequences of words.

-Note: A more advanced tokenizer is provided via [RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer).
+[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more
+ advanced tokenization based on regular expression (regex) matching.
+ By default, the parameter "pattern" (regex, default: "\\s+") is used as delimiters to split the input text.
+ Alternatively, users can set parameter "gaps" to false indicating the regex "pattern" denotes
+ "tokens" rather than splitting gaps, and find all matching occurrences as the tokenization result.

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 {% highlight scala %}
-import org.apache.spark.ml.feature.Tokenizer
+import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer}

 val sentenceDataFrame = sqlContext.createDataFrame(Seq(
   (0, "Hi I heard about Spark"),
-  (0, "I wish Java could use case classes"),
-  (1, "Logistic regression models are neat")
+  (1, "I wish Java could use case classes"),
+  (2, "Logistic,regression,models,are,neat")
 )).toDF("label", "sentence")
 val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
-val wordsDataFrame = tokenizer.transform(sentenceDataFrame)
-wordsDataFrame.select("words", "label").take(3).foreach(println)
+val regexTokenizer = new RegexTokenizer()
+  .setInputCol("sentence")
+  .setOutputCol("words")
+  .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)
+
+val tokenized = tokenizer.transform(sentenceDataFrame)
+tokenized.select("words", "label").take(3).foreach(println)
+val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
+regexTokenized.select("words", "label").take(3).foreach(println)
 {% endhighlight %}
 </div>

@@ -240,6 +251,7 @@
 import com.google.common.collect.Lists;

 import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.RegexTokenizer;
 import org.apache.spark.ml.feature.Tokenizer;
 import org.apache.spark.mllib.linalg.Vector;
 import org.apache.spark.sql.DataFrame;
@@ -252,8 +264,8 @@
 JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
   RowFactory.create(0, "Hi I heard about Spark"),
-  RowFactory.create(0, "I wish Java could use case classes"),
-  RowFactory.create(1, "Logistic regression models are neat")
+  RowFactory.create(1, "I wish Java could use case classes"),
+  RowFactory.create(2, "Logistic,regression,models,are,neat")
 ));
 StructType schema = new StructType(new StructField[]{
   new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
@@ -267,22 +279,29 @@
 for (Row r : wordsDataFrame.select("words", "label").take(3)) {
   for (String word : words) System.out.print(word + " ");
   System.out.println();
 }
+
+RegexTokenizer regexTokenizer = new RegexTokenizer()
+  .setInputCol("sentence")
+  .setOutputCol("words")
+  .setPattern("\\W"); // alternatively .setPattern("\\w+").setGaps(false);
 {% endhighlight %}
 </div>

 <div data-lang="python" markdown="1">
 {% highlight
spark git commit: [SPARK-7583] [MLLIB] User guide update for RegexTokenizer
Repository: spark Updated Branches: refs/heads/master be5d19120 - 66d87c1d7 [SPARK-7583] [MLLIB] User guide update for RegexTokenizer jira: https://issues.apache.org/jira/browse/SPARK-7583 User guide update for RegexTokenizer Author: Yuhao Yang hhb...@gmail.com Closes #7828 from hhbyyh/regexTokenizerDoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/66d87c1d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/66d87c1d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/66d87c1d Branch: refs/heads/master Commit: 66d87c1d76bea2b81993156ac1fa7dad6c312ebf Parents: be5d191 Author: Yuhao Yang hhb...@gmail.com Authored: Wed Aug 12 09:35:32 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 09:35:32 2015 -0700 -- docs/ml-features.md | 41 ++--- 1 file changed, 30 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/66d87c1d/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index fa0ad1f..cec2cbe 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -217,21 +217,32 @@ for feature in result.select(result).take(3): [Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer) class provides this functionality. The example below shows how to split sentences into sequences of words. -Note: A more advanced tokenizer is provided via [RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer). +[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more + advanced tokenization based on regular expression (regex) matching. + By default, the parameter pattern (regex, default: \\s+) is used as delimiters to split the input text. 
+ Alternatively, users can set parameter gaps to false indicating the regex pattern denotes + tokens rather than splitting gaps, and find all matching occurrences as the tokenization result. div class=codetabs div data-lang=scala markdown=1 {% highlight scala %} -import org.apache.spark.ml.feature.Tokenizer +import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer} val sentenceDataFrame = sqlContext.createDataFrame(Seq( (0, Hi I heard about Spark), - (0, I wish Java could use case classes), - (1, Logistic regression models are neat) + (1, I wish Java could use case classes), + (2, Logistic,regression,models,are,neat) )).toDF(label, sentence) val tokenizer = new Tokenizer().setInputCol(sentence).setOutputCol(words) -val wordsDataFrame = tokenizer.transform(sentenceDataFrame) -wordsDataFrame.select(words, label).take(3).foreach(println) +val regexTokenizer = new RegexTokenizer() + .setInputCol(sentence) + .setOutputCol(words) + .setPattern(\\W) // alternatively .setPattern(\\w+).setGaps(false) + +val tokenized = tokenizer.transform(sentenceDataFrame) +tokenized.select(words, label).take(3).foreach(println) +val regexTokenized = regexTokenizer.transform(sentenceDataFrame) +regexTokenized.select(words, label).take(3).foreach(println) {% endhighlight %} /div @@ -240,6 +251,7 @@ wordsDataFrame.select(words, label).take(3).foreach(println) import com.google.common.collect.Lists; import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.ml.feature.RegexTokenizer; import org.apache.spark.ml.feature.Tokenizer; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.sql.DataFrame; @@ -252,8 +264,8 @@ import org.apache.spark.sql.types.StructType; JavaRDDRow jrdd = jsc.parallelize(Lists.newArrayList( RowFactory.create(0, Hi I heard about Spark), - RowFactory.create(0, I wish Java could use case classes), - RowFactory.create(1, Logistic regression models are neat) + RowFactory.create(1, I wish Java could use case classes), + RowFactory.create(2, 
Logistic,regression,models,are,neat) )); StructType schema = new StructType(new StructField[]{ new StructField(label, DataTypes.DoubleType, false, Metadata.empty()), @@ -267,22 +279,29 @@ for (Row r : wordsDataFrame.select(words, label).take(3)) { for (String word : words) System.out.print(word + ); System.out.println(); } + +RegexTokenizer regexTokenizer = new RegexTokenizer() + .setInputCol(sentence) + .setOutputCol(words) + .setPattern(\\W); // alternatively .setPattern(\\w+).setGaps(false); {% endhighlight %} /div div data-lang=python markdown=1 {% highlight python %} -from pyspark.ml.feature import Tokenizer +from pyspark.ml.feature import Tokenizer, RegexTokenizer sentenceDataFrame
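The two tokenization modes this commit documents — treating the pattern as a delimiter between tokens (the default, `gaps=true`) versus treating the pattern as the tokens themselves (`gaps=false`) — can be illustrated outside Spark with plain regular expressions. This is a minimal pure-Python sketch of the semantics only, not Spark's `RegexTokenizer` implementation; the `tokenize` helper is a name invented for the example.

```python
import re

def tokenize(text, pattern=r"\s+", gaps=True):
    # gaps=True: the pattern marks the delimiters between tokens
    # (default \s+), as the updated guide describes for RegexTokenizer.
    # gaps=False: the pattern describes the tokens themselves, and every
    # match becomes a token.
    if gaps:
        return [t for t in re.split(pattern, text) if t]
    return re.findall(pattern, text)

sentence = "Logistic,regression,models,are,neat"
print(tokenize(sentence))                     # default \s+ finds no gaps: one token
print(tokenize(sentence, r"\W", gaps=True))   # split on non-word characters
print(tokenize(sentence, r"\w+", gaps=False))  # match runs of word characters directly
```

Note how the comma-separated sentence added to the example data motivates the switch: the plain whitespace tokenizer leaves it as a single token, while either `\W` as a gap pattern or `\w+` as a token pattern recovers the five words.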
spark git commit: [SPARK-9847] [ML] Modified copyValues to distinguish between default, explicit param values
Repository: spark Updated Branches: refs/heads/master 57ec27dd7 - 70fe55886 [SPARK-9847] [ML] Modified copyValues to distinguish between default, explicit param values From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics. This issue arose in SPARK-9789, where 2 params threshold and thresholds for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params. CC: mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #8115 from jkbradley/copyvalues-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/70fe5588 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/70fe5588 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/70fe5588 Branch: refs/heads/master Commit: 70fe558867ccb4bcff6ec673438b03608bb02252 Parents: 57ec27d Author: Joseph K. 
Bradley jos...@databricks.com Authored: Wed Aug 12 10:48:52 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 10:48:52 2015 -0700 -- .../scala/org/apache/spark/ml/param/params.scala | 19 --- .../org/apache/spark/ml/param/ParamsSuite.scala | 8 2 files changed, 24 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/70fe5588/mllib/src/main/scala/org/apache/spark/ml/param/params.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/params.scala b/mllib/src/main/scala/org/apache/spark/ml/param/params.scala index d68f5ff..91c0a56 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/params.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/params.scala @@ -559,13 +559,26 @@ trait Params extends Identifiable with Serializable { /** * Copies param values from this instance to another instance for params shared by them. - * @param to the target instance - * @param extra extra params to be copied + * + * This handles default Params and explicitly set Params separately. + * Default Params are copied from and to [[defaultParamMap]], and explicitly set Params are + * copied from and to [[paramMap]]. + * Warning: This implicitly assumes that this [[Params]] instance and the target instance + * share the same set of default Params. 
+ * + * @param to the target instance, which should work with the same set of default Params as this + * source instance + * @param extra extra params to be copied to the target's [[paramMap]] * @return the target instance with param values copied */ protected def copyValues[T : Params](to: T, extra: ParamMap = ParamMap.empty): T = { -val map = extractParamMap(extra) +val map = paramMap ++ extra params.foreach { param = + // copy default Params + if (defaultParamMap.contains(param) to.hasParam(param.name)) { +to.defaultParamMap.put(to.getParam(param.name), defaultParamMap(param)) + } + // copy explicitly set Params if (map.contains(param) to.hasParam(param.name)) { to.set(param.name, map(param)) } http://git-wip-us.apache.org/repos/asf/spark/blob/70fe5588/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala index 050d417..be95638 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala @@ -200,6 +200,14 @@ class ParamsSuite extends SparkFunSuite { val inArray = ParamValidators.inArray[Int](Array(1, 2)) assert(inArray(1) inArray(2) !inArray(0)) } + + test(Params.copyValues) { +val t = new TestParams() +val t2 = t.copy(ParamMap.empty) +assert(!t2.isSet(t2.maxIter)) +val t3 = t.copy(ParamMap(t.maxIter - 20)) +assert(t3.isSet(t3.maxIter)) + } } object ParamsSuite extends SparkFunSuite { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
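The distinction the fix draws — defaults copied into the target's `defaultParamMap`, explicitly set values into its `paramMap` — can be sketched with a toy two-level parameter store. This is an illustration of the copy semantics described in the commit message, not Spark's `Params` trait; class and method names here are invented for the example.

```python
class ToyParams:
    """Toy model of two-level param storage: defaults in default_map,
    user-set values in param_map (mirroring defaultParamMap / paramMap)."""
    def __init__(self):
        self.default_map = {}
        self.param_map = {}

    def copy_values(self, to, extra=None):
        # Defaults go to the target's *default* map, so they never look
        # explicitly set on the target...
        for k, v in self.default_map.items():
            to.default_map[k] = v
        # ...while explicitly set params (plus extras) go to param_map.
        for k, v in dict(self.param_map, **(extra or {})).items():
            to.param_map[k] = v
        return to

    def is_set(self, name):
        return name in self.param_map

src = ToyParams()
src.default_map["threshold"] = 0.5        # default, never set by the user
src.param_map["thresholds"] = [0.2, 0.8]  # explicitly set by the user

dst = src.copy_values(ToyParams())
print(dst.is_set("threshold"))   # False: the default did not leak into param_map
print(dst.is_set("thresholds"))  # True
```

With the pre-fix behavior (everything copied into `param_map`), `threshold` would appear explicitly set on the copy, conflicting with the explicitly set `thresholds` — exactly the inconsistency cited from SPARK-9789.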
spark git commit: Closes #1290 Closes #4934
Repository: spark Updated Branches: refs/heads/master f16bc68df - 423cdfd83 Closes #1290 Closes #4934 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/423cdfd8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/423cdfd8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/423cdfd8 Branch: refs/heads/master Commit: 423cdfd83d7fd02a4f8cf3e714db913fd3f9ca09 Parents: f16bc68 Author: Xiangrui Meng m...@databricks.com Authored: Tue Aug 11 14:08:09 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 14:08:09 2015 -0700 -- --
spark git commit: [SPARK-8925] [MLLIB] Add @since tags to mllib.util
Repository: spark Updated Branches: refs/heads/branch-1.5 2273e7432 - ef961ed48 [SPARK-8925] [MLLIB] Add @since tags to mllib.util Went thru the history of changes the file MLUtils.scala and picked up the version that the change went in. Author: Sudhakar Thota sudhakarth...@yahoo.com Author: Sudhakar Thota sudhakarth...@sudhakars-mbp-2.usca.ibm.com Closes #7436 from sthota2014/SPARK-8925_thotas. (cherry picked from commit 017b5de07ef6cff249e984a2ab781c520249ac76) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ef961ed4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ef961ed4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ef961ed4 Branch: refs/heads/branch-1.5 Commit: ef961ed48a4f45447f0e0ad256b040c7ab2d78d9 Parents: 2273e74 Author: Sudhakar Thota sudhakarth...@yahoo.com Authored: Tue Aug 11 14:31:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 14:32:01 2015 -0700 -- .../org/apache/spark/mllib/util/MLUtils.scala | 22 +++- 1 file changed, 21 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ef961ed4/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala index 7c5cfa7..26eb84a 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala @@ -64,6 +64,7 @@ object MLUtils { *feature dimensions. * @param minPartitions min number of partitions * @return labeled data stored as an RDD[LabeledPoint] + * @since 1.0.0 */ def loadLibSVMFile( sc: SparkContext, @@ -113,7 +114,10 @@ object MLUtils { } // Convenient methods for `loadLibSVMFile`. 
- + + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -126,6 +130,7 @@ object MLUtils { /** * Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of * partitions. + * @since 1.0.0 */ def loadLibSVMFile( sc: SparkContext, @@ -133,6 +138,9 @@ object MLUtils { numFeatures: Int): RDD[LabeledPoint] = loadLibSVMFile(sc, path, numFeatures, sc.defaultMinPartitions) + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -141,6 +149,9 @@ object MLUtils { numFeatures: Int): RDD[LabeledPoint] = loadLibSVMFile(sc, path, numFeatures) + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -151,6 +162,7 @@ object MLUtils { /** * Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with number of * features determined automatically and the default number of partitions. + * @since 1.0.0 */ def loadLibSVMFile(sc: SparkContext, path: String): RDD[LabeledPoint] = loadLibSVMFile(sc, path, -1) @@ -181,12 +193,14 @@ object MLUtils { * @param path file or directory path in any Hadoop-supported file system URI * @param minPartitions min number of partitions * @return vectors stored as an RDD[Vector] + * @since 1.1.0 */ def loadVectors(sc: SparkContext, path: String, minPartitions: Int): RDD[Vector] = sc.textFile(path, minPartitions).map(Vectors.parse) /** * Loads vectors saved using `RDD[Vector].saveAsTextFile` with the default number of partitions. 
+ * @since 1.1.0 */ def loadVectors(sc: SparkContext, path: String): RDD[Vector] = sc.textFile(path, sc.defaultMinPartitions).map(Vectors.parse) @@ -197,6 +211,7 @@ object MLUtils { * @param path file or directory path in any Hadoop-supported file system URI * @param minPartitions min number of partitions * @return labeled points stored as an RDD[LabeledPoint] + * @since 1.1.0 */ def loadLabeledPoints(sc: SparkContext, path: String, minPartitions: Int): RDD[LabeledPoint] = sc.textFile(path, minPartitions).map(LabeledPoint.parse) @@ -204,6 +219,7 @@ object MLUtils { /** * Loads labeled points saved using `RDD[LabeledPoint].saveAsTextFile` with the default number of * partitions. + * @since 1.1.0 */ def loadLabeledPoints(sc: SparkContext, dir: String): RDD[LabeledPoint] = loadLabeledPoints(sc, dir, sc.defaultMinPartitions) @@ -220,6 +236,7
spark git commit: [SPARK-8925] [MLLIB] Add @since tags to mllib.util
Repository: spark Updated Branches: refs/heads/master be3e27164 - 017b5de07 [SPARK-8925] [MLLIB] Add @since tags to mllib.util Went thru the history of changes the file MLUtils.scala and picked up the version that the change went in. Author: Sudhakar Thota sudhakarth...@yahoo.com Author: Sudhakar Thota sudhakarth...@sudhakars-mbp-2.usca.ibm.com Closes #7436 from sthota2014/SPARK-8925_thotas. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/017b5de0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/017b5de0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/017b5de0 Branch: refs/heads/master Commit: 017b5de07ef6cff249e984a2ab781c520249ac76 Parents: be3e271 Author: Sudhakar Thota sudhakarth...@yahoo.com Authored: Tue Aug 11 14:31:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 14:31:51 2015 -0700 -- .../org/apache/spark/mllib/util/MLUtils.scala | 22 +++- 1 file changed, 21 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/017b5de0/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala index 7c5cfa7..26eb84a 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala @@ -64,6 +64,7 @@ object MLUtils { *feature dimensions. * @param minPartitions min number of partitions * @return labeled data stored as an RDD[LabeledPoint] + * @since 1.0.0 */ def loadLibSVMFile( sc: SparkContext, @@ -113,7 +114,10 @@ object MLUtils { } // Convenient methods for `loadLibSVMFile`. 
- + + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -126,6 +130,7 @@ object MLUtils { /** * Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of * partitions. + * @since 1.0.0 */ def loadLibSVMFile( sc: SparkContext, @@ -133,6 +138,9 @@ object MLUtils { numFeatures: Int): RDD[LabeledPoint] = loadLibSVMFile(sc, path, numFeatures, sc.defaultMinPartitions) + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -141,6 +149,9 @@ object MLUtils { numFeatures: Int): RDD[LabeledPoint] = loadLibSVMFile(sc, path, numFeatures) + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -151,6 +162,7 @@ object MLUtils { /** * Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with number of * features determined automatically and the default number of partitions. + * @since 1.0.0 */ def loadLibSVMFile(sc: SparkContext, path: String): RDD[LabeledPoint] = loadLibSVMFile(sc, path, -1) @@ -181,12 +193,14 @@ object MLUtils { * @param path file or directory path in any Hadoop-supported file system URI * @param minPartitions min number of partitions * @return vectors stored as an RDD[Vector] + * @since 1.1.0 */ def loadVectors(sc: SparkContext, path: String, minPartitions: Int): RDD[Vector] = sc.textFile(path, minPartitions).map(Vectors.parse) /** * Loads vectors saved using `RDD[Vector].saveAsTextFile` with the default number of partitions. 
+ * @since 1.1.0 */ def loadVectors(sc: SparkContext, path: String): RDD[Vector] = sc.textFile(path, sc.defaultMinPartitions).map(Vectors.parse) @@ -197,6 +211,7 @@ object MLUtils { * @param path file or directory path in any Hadoop-supported file system URI * @param minPartitions min number of partitions * @return labeled points stored as an RDD[LabeledPoint] + * @since 1.1.0 */ def loadLabeledPoints(sc: SparkContext, path: String, minPartitions: Int): RDD[LabeledPoint] = sc.textFile(path, minPartitions).map(LabeledPoint.parse) @@ -204,6 +219,7 @@ object MLUtils { /** * Loads labeled points saved using `RDD[LabeledPoint].saveAsTextFile` with the default number of * partitions. + * @since 1.1.0 */ def loadLabeledPoints(sc: SparkContext, dir: String): RDD[LabeledPoint] = loadLabeledPoints(sc, dir, sc.defaultMinPartitions) @@ -220,6 +236,7 @@ object MLUtils { * * @deprecated Should use [[org.apache.spark.rdd.RDD#saveAsTextFile]] for saving
spark git commit: [SPARK-8345] [ML] Add an SQL node as a feature transformer
Repository: spark Updated Branches: refs/heads/master bce72797f - 8cad854ef [SPARK-8345] [ML] Add an SQL node as a feature transformer Implements the transforms which are defined by SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__' where '__THIS__' represents the underlying table of the input dataset. Author: Yanbo Liang yblia...@gmail.com Closes #7465 from yanboliang/spark-8345 and squashes the following commits: b403fcb [Yanbo Liang] address comments 0d4bb15 [Yanbo Liang] a better transformSchema() implementation 51eb9e7 [Yanbo Liang] Add an SQL node as a feature transformer Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8cad854e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8cad854e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8cad854e Branch: refs/heads/master Commit: 8cad854ef6a2066de5adffcca6b79a205ccfd5f3 Parents: bce7279 Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 11 11:01:59 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 11:01:59 2015 -0700 -- .../spark/ml/feature/SQLTransformer.scala | 72 .../spark/ml/feature/SQLTransformerSuite.scala | 44 2 files changed, 116 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8cad854e/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala new file mode 100644 index 000..95e4305 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.param.{ParamMap, Param} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.{SQLContext, DataFrame, Row} +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * Implements the transforms which are defined by SQL statement. + * Currently we only support SQL syntax like 'SELECT ... FROM __THIS__' + * where '__THIS__' represents the underlying table of the input dataset. + */ +@Experimental +class SQLTransformer (override val uid: String) extends Transformer { + + def this() = this(Identifiable.randomUID(sql)) + + /** + * SQL statement parameter. The statement is provided in string form. 
+ * @group param + */ + final val statement: Param[String] = new Param[String](this, statement, SQL statement) + + /** @group setParam */ + def setStatement(value: String): this.type = set(statement, value) + + /** @group getParam */ + def getStatement: String = $(statement) + + private val tableIdentifier: String = __THIS__ + + override def transform(dataset: DataFrame): DataFrame = { +val tableName = Identifiable.randomUID(uid) +dataset.registerTempTable(tableName) +val realStatement = $(statement).replace(tableIdentifier, tableName) +val outputDF = dataset.sqlContext.sql(realStatement) +outputDF + } + + override def transformSchema(schema: StructType): StructType = { +val sc = SparkContext.getOrCreate() +val sqlContext = SQLContext.getOrCreate(sc) +val dummyRDD = sc.parallelize(Seq(Row.empty)) +val dummyDF = sqlContext.createDataFrame(dummyRDD, schema) +dummyDF.registerTempTable(tableIdentifier) +val outputSchema = sqlContext.sql($(statement)).schema +outputSchema + } + + override def copy(extra: ParamMap): SQLTransformer = defaultCopy(extra) +} http://git-wip-us.apache.org/repos/asf/spark/blob/8cad854e/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala -- diff --git a/mllib/src/test/scala
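The transform pattern this commit introduces — register the input under a generated temp-table name, substitute that name for the `__THIS__` placeholder, then execute the statement — can be demonstrated against SQLite instead of Spark SQL. A hedged sketch, assuming a fixed two-column input; the `sql_transform` helper is a name invented for the example, not part of any API.

```python
import sqlite3
import uuid

TABLE_PLACEHOLDER = "__THIS__"

def sql_transform(conn, rows, schema_sql, statement):
    # Register the input under a unique temp name, substitute it for
    # __THIS__, and run the statement -- the same shape as the
    # SQLTransformer.transform() in the diff, but against SQLite.
    table = "t_" + uuid.uuid4().hex
    conn.execute(f"CREATE TEMP TABLE {table} ({schema_sql})")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    real_statement = statement.replace(TABLE_PLACEHOLDER, table)
    return conn.execute(real_statement).fetchall()

conn = sqlite3.connect(":memory:")
out = sql_transform(conn, [(1, 2.0), (3, 4.0)], "id INTEGER, v REAL",
                    "SELECT id, v * 2 AS v2 FROM __THIS__")
print(out)  # [(1, 4.0), (3, 8.0)]
```

The random table name serves the same purpose as `Identifiable.randomUID(uid)` in the Scala code: two concurrent transforms must not collide on the registered temp-table name.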
spark git commit: [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5
Repository: spark Updated Branches: refs/heads/branch-1.5 6ea33f5bf - 890c75bc2 [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5 This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R api doc. mengxr Author: Eric Liang e...@databricks.com Closes #8085 from ericl/docs. (cherry picked from commit 74a293f4537c6982345166f8883538f81d850872) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/890c75bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/890c75bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/890c75bc Branch: refs/heads/branch-1.5 Commit: 890c75bc2c2e1405c00485a98c034342122b639f Parents: 6ea33f5 Author: Eric Liang e...@databricks.com Authored: Tue Aug 11 21:26:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 21:26:12 2015 -0700 -- R/pkg/R/generics.R | 4 ++-- R/pkg/R/mllib.R| 8 docs/sparkr.md | 37 - 3 files changed, 42 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/890c75bc/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index c43b947..379a78b 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -535,8 +535,8 @@ setGeneric(showDF, function(x,...) { standardGeneric(showDF) }) #' @export setGeneric(summarize, function(x,...) { standardGeneric(summarize) }) -##' rdname summary -##' @export +#' @rdname summary +#' @export setGeneric(summary, function(x, ...) { standardGeneric(summary) }) # @rdname tojson http://git-wip-us.apache.org/repos/asf/spark/blob/890c75bc/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index b524d1f..cea3d76 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -56,10 +56,10 @@ setMethod(glm, signature(formula = formula, family = ANY, data = DataFram #' #' Makes predictions from a model produced by glm(), similarly to R's predict(). 
#' -#' @param model A fitted MLlib model +#' @param object A fitted MLlib model #' @param newData DataFrame for testing #' @return DataFrame containing predicted values -#' @rdname glm +#' @rdname predict #' @export #' @examples #'\dontrun{ @@ -76,10 +76,10 @@ setMethod(predict, signature(object = PipelineModel), #' #' Returns the summary of a model produced by glm(), similarly to R's summary(). #' -#' @param model A fitted MLlib model +#' @param x A fitted MLlib model #' @return a list with a 'coefficient' component, which is the matrix of coefficients. See #' summary.glm for more information. -#' @rdname glm +#' @rdname summary #' @export #' @examples #'\dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/890c75bc/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 4385a4e..7139d16 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -11,7 +11,8 @@ title: SparkR (R on Spark) SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, -[dplyr](https://github.com/hadley/dplyr)) but on large datasets. +[dplyr](https://github.com/hadley/dplyr)) but on large datasets. SparkR also supports distributed +machine learning using MLlib. # SparkR DataFrames @@ -230,3 +231,37 @@ head(teenagers) {% endhighlight %} /div + +# Machine Learning + +SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows the use of building a gaussian GLM model using SparkR. 
+ +div data-lang=r markdown=1 +{% highlight r %} +# Create the DataFrame +df - createDataFrame(sqlContext, iris) + +# Fit a linear model over the dataset. +model - glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = gaussian) + +# Model coefficients are returned in a similar format to R's native glm(). +summary(model) +##$coefficients +##Estimate +##(Intercept)2.2513930 +##Sepal_Width0.8035609 +##Species_versicolor 1.4587432 +##Species_virginica 1.9468169 + +# Make predictions based on the model. +predictions - predict(model, newData = df
spark git commit: [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5
Repository: spark Updated Branches: refs/heads/master 3ef0f3292 - 74a293f45 [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5 This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R api doc. mengxr Author: Eric Liang e...@databricks.com Closes #8085 from ericl/docs. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/74a293f4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/74a293f4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/74a293f4 Branch: refs/heads/master Commit: 74a293f4537c6982345166f8883538f81d850872 Parents: 3ef0f32 Author: Eric Liang e...@databricks.com Authored: Tue Aug 11 21:26:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 21:26:03 2015 -0700 -- R/pkg/R/generics.R | 4 ++-- R/pkg/R/mllib.R| 8 docs/sparkr.md | 37 - 3 files changed, 42 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/74a293f4/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index c43b947..379a78b 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -535,8 +535,8 @@ setGeneric(showDF, function(x,...) { standardGeneric(showDF) }) #' @export setGeneric(summarize, function(x,...) { standardGeneric(summarize) }) -##' rdname summary -##' @export +#' @rdname summary +#' @export setGeneric(summary, function(x, ...) { standardGeneric(summary) }) # @rdname tojson http://git-wip-us.apache.org/repos/asf/spark/blob/74a293f4/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index b524d1f..cea3d76 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -56,10 +56,10 @@ setMethod(glm, signature(formula = formula, family = ANY, data = DataFram #' #' Makes predictions from a model produced by glm(), similarly to R's predict(). 
#' -#' @param model A fitted MLlib model +#' @param object A fitted MLlib model #' @param newData DataFrame for testing #' @return DataFrame containing predicted values -#' @rdname glm +#' @rdname predict #' @export #' @examples #'\dontrun{ @@ -76,10 +76,10 @@ setMethod(predict, signature(object = PipelineModel), #' #' Returns the summary of a model produced by glm(), similarly to R's summary(). #' -#' @param model A fitted MLlib model +#' @param x A fitted MLlib model #' @return a list with a 'coefficient' component, which is the matrix of coefficients. See #' summary.glm for more information. -#' @rdname glm +#' @rdname summary #' @export #' @examples #'\dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/74a293f4/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 4385a4e..7139d16 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -11,7 +11,8 @@ title: SparkR (R on Spark) SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, -[dplyr](https://github.com/hadley/dplyr)) but on large datasets. +[dplyr](https://github.com/hadley/dplyr)) but on large datasets. SparkR also supports distributed +machine learning using MLlib. # SparkR DataFrames @@ -230,3 +231,37 @@ head(teenagers) {% endhighlight %} /div + +# Machine Learning + +SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows the use of building a gaussian GLM model using SparkR. 
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+# Create the DataFrame
+df <- createDataFrame(sqlContext, iris)
+
+# Fit a linear model over the dataset.
+model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
+
+# Model coefficients are returned in a similar format to R's native glm().
+summary(model)
+##$coefficients
+##                    Estimate
+##(Intercept)        2.2513930
+##Sepal_Width        0.8035609
+##Species_versicolor 1.4587432
+##Species_virginica  1.9468169
+
+# Make predictions based on the model.
+predictions <- predict(model, newData = df)
+head(select(predictions, "Sepal_Length", "prediction"))
+##  Sepal_Length prediction
+##1          5.1   5.063856
+##2
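For the gaussian family, the fit shown in this guide reduces to linear least squares over a design matrix built from the formula terms: an intercept, the numeric predictors, and one 0/1 dummy column per non-reference level of each categorical term — which is why the summary lists coefficients for Species_versicolor and Species_virginica but none for Species_setosa. The following pure-Python sketch illustrates that dummy coding and reproduces the first prediction row using the coefficient values from the example output; the helper names are hypothetical, not part of SparkR or MLlib.

```python
# Illustrative sketch of R-formula dummy coding and prediction; function
# names here are hypothetical, not part of SparkR or MLlib.

def design_row(sepal_width, species,
               levels=("setosa", "versicolor", "virginica")):
    """Build one design-matrix row for Sepal_Length ~ Sepal_Width + Species:
    intercept, numeric predictor, then one dummy per non-reference level."""
    _ref, *rest = levels  # the first level is the reference and gets no column
    return [1.0, sepal_width] + [1.0 if species == lvl else 0.0 for lvl in rest]

# Coefficients as reported by summary(model) in the example output:
# (Intercept), Sepal_Width, Species_versicolor, Species_virginica
COEF = [2.2513930, 0.8035609, 1.4587432, 1.9468169]

def predict_one(sepal_width, species):
    """Dot product of the coefficient vector with the design row."""
    return sum(c * x for c, x in zip(COEF, design_row(sepal_width, species)))

# The first iris row (Sepal_Width 3.5, species setosa) reproduces the first
# prediction in the example output:
print(round(predict_one(3.5, "setosa"), 6))  # 5.063856
```

The reference-level encoding is what makes the intercept absorb the baseline species; a different level ordering would shift the reported coefficients without changing the predictions.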
spark git commit: [SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten
Repository: spark
Updated Branches: refs/heads/branch-1.5 e24b97650 -> 78f168e97

[SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten

In short:

1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map corresponds to the very last element observed in the partition, when the elements are strings.

2- Merging two partitions had a bug:

**Existing behavior with size 3**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()

**Correct Behavior:**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)

cc mengxr rxin JoshRosen

Author: Burak Yavuz <brk...@gmail.com>

Closes #7945 from brkyvz/freq-fix and squashes the following commits:

07fa001 [Burak Yavuz] address 2
1dc61a8 [Burak Yavuz] address 1
506753e [Burak Yavuz] fixed and added reg test
47bfd50 [Burak Yavuz] pushing

(cherry picked from commit 98e69467d4fda2c26a951409b5b7c6f1e9345ce4)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/78f168e9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/78f168e9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/78f168e9
Branch: refs/heads/branch-1.5
Commit: 78f168e97238316e33ce0d3763ba655603928c32
Parents: e24b976
Author: Burak Yavuz <brk...@gmail.com>
Authored: Thu Aug 6 10:29:40 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Thu Aug 6 10:29:47 2015 -0700

--
 .../sql/execution/stat/FrequentItems.scala     | 26 +++-
 .../apache/spark/sql/DataFrameStatSuite.scala  | 24 +++---
 2 files changed, 36 insertions(+), 14 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/78f168e9/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
--
diff --git
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
index 9329148..db46302 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
@@ -20,17 +20,15 @@ package org.apache.spark.sql.execution.stat

 import scala.collection.mutable.{Map => MutableMap}

 import org.apache.spark.Logging
-import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.types._
-import org.apache.spark.sql.{Column, DataFrame}
+import org.apache.spark.sql.{Row, Column, DataFrame}

 private[sql] object FrequentItems extends Logging {

   /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
   private class FreqItemCounter(size: Int) extends Serializable {
     val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
-
     /**
      * Add a new example to the counts if it exists, otherwise deduct the count
      * from existing items.
@@ -42,9 +40,15 @@ private[sql] object FrequentItems extends Logging {
       if (baseMap.size < size) {
         baseMap += key -> count
       } else {
-        // TODO: Make this more efficient... A flatMap?
-        baseMap.retain((k, v) => v > count)
-        baseMap.transform((k, v) => v - count)
+        val minCount = baseMap.values.min
+        val remainder = count - minCount
+        if (remainder >= 0) {
+          baseMap += key -> count // something will get kicked out, so we can add this
+          baseMap.retain((k, v) => v > minCount)
+          baseMap.transform((k, v) => v - minCount)
+        } else {
+          baseMap.transform((k, v) => v - count)
+        }
       }
     }
     this
@@ -90,12 +94,12 @@ private[sql] object FrequentItems extends Logging {
       (name, originalSchema.fields(index).dataType)
     }.toArray

-    val freqItems = df.select(cols.map(Column(_)) : _*).queryExecution.toRdd.aggregate(countMaps)(
+    val freqItems = df.select(cols.map(Column(_)) : _*).rdd.aggregate(countMaps)(
      seqOp = (counts, row) => {
        var i = 0
        while (i < numCols) {
          val thisMap = counts(i)
-         val key = row.get(i, colInfo(i)._2)
+         val key = row.get(i)
          thisMap.add(key, 1L)
          i += 1
        }
@@ -110,13 +114,13 @@ private[sql] object FrequentItems extends Logging {
       baseCounts
     }
    )
-    val justItems = freqItems.map(m => m.baseMap.keys.toArray).map(new GenericArrayData(_))
-    val resultRow = InternalRow(justItems : _*)
+    val justItems = freqItems.map(m => m.baseMap.keys.toArray
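The corrected merge semantics of this commit can be seen in a small, self-contained Python sketch of the counter — a Misra–Gries-style summary that mirrors the fixed Scala logic. This is an illustration only, not the Spark implementation:

```python
class FreqItemCounter:
    """Misra-Gries-style counter capped at `size` distinct keys
    (illustrative sketch mirroring the fixed FreqItemCounter logic)."""

    def __init__(self, size):
        self.size = size
        self.base_map = {}

    def add(self, key, count=1):
        if key in self.base_map:
            self.base_map[key] += count
        elif len(self.base_map) < self.size:
            self.base_map[key] = count
        else:
            min_count = min(self.base_map.values())
            remainder = count - min_count
            if remainder >= 0:
                # Something will get kicked out, so we can add this key,
                # then drop everything at or below min_count and rescale.
                self.base_map[key] = count
                self.base_map = {k: v - min_count
                                 for k, v in self.base_map.items()
                                 if v > min_count}
            else:
                # Key too small to displace anything: just deduct its count.
                self.base_map = {k: v - count for k, v in self.base_map.items()}
        return self

    def merge(self, other):
        for key, count in other.base_map.items():
            self.add(key, count)
        return self

# Reproduce the "Correct Behavior" example from the commit message:
a = FreqItemCounter(3)
for k, v in [(1, 3), (2, 3), (3, 4)]:
    a.add(k, v)
b = FreqItemCounter(3).add(4, 25)
print(a.merge(b).base_map)  # {3: 1, 4: 22}
```

Note how the key with count 25 displaces the two minimum-count keys instead of wiping the whole map, which was the buggy behavior described above.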
spark git commit: [SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten
Repository: spark
Updated Branches: refs/heads/master 076ec0568 -> 98e69467d

[SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten

In short:

1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map corresponds to the very last element observed in the partition, when the elements are strings.

2- Merging two partitions had a bug:

**Existing behavior with size 3**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()

**Correct Behavior:**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)

cc mengxr rxin JoshRosen

Author: Burak Yavuz <brk...@gmail.com>

Closes #7945 from brkyvz/freq-fix and squashes the following commits:

07fa001 [Burak Yavuz] address 2
1dc61a8 [Burak Yavuz] address 1
506753e [Burak Yavuz] fixed and added reg test
47bfd50 [Burak Yavuz] pushing

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/98e69467
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/98e69467
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/98e69467
Branch: refs/heads/master
Commit: 98e69467d4fda2c26a951409b5b7c6f1e9345ce4
Parents: 076ec05
Author: Burak Yavuz <brk...@gmail.com>
Authored: Thu Aug 6 10:29:40 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Thu Aug 6 10:29:40 2015 -0700

--
 .../sql/execution/stat/FrequentItems.scala     | 26 +++-
 .../apache/spark/sql/DataFrameStatSuite.scala  | 24 +++---
 2 files changed, 36 insertions(+), 14 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/98e69467/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
index 9329148..db46302 100644
---
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
@@ -20,17 +20,15 @@ package org.apache.spark.sql.execution.stat

 import scala.collection.mutable.{Map => MutableMap}

 import org.apache.spark.Logging
-import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.types._
-import org.apache.spark.sql.{Column, DataFrame}
+import org.apache.spark.sql.{Row, Column, DataFrame}

 private[sql] object FrequentItems extends Logging {

   /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
   private class FreqItemCounter(size: Int) extends Serializable {
     val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
-
     /**
      * Add a new example to the counts if it exists, otherwise deduct the count
      * from existing items.
@@ -42,9 +40,15 @@ private[sql] object FrequentItems extends Logging {
       if (baseMap.size < size) {
         baseMap += key -> count
       } else {
-        // TODO: Make this more efficient... A flatMap?
-        baseMap.retain((k, v) => v > count)
-        baseMap.transform((k, v) => v - count)
+        val minCount = baseMap.values.min
+        val remainder = count - minCount
+        if (remainder >= 0) {
+          baseMap += key -> count // something will get kicked out, so we can add this
+          baseMap.retain((k, v) => v > minCount)
+          baseMap.transform((k, v) => v - minCount)
+        } else {
+          baseMap.transform((k, v) => v - count)
+        }
       }
     }
     this
@@ -90,12 +94,12 @@ private[sql] object FrequentItems extends Logging {
       (name, originalSchema.fields(index).dataType)
     }.toArray

-    val freqItems = df.select(cols.map(Column(_)) : _*).queryExecution.toRdd.aggregate(countMaps)(
+    val freqItems = df.select(cols.map(Column(_)) : _*).rdd.aggregate(countMaps)(
      seqOp = (counts, row) => {
        var i = 0
        while (i < numCols) {
          val thisMap = counts(i)
-         val key = row.get(i, colInfo(i)._2)
+         val key = row.get(i)
          thisMap.add(key, 1L)
          i += 1
        }
@@ -110,13 +114,13 @@ private[sql] object FrequentItems extends Logging {
       baseCounts
     }
    )
-    val justItems = freqItems.map(m => m.baseMap.keys.toArray).map(new GenericArrayData(_))
-    val resultRow = InternalRow(justItems : _*)
+    val justItems = freqItems.map(m => m.baseMap.keys.toArray)
+    val resultRow = Row(justItems : _*)
     // append frequent Items to the column name for easy debugging
     val outputCols
spark git commit: [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark.
Repository: spark Updated Branches: refs/heads/branch-1.5 350006497 - eedb996dd [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark. mengxr This adds the `BlockMatrix` to PySpark. I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished. Author: Mike Dusenberry mwdus...@us.ibm.com Closes #7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits: 27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes. ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix. b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation. c014002 [Mike Dusenberry] Using properties for better documentation. 3bda6ab [Mike Dusenberry] Adding documentation. 8fb3095 [Mike Dusenberry] Small cleanup. e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark. 
(cherry picked from commit 34dcf10104460816382908b2b8eeb6c925e862bf) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eedb996d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eedb996d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eedb996d Branch: refs/heads/branch-1.5 Commit: eedb996dde5593a97bcb61b3b1515e6fdea6aa70 Parents: 3500064 Author: Mike Dusenberry mwdus...@us.ibm.com Authored: Wed Aug 5 07:40:50 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 5 07:42:25 2015 -0700 -- docs/mllib-data-types.md| 41 +++ .../spark/mllib/api/python/PythonMLLibAPI.scala | 25 ++ python/pyspark/mllib/linalg/distributed.py | 328 ++- 3 files changed, 388 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eedb996d/docs/mllib-data-types.md -- diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md index 11033bf..f0e8d54 100644 --- a/docs/mllib-data-types.md +++ b/docs/mllib-data-types.md @@ -494,6 +494,9 @@ rowMat = mat.toRowMatrix() # Convert to a CoordinateMatrix. coordinateMat = mat.toCoordinateMatrix() + +# Convert to a BlockMatrix. +blockMat = mat.toBlockMatrix() {% endhighlight %} /div @@ -594,6 +597,9 @@ rowMat = mat.toRowMatrix() # Convert to an IndexedRowMatrix. indexedRowMat = mat.toIndexedRowMatrix() + +# Convert to a BlockMatrix. +blockMat = mat.toBlockMatrix() {% endhighlight %} /div @@ -661,4 +667,39 @@ matA.validate(); BlockMatrix ata = matA.transpose().multiply(matA); {% endhighlight %} /div + +div data-lang=python markdown=1 + +A [`BlockMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix) +can be created from an `RDD` of sub-matrix blocks, where a sub-matrix block is a +`((blockRowIndex, blockColIndex), sub-matrix)` tuple. 
+ +{% highlight python %} +from pyspark.mllib.linalg import Matrices +from pyspark.mllib.linalg.distributed import BlockMatrix + +# Create an RDD of sub-matrix blocks. +blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])), + ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))]) + +# Create a BlockMatrix from an RDD of sub-matrix blocks. +mat = BlockMatrix(blocks, 3, 2) + +# Get its size. +m = mat.numRows() # 6 +n = mat.numCols() # 2 + +# Get the blocks as an RDD of sub-matrix blocks. +blocksRDD = mat.blocks + +# Convert to a LocalMatrix. +localMat = mat.toLocalMatrix() + +# Convert to an IndexedRowMatrix. +indexedRowMat = mat.toIndexedRowMatrix() + +# Convert to a CoordinateMatrix. +coordinateMat = mat.toCoordinateMatrix() +{% endhighlight %} +/div /div http://git-wip-us.apache.org/repos/asf/spark/blob/eedb996d/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index d2b3fae..f585aac 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1129,6 +1129,21 @@ private[python] class PythonMLLibAPI extends Serializable { } /** + * Wrapper around BlockMatrix constructor. + */ + def createBlockMatrix(blocks: DataFrame, rowsPerBlock: Int, colsPerBlock: Int, +numRows: Long, numCols: Long
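What toLocalMatrix() produces for such a BlockMatrix can be illustrated with a small pure-Python sketch: each sub-matrix block is placed at the offset determined by its block indices and the per-block dimensions. Plain nested lists stand in for the matrices here; note that Spark's Matrices.dense takes its values in column-major order, so Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6]) is the block [[1, 4], [2, 5], [3, 6]]. This is an illustration of the blocking layout, not PySpark code:

```python
def assemble_local(blocks, rows_per_block, cols_per_block, num_rows, num_cols):
    """Place each ((blockRowIndex, blockColIndex), block) at its offset in a
    dense num_rows x num_cols matrix (row-major nested lists).
    Illustrative sketch of the BlockMatrix layout, not the Spark code."""
    out = [[0.0] * num_cols for _ in range(num_rows)]
    for (bi, bj), block in blocks:
        for r, row in enumerate(block):
            for c, value in enumerate(row):
                out[bi * rows_per_block + r][bj * cols_per_block + c] = value
    return out

# The same two 3x2 blocks as in the example above (written row-major here):
blocks = [((0, 0), [[1, 4], [2, 5], [3, 6]]),
          ((1, 0), [[7, 10], [8, 11], [9, 12]])]
local = assemble_local(blocks, rows_per_block=3, cols_per_block=2,
                       num_rows=6, num_cols=2)
# local == [[1, 4], [2, 5], [3, 6], [7, 10], [8, 11], [9, 12]]
```

Block (1, 0) lands three rows down, which is why mat.numRows() reports 6 while each stored block is only 3x2.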
spark git commit: [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark.
Repository: spark Updated Branches: refs/heads/master 519cf6d3f - 34dcf1010 [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark. mengxr This adds the `BlockMatrix` to PySpark. I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished. Author: Mike Dusenberry mwdus...@us.ibm.com Closes #7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits: 27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes. ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix. b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation. c014002 [Mike Dusenberry] Using properties for better documentation. 3bda6ab [Mike Dusenberry] Adding documentation. 8fb3095 [Mike Dusenberry] Small cleanup. e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34dcf101 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34dcf101 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34dcf101 Branch: refs/heads/master Commit: 34dcf10104460816382908b2b8eeb6c925e862bf Parents: 519cf6d Author: Mike Dusenberry mwdus...@us.ibm.com Authored: Wed Aug 5 07:40:50 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 5 07:40:50 2015 -0700 -- docs/mllib-data-types.md| 41 +++ .../spark/mllib/api/python/PythonMLLibAPI.scala | 25 ++ python/pyspark/mllib/linalg/distributed.py | 328 ++- 3 files changed, 388 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/34dcf101/docs/mllib-data-types.md -- diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md index 11033bf..f0e8d54 100644 --- a/docs/mllib-data-types.md +++ b/docs/mllib-data-types.md @@ -494,6 +494,9 @@ rowMat = mat.toRowMatrix() # Convert to a CoordinateMatrix. coordinateMat = mat.toCoordinateMatrix() + +# Convert to a BlockMatrix. +blockMat = mat.toBlockMatrix() {% endhighlight %} /div @@ -594,6 +597,9 @@ rowMat = mat.toRowMatrix() # Convert to an IndexedRowMatrix. indexedRowMat = mat.toIndexedRowMatrix() + +# Convert to a BlockMatrix. +blockMat = mat.toBlockMatrix() {% endhighlight %} /div @@ -661,4 +667,39 @@ matA.validate(); BlockMatrix ata = matA.transpose().multiply(matA); {% endhighlight %} /div + +div data-lang=python markdown=1 + +A [`BlockMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix) +can be created from an `RDD` of sub-matrix blocks, where a sub-matrix block is a +`((blockRowIndex, blockColIndex), sub-matrix)` tuple. + +{% highlight python %} +from pyspark.mllib.linalg import Matrices +from pyspark.mllib.linalg.distributed import BlockMatrix + +# Create an RDD of sub-matrix blocks. 
+blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])), + ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))]) + +# Create a BlockMatrix from an RDD of sub-matrix blocks. +mat = BlockMatrix(blocks, 3, 2) + +# Get its size. +m = mat.numRows() # 6 +n = mat.numCols() # 2 + +# Get the blocks as an RDD of sub-matrix blocks. +blocksRDD = mat.blocks + +# Convert to a LocalMatrix. +localMat = mat.toLocalMatrix() + +# Convert to an IndexedRowMatrix. +indexedRowMat = mat.toIndexedRowMatrix() + +# Convert to a CoordinateMatrix. +coordinateMat = mat.toCoordinateMatrix() +{% endhighlight %} +/div /div http://git-wip-us.apache.org/repos/asf/spark/blob/34dcf101/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index d2b3fae..f585aac 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1129,6 +1129,21 @@ private[python] class PythonMLLibAPI extends Serializable { } /** + * Wrapper around BlockMatrix constructor. + */ + def createBlockMatrix(blocks: DataFrame, rowsPerBlock: Int, colsPerBlock: Int, +numRows: Long, numCols: Long): BlockMatrix = { +// We use DataFrames for serialization of sub-matrix blocks from +// Python, so map each Row in the DataFrame
spark git commit: [SPARK-5895] [ML] Add VectorSlicer - updated
Repository: spark Updated Branches: refs/heads/master 9c878923d - a018b8571 [SPARK-5895] [ML] Add VectorSlicer - updated Add VectorSlicer transformer to spark.ml, with features specified as either indices or names. Transfers feature attributes for selected features. Updated version of [https://github.com/apache/spark/pull/5731] CC: yinxusen This updates your PR. You'll still be the primary author of this PR. CC: mengxr Author: Xusen Yin yinxu...@gmail.com Author: Joseph K. Bradley jos...@databricks.com Closes #7972 from jkbradley/yinxusen-SPARK-5895 and squashes the following commits: b16e86e [Joseph K. Bradley] fixed scala style 71c65d2 [Joseph K. Bradley] fix import order 86e9739 [Joseph K. Bradley] cleanups per code review 9d8d6f1 [Joseph K. Bradley] style fix 83bc2e9 [Joseph K. Bradley] Updated VectorSlicer 98c6939 [Xusen Yin] fix style error ecbf2d3 [Xusen Yin] change interfaces and params f6be302 [Xusen Yin] Merge branch 'master' into SPARK-5895 e4781f2 [Xusen Yin] fix commit error fd154d7 [Xusen Yin] add test suite of vector slicer 17171f8 [Xusen Yin] fix slicer 9ab9747 [Xusen Yin] add vector slicer aa5a0bf [Xusen Yin] add vector slicer Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a018b857 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a018b857 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a018b857 Branch: refs/heads/master Commit: a018b85716fd510ae95a3c66d676bbdb90f8d4e7 Parents: 9c87892 Author: Xusen Yin yinxu...@gmail.com Authored: Wed Aug 5 17:07:55 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 5 17:07:55 2015 -0700 -- .../apache/spark/ml/feature/VectorSlicer.scala | 170 +++ .../apache/spark/ml/util/MetadataUtils.scala| 17 ++ .../org/apache/spark/mllib/linalg/Vectors.scala | 24 +++ .../spark/ml/feature/VectorSlicerSuite.scala| 109 .../spark/mllib/linalg/VectorsSuite.scala | 7 + 5 files changed, 327 insertions(+) -- 
http://git-wip-us.apache.org/repos/asf/spark/blob/a018b857/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala new file mode 100644 index 000..772bebe --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.attribute.{Attribute, AttributeGroup} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.{IntArrayParam, ParamMap, StringArrayParam} +import org.apache.spark.ml.util.{Identifiable, MetadataUtils, SchemaUtils} +import org.apache.spark.mllib.linalg._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * This class takes a feature vector and outputs a new feature vector with a subarray of the + * original features. 
+ * + * The subset of features can be specified with either indices ([[setIndices()]]) + * or names ([[setNames()]]). At least one feature must be selected. Duplicate features + * are not allowed, so there can be no overlap between selected indices and names. + * + * The output vector will order features with the selected indices first (in the order given), + * followed by the selected names (in the order given). + */ +@Experimental +final class VectorSlicer(override val uid: String) + extends Transformer with HasInputCol with HasOutputCol { + + def this() = this(Identifiable.randomUID(vectorSlicer)) + + /** + * An array of indices to select features from a vector column. + * There can be no overlap with [[names]]. + * @group param
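The selection semantics described in the VectorSlicer doc above — chosen indices first, in the order given, then chosen names, in the order given, with overlap disallowed — can be sketched in a few lines of Python over a plain list of values. This is an illustrative model of the output ordering, not the Spark implementation:

```python
def vector_slice(values, feature_names, indices=(), names=()):
    """Return the sub-vector: features at `indices` first (in the given
    order), then features selected by `names` (in the given order).
    Overlap between the two selections raises, since VectorSlicer
    disallows duplicate features. Illustrative sketch only."""
    name_indices = [feature_names.index(n) for n in names]
    if set(indices) & set(name_indices):
        raise ValueError("a feature was selected by both index and name")
    return [values[i] for i in list(indices) + name_indices]

features = ["f1", "f2", "f3", "f4"]
values = [10.0, 20.0, 30.0, 40.0]
print(vector_slice(values, features, indices=[3], names=["f1", "f2"]))
# [40.0, 10.0, 20.0]
```

The index-selected feature comes first even though its position in the input vector is last, matching the ordering rule stated in the class doc.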
spark git commit: [SPARK-9657] Fix return type of getMaxPatternLength
Repository: spark
Updated Branches: refs/heads/master f9c2a2af1 -> dac090d1e

[SPARK-9657] Fix return type of getMaxPatternLength

mengxr

Author: Feynman Liang <fli...@databricks.com>

Closes #7974 from feynmanliang/SPARK-9657 and squashes the following commits:

7ca533f [Feynman Liang] Fix return type of getMaxPatternLength

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dac090d1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dac090d1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dac090d1
Branch: refs/heads/master
Commit: dac090d1e9be7dec6c5ebdb2a81105b87e853193
Parents: f9c2a2a
Author: Feynman Liang <fli...@databricks.com>
Authored: Wed Aug 5 15:42:18 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 5 15:42:18 2015 -0700

--
 mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/dac090d1/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index d5f0c92..ad6715b5 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -82,7 +82,7 @@ class PrefixSpan private (
   /**
    * Gets the maximal pattern length (i.e. the length of the longest sequential pattern to consider.
    */
-  def getMaxPatternLength: Double = maxPatternLength
+  def getMaxPatternLength: Int = maxPatternLength

   /**
    * Sets maximal pattern length (default: `10`).

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9657] Fix return type of getMaxPatternLength
Repository: spark
Updated Branches: refs/heads/branch-1.5 05cbf133d -> 30e9fcfb3

[SPARK-9657] Fix return type of getMaxPatternLength

mengxr

Author: Feynman Liang <fli...@databricks.com>

Closes #7974 from feynmanliang/SPARK-9657 and squashes the following commits:

7ca533f [Feynman Liang] Fix return type of getMaxPatternLength

(cherry picked from commit dac090d1e9be7dec6c5ebdb2a81105b87e853193)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/30e9fcfb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/30e9fcfb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/30e9fcfb
Branch: refs/heads/branch-1.5
Commit: 30e9fcfb321966c09f86eec4e70c579d6dff1cca
Parents: 05cbf13
Author: Feynman Liang <fli...@databricks.com>
Authored: Wed Aug 5 15:42:18 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 5 15:42:24 2015 -0700

--
 mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/30e9fcfb/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index d5f0c92..ad6715b5 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -82,7 +82,7 @@ class PrefixSpan private (
   /**
    * Gets the maximal pattern length (i.e. the length of the longest sequential pattern to consider.
    */
-  def getMaxPatternLength: Double = maxPatternLength
+  def getMaxPatternLength: Int = maxPatternLength

   /**
    * Sets maximal pattern length (default: `10`).
spark git commit: [SPARK-5895] [ML] Add VectorSlicer - updated
Repository: spark Updated Branches: refs/heads/branch-1.5 618dc63e7 - 3b617e87c [SPARK-5895] [ML] Add VectorSlicer - updated Add VectorSlicer transformer to spark.ml, with features specified as either indices or names. Transfers feature attributes for selected features. Updated version of [https://github.com/apache/spark/pull/5731] CC: yinxusen This updates your PR. You'll still be the primary author of this PR. CC: mengxr Author: Xusen Yin yinxu...@gmail.com Author: Joseph K. Bradley jos...@databricks.com Closes #7972 from jkbradley/yinxusen-SPARK-5895 and squashes the following commits: b16e86e [Joseph K. Bradley] fixed scala style 71c65d2 [Joseph K. Bradley] fix import order 86e9739 [Joseph K. Bradley] cleanups per code review 9d8d6f1 [Joseph K. Bradley] style fix 83bc2e9 [Joseph K. Bradley] Updated VectorSlicer 98c6939 [Xusen Yin] fix style error ecbf2d3 [Xusen Yin] change interfaces and params f6be302 [Xusen Yin] Merge branch 'master' into SPARK-5895 e4781f2 [Xusen Yin] fix commit error fd154d7 [Xusen Yin] add test suite of vector slicer 17171f8 [Xusen Yin] fix slicer 9ab9747 [Xusen Yin] add vector slicer aa5a0bf [Xusen Yin] add vector slicer (cherry picked from commit a018b85716fd510ae95a3c66d676bbdb90f8d4e7) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3b617e87 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3b617e87 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3b617e87 Branch: refs/heads/branch-1.5 Commit: 3b617e87cc8524a86a9d5c4a9971520b91119736 Parents: 618dc63 Author: Xusen Yin yinxu...@gmail.com Authored: Wed Aug 5 17:07:55 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 5 17:08:04 2015 -0700 -- .../apache/spark/ml/feature/VectorSlicer.scala | 170 +++ .../apache/spark/ml/util/MetadataUtils.scala| 17 ++ .../org/apache/spark/mllib/linalg/Vectors.scala | 24 +++ 
.../spark/ml/feature/VectorSlicerSuite.scala| 109 .../spark/mllib/linalg/VectorsSuite.scala | 7 + 5 files changed, 327 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3b617e87/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala new file mode 100644 index 000..772bebe --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.attribute.{Attribute, AttributeGroup} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.{IntArrayParam, ParamMap, StringArrayParam} +import org.apache.spark.ml.util.{Identifiable, MetadataUtils, SchemaUtils} +import org.apache.spark.mllib.linalg._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * This class takes a feature vector and outputs a new feature vector with a subarray of the + * original features. + * + * The subset of features can be specified with either indices ([[setIndices()]]) + * or names ([[setNames()]]). At least one feature must be selected. Duplicate features + * are not allowed, so there can be no overlap between selected indices and names. + * + * The output vector will order features with the selected indices first (in the order given), + * followed by the selected names (in the order given). + */ +@Experimental +final class VectorSlicer(override val uid: String) + extends Transformer with HasInputCol with HasOutputCol { + + def this() = this(Identifiable.randomUID(vectorSlicer
spark git commit: [SPARK-9540] [MLLIB] optimize PrefixSpan implementation
Repository: spark Updated Branches: refs/heads/branch-1.5 6e72d24e2 -> bca196754 [SPARK-9540] [MLLIB] optimize PrefixSpan implementation This is a major refactoring of the PrefixSpan implementation. It contains the following changes: 1. Expand the prefix with one item at a time. The existing implementation generates all subsets for each itemset, which might have scalability issues when the itemset is large. 2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` because negative numbers are used to indicate partial prefix items, e.g., `_2` is represented by `-2`. 3. Remember the start indices of all partial projections in the projected postfix to help the next projection. 4. Reuse the original sequence array for projected postfixes. 5. Use `Prefix` IDs in aggregation rather than their content. 6. Use `ArrayBuilder` for building primitive arrays. 7. Expose `maxLocalProjDBSize`. 8. Tests are not changed except for using `0` instead of `-1` as the delimiter. `Postfix`'s API doc should be a good place to start. 
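The internal format in item 2 is easy to illustrate outside of Spark. Below is a minimal plain-Python sketch (not the commit's Scala code; `encode` and `decode` are names we made up) of flattening a sequence of itemsets into the delimiter-separated integer array:

```python
DELIMITER = 0

def encode(sequence):
    """Flatten a sequence of itemsets into the internal format: a 0
    before each itemset and a trailing 0, items sorted within an
    itemset. E.g. <(12)(31)> -> [0, 1, 2, 0, 1, 3, 0]."""
    out = [DELIMITER]
    for itemset in sequence:
        out.extend(sorted(itemset))
        out.append(DELIMITER)
    return out

def decode(encoded):
    """Recover the sequence of itemsets from the flat representation."""
    sequence, current = [], []
    for x in encoded[1:]:
        if x == DELIMITER:
            sequence.append(current)
            current = []
        else:
            # A negative value marks a partial prefix item (e.g. _2 -> -2).
            current.append(abs(x))
    return sequence

print(encode([[1, 2], [3, 1]]))  # [0, 1, 2, 0, 1, 3, 0]
```

Note that itemsets come back sorted, so `decode(encode(s))` normalizes item order within each itemset.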
Closes #7594 feynmanliang zhangjiajin Author: Xiangrui Meng m...@databricks.com Closes #7937 from mengxr/SPARK-9540 and squashes the following commits: 2d0ec31 [Xiangrui Meng] address more comments 48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test 65f90e8 [Xiangrui Meng] naming and documentation 8afc86a [Xiangrui Meng] refactor impl (cherry picked from commit a02bcf20c4fc9e2e182630d197221729e996afc2) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bca19675 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bca19675 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bca19675 Branch: refs/heads/branch-1.5 Commit: bca196754ddf2ccd057d775bd5c3f7d3e5657e6f Parents: 6e72d24 Author: Xiangrui Meng m...@databricks.com Authored: Tue Aug 4 22:28:49 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 22:28:58 2015 -0700 -- .../spark/mllib/fpm/LocalPrefixSpan.scala | 132 +++-- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 587 --- .../spark/mllib/fpm/PrefixSpanSuite.scala | 271 + 3 files changed, 599 insertions(+), 391 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bca19675/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala index ccebf95..3ea1077 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala @@ -22,85 +22,89 @@ import scala.collection.mutable import org.apache.spark.Logging /** - * Calculate all patterns of a projected database in local. + * Calculate all patterns of a projected database in local mode. 
+ * + * @param minCount minimal count for a frequent pattern + * @param maxPatternLength max pattern length for a frequent pattern */ -private[fpm] object LocalPrefixSpan extends Logging with Serializable { - import PrefixSpan._ +private[fpm] class LocalPrefixSpan( +val minCount: Long, +val maxPatternLength: Int) extends Logging with Serializable { + import PrefixSpan.Postfix + import LocalPrefixSpan.ReversedPrefix + /** - * Calculate all patterns of a projected database. - * @param minCount minimum count - * @param maxPatternLength maximum pattern length - * @param prefixes prefixes in reversed order - * @param database the projected database - * @return a set of sequential pattern pairs, - * the key of pair is sequential pattern (a list of items in reversed order), - * the value of pair is the pattern's count. + * Generates frequent patterns on the input array of postfixes. + * @param postfixes an array of postfixes + * @return an iterator of (frequent pattern, count) */ - def run( - minCount: Long, - maxPatternLength: Int, - prefixes: List[Set[Int]], - database: Iterable[List[Set[Int]]]): Iterator[(List[Set[Int]], Long)] = { -if (prefixes.length == maxPatternLength || database.isEmpty) { - return Iterator.empty -} -val freqItemSetsAndCounts = getFreqItemAndCounts(minCount, database) -val freqItems = freqItemSetsAndCounts.keys.flatten.toSet -val filteredDatabase = database.map { suffix = - suffix -.map(item = freqItems.intersect(item)) -.filter(_.nonEmpty) -} -freqItemSetsAndCounts.iterator.flatMap { case (item, count) = - val newPrefixes = item :: prefixes - val
spark git commit: [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
conversion logic. 4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace. 93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request. f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request. 6a3ecb7 [Mike Dusenberry] Updating pattern matching. 08f287b [Mike Dusenberry] Slight reformatting of the documentation. a245dc0 [Mike Dusenberry] Updating Python doctests for compatibility between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputted as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints. 4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines. 7e3ca16 [Mike Dusenberry] Fixing long lines. f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices. ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful. dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests. 0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization. 3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. 
This way, we can call a conversion method on, for example, one distributed matrix type, which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on the DistributedMatrices class. 4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix. 23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs. b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters. bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods. d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg API. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/571d5b53 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/571d5b53 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/571d5b53 Branch: refs/heads/master Commit: 571d5b5363ff4dbbce1f7019ab8e86cbc3cba4d5 Parents: 1833d9c Author: Mike Dusenberry mwdus...@us.ibm.com Authored: Tue Aug 4 16:30:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 16:30:03 2015 -0700 -- dev/sparktestsupport/modules.py | 1 + docs/mllib-data-types.md| 106 +++- .../spark/mllib/api/python/PythonMLLibAPI.scala | 53 +- python/docs/pyspark.mllib.rst | 8 + python/pyspark/mllib/common.py | 2 + python/pyspark/mllib/linalg/distributed.py | 537 +++ 6 files changed, 704 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/571d5b53/dev/sparktestsupport
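The "thin wrapper around the Java object" design described in this commit message can be sketched as follows. This is an illustrative Python skeleton, not the actual PySpark classes: the names `DistributedMatrix`, `num_rows`, and `FakeJavaRowMatrix` are ours, and the real wrappers reach the JVM object through Py4J rather than a plain Python stand-in.

```python
class DistributedMatrix:
    """Thin wrapper: holds a reference to a backing (Java) object taken
    as a constructor argument and delegates all queries to it."""
    def __init__(self, java_model):
        self._java_model = java_model

    def num_rows(self):
        return self._java_model.numRows()

    def num_cols(self):
        return self._java_model.numCols()

class FakeJavaRowMatrix:
    """Stand-in for the JVM-side RowMatrix reached through Py4J."""
    def numRows(self):
        return 3

    def numCols(self):
        return 2

m = DistributedMatrix(FakeJavaRowMatrix())
print(m.num_rows(), m.num_cols())  # 3 2
```

Because the Python object is only a handle, conversions between matrix types reduce to calling a JVM method and wrapping whatever Java object comes back, which is the design choice the commit message motivates.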
spark git commit: [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
conversion logic. 4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace. 93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request. f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request. 6a3ecb7 [Mike Dusenberry] Updating pattern matching. 08f287b [Mike Dusenberry] Slight reformatting of the documentation. a245dc0 [Mike Dusenberry] Updating Python doctests for compatibility between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputted as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints. 4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines. 7e3ca16 [Mike Dusenberry] Fixing long lines. f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices. ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful. dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests. 0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization. 3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. 
This way, we can call a conversion method on, for example, one distributed matrix type, which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on the DistributedMatrices class. 4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix. 23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs. b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters. bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods. d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg API. 
(cherry picked from commit 571d5b5363ff4dbbce1f7019ab8e86cbc3cba4d5) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4e125ac Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4e125ac Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4e125ac Branch: refs/heads/branch-1.5 Commit: f4e125acf36023425722abb0fb74be63a425aa7b Parents: fe4a4f4 Author: Mike Dusenberry mwdus...@us.ibm.com Authored: Tue Aug 4 16:30:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 16:30:11 2015 -0700 -- dev/sparktestsupport/modules.py | 1 + docs/mllib-data-types.md| 106 +++- .../spark/mllib/api/python/PythonMLLibAPI.scala | 53 +- python/docs/pyspark.mllib.rst | 8 + python/pyspark/mllib/common.py | 2 + python/pyspark/mllib/linalg/distributed.py | 537 +++ 6 files changed, 704 insertions(+), 3 deletions
spark git commit: [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol
Repository: spark Updated Branches: refs/heads/branch-1.5 f4e125acf - cff0fe291 [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol. I don't think setScoreCol was actually used anywhere (based on search). CC: mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following commits: e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol (cherry picked from commit b77d3b9688d56d33737909375d1d0db07da5827b) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cff0fe29 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cff0fe29 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cff0fe29 Branch: refs/heads/branch-1.5 Commit: cff0fe291aa470ef5cf4e5087c7114fb6360572f Parents: f4e125a Author: Joseph K. 
Bradley jos...@databricks.com Authored: Tue Aug 4 16:52:43 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 16:52:53 2015 -0700 -- .../spark/ml/evaluation/BinaryClassificationEvaluator.scala | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cff0fe29/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala index 4a82b77..5d5cb7e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.types.DoubleType /** * :: Experimental :: - * Evaluator for binary classification, which expects two input columns: score and label. + * Evaluator for binary classification, which expects two input columns: rawPrediction and label. */ @Experimental class BinaryClassificationEvaluator(override val uid: String) @@ -50,6 +50,13 @@ class BinaryClassificationEvaluator(override val uid: String) def setMetricName(value: String): this.type = set(metricName, value) /** @group setParam */ + def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value) + + /** + * @group setParam + * @deprecated use [[setRawPredictionCol()]] instead + */ + @deprecated(use setRawPredictionCol instead, 1.5.0) def setScoreCol(value: String): this.type = set(rawPredictionCol, value) /** @group setParam */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
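The diff above shows a common rename-with-deprecation pattern: the new setter writes the parameter, and the old setter becomes a deprecated alias that delegates to it, so existing callers keep working while being warned. A hypothetical Python analog (not PySpark's actual evaluator; the class and method names here are ours) looks like this:

```python
import warnings

class BinaryClassificationEvaluatorSketch:
    """Illustrates the rename pattern: the deprecated old setter writes
    to the *new* parameter, so both paths end up in the same place."""
    def __init__(self):
        self._params = {}

    def set_raw_prediction_col(self, value):
        # The new, preferred setter.
        self._params["rawPredictionCol"] = value
        return self

    def set_score_col(self, value):
        # Deprecated alias: warn, then delegate to the new setter.
        warnings.warn("use set_raw_prediction_col instead",
                      DeprecationWarning, stacklevel=2)
        return self.set_raw_prediction_col(value)

ev = BinaryClassificationEvaluatorSketch().set_score_col("score")
print(ev._params)  # {'rawPredictionCol': 'score'}
```

Keeping only one underlying parameter (as the Scala code does with `rawPredictionCol`) means there is no state to migrate when the old name is eventually removed.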
spark git commit: [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol
Repository: spark Updated Branches: refs/heads/master 571d5b536 - b77d3b968 [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol. I don't think setScoreCol was actually used anywhere (based on search). CC: mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following commits: e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b77d3b96 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b77d3b96 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b77d3b96 Branch: refs/heads/master Commit: b77d3b9688d56d33737909375d1d0db07da5827b Parents: 571d5b5 Author: Joseph K. Bradley jos...@databricks.com Authored: Tue Aug 4 16:52:43 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 16:52:43 2015 -0700 -- .../spark/ml/evaluation/BinaryClassificationEvaluator.scala | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b77d3b96/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala index 4a82b77..5d5cb7e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.types.DoubleType /** * :: Experimental :: - * Evaluator for binary classification, which expects two input columns: score and label. 
+ * Evaluator for binary classification, which expects two input columns: rawPrediction and label. */ @Experimental class BinaryClassificationEvaluator(override val uid: String) @@ -50,6 +50,13 @@ class BinaryClassificationEvaluator(override val uid: String) def setMetricName(value: String): this.type = set(metricName, value) /** @group setParam */ + def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value) + + /** + * @group setParam + * @deprecated use [[setRawPredictionCol()]] instead + */ + @deprecated(use setRawPredictionCol instead, 1.5.0) def setScoreCol(value: String): this.type = set(rawPredictionCol, value) /** @group setParam */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9544] [MLLIB] add Python API for RFormula
Repository: spark Updated Branches: refs/heads/branch-1.5 444058d91 - dc0c8c982 [SPARK-9544] [MLLIB] add Python API for RFormula Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder Author: Xiangrui Meng m...@databricks.com Closes #7879 from mengxr/SPARK-9544 and squashes the following commits: 3d5ff03 [Xiangrui Meng] add an doctest for . and - 5e969a5 [Xiangrui Meng] fix pydoc 1cd41f8 [Xiangrui Meng] organize imports 3c18b10 [Xiangrui Meng] add Python API for RFormula (cherry picked from commit e4765a46833baff1dd7465c4cf50e947de7e8f21) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc0c8c98 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc0c8c98 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc0c8c98 Branch: refs/heads/branch-1.5 Commit: dc0c8c982825c3c58b7c6c4570c03ba97dba608b Parents: 444058d Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 3 13:59:35 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 3 13:59:45 2015 -0700 -- .../org/apache/spark/ml/feature/RFormula.scala | 21 ++--- python/pyspark/ml/feature.py| 85 +++- 2 files changed, 91 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dc0c8c98/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index d172691..d5360c9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -19,16 +19,14 @@ package org.apache.spark.ml.feature import scala.collection.mutable import scala.collection.mutable.ArrayBuffer -import scala.util.parsing.combinator.RegexParsers import 
org.apache.spark.annotation.Experimental -import org.apache.spark.ml.{Estimator, Model, Transformer, Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} import org.apache.spark.ml.param.{Param, ParamMap} import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} import org.apache.spark.ml.util.Identifiable import org.apache.spark.mllib.linalg.VectorUDT import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ /** @@ -63,31 +61,26 @@ class RFormula(override val uid: String) extends Estimator[RFormulaModel] with R */ val formula: Param[String] = new Param(this, formula, R model formula) - private var parsedFormula: Option[ParsedRFormula] = None - /** * Sets the formula to use for this transformer. Must be called before use. * @group setParam * @param value an R formula in string form (e.g. y ~ x + z) */ - def setFormula(value: String): this.type = { -parsedFormula = Some(RFormulaParser.parse(value)) -set(formula, value) -this - } + def setFormula(value: String): this.type = set(formula, value) /** @group getParam */ def getFormula: String = $(formula) /** Whether the formula specifies fitting an intercept. */ private[ml] def hasIntercept: Boolean = { -require(parsedFormula.isDefined, Must call setFormula() first.) -parsedFormula.get.hasIntercept +require(isDefined(formula), Formula must be defined first.) +RFormulaParser.parse($(formula)).hasIntercept } override def fit(dataset: DataFrame): RFormulaModel = { -require(parsedFormula.isDefined, Must call setFormula() first.) -val resolvedFormula = parsedFormula.get.resolve(dataset.schema) +require(isDefined(formula), Formula must be defined first.) +val parsedFormula = RFormulaParser.parse($(formula)) +val resolvedFormula = parsedFormula.resolve(dataset.schema) // StringType terms and terms representing interactions need to be encoded before assembly. 
// TODO(ekl) add support for feature interactions val encoderStages = ArrayBuffer[PipelineStage]() http://git-wip-us.apache.org/repos/asf/spark/blob/dc0c8c98/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 015e7a9..3f04c41 100644 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -24,7 +24,7 @@ from pyspark.mllib.common import inherit_doc __all__ = ['Binarizer', 'HashingTF', 'IDF', 'IDFModel', 'NGram', 'Normalizer', 'OneHotEncoder
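The RFormula refactoring in this diff replaces cached parse state (the `parsedFormula: Option[...]` field) with parsing on demand, so `setFormula` only stores the string and nothing can fall out of sync with the parameter. A rough Python sketch of the same idea, with a deliberately toy parser standing in for `RFormulaParser.parse` (names and parsing rules here are ours, not Spark's):

```python
class RFormulaSketch:
    """setFormula only stores the string; parsing happens on demand,
    mirroring the 'parse at use time' refactoring."""
    def __init__(self):
        self._formula = None

    def set_formula(self, value):
        self._formula = value  # store only; no eager parse, no cache
        return self

    def _parse(self):
        # Stand-in for RFormulaParser.parse: 'y ~ x + z' -> label + terms.
        if self._formula is None:
            raise ValueError("Formula must be defined first.")
        label, rhs = self._formula.split("~")
        return label.strip(), [t.strip() for t in rhs.split("+")]

    def has_intercept(self):
        # Toy rule: a literal '0' term suppresses the intercept.
        # (The real parser handles much more than this.)
        return "0" not in self._parse()[1]

r = RFormulaSketch().set_formula("y ~ x + z")
print(r._parse())  # ('y', ['x', 'z'])
```

The trade-off is re-parsing on each use, which is cheap for a short formula string and removes a whole class of "set but not parsed" bugs.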
spark git commit: [SPARK-9544] [MLLIB] add Python API for RFormula
Repository: spark Updated Branches: refs/heads/master 8ca287ebb - e4765a468 [SPARK-9544] [MLLIB] add Python API for RFormula Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder Author: Xiangrui Meng m...@databricks.com Closes #7879 from mengxr/SPARK-9544 and squashes the following commits: 3d5ff03 [Xiangrui Meng] add an doctest for . and - 5e969a5 [Xiangrui Meng] fix pydoc 1cd41f8 [Xiangrui Meng] organize imports 3c18b10 [Xiangrui Meng] add Python API for RFormula Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e4765a46 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e4765a46 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e4765a46 Branch: refs/heads/master Commit: e4765a46833baff1dd7465c4cf50e947de7e8f21 Parents: 8ca287e Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 3 13:59:35 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 3 13:59:35 2015 -0700 -- .../org/apache/spark/ml/feature/RFormula.scala | 21 ++--- python/pyspark/ml/feature.py| 85 +++- 2 files changed, 91 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e4765a46/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index d172691..d5360c9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -19,16 +19,14 @@ package org.apache.spark.ml.feature import scala.collection.mutable import scala.collection.mutable.ArrayBuffer -import scala.util.parsing.combinator.RegexParsers import org.apache.spark.annotation.Experimental -import org.apache.spark.ml.{Estimator, Model, Transformer, Pipeline, PipelineModel, 
PipelineStage} +import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} import org.apache.spark.ml.param.{Param, ParamMap} import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} import org.apache.spark.ml.util.Identifiable import org.apache.spark.mllib.linalg.VectorUDT import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ /** @@ -63,31 +61,26 @@ class RFormula(override val uid: String) extends Estimator[RFormulaModel] with R */ val formula: Param[String] = new Param(this, formula, R model formula) - private var parsedFormula: Option[ParsedRFormula] = None - /** * Sets the formula to use for this transformer. Must be called before use. * @group setParam * @param value an R formula in string form (e.g. y ~ x + z) */ - def setFormula(value: String): this.type = { -parsedFormula = Some(RFormulaParser.parse(value)) -set(formula, value) -this - } + def setFormula(value: String): this.type = set(formula, value) /** @group getParam */ def getFormula: String = $(formula) /** Whether the formula specifies fitting an intercept. */ private[ml] def hasIntercept: Boolean = { -require(parsedFormula.isDefined, Must call setFormula() first.) -parsedFormula.get.hasIntercept +require(isDefined(formula), Formula must be defined first.) +RFormulaParser.parse($(formula)).hasIntercept } override def fit(dataset: DataFrame): RFormulaModel = { -require(parsedFormula.isDefined, Must call setFormula() first.) -val resolvedFormula = parsedFormula.get.resolve(dataset.schema) +require(isDefined(formula), Formula must be defined first.) +val parsedFormula = RFormulaParser.parse($(formula)) +val resolvedFormula = parsedFormula.resolve(dataset.schema) // StringType terms and terms representing interactions need to be encoded before assembly. 
// TODO(ekl) add support for feature interactions val encoderStages = ArrayBuffer[PipelineStage]() http://git-wip-us.apache.org/repos/asf/spark/blob/e4765a46/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 015e7a9..3f04c41 100644 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -24,7 +24,7 @@ from pyspark.mllib.common import inherit_doc __all__ = ['Binarizer', 'HashingTF', 'IDF', 'IDFModel', 'NGram', 'Normalizer', 'OneHotEncoder', 'PolynomialExpansion', 'RegexTokenizer', 'StandardScaler', 'StandardScalerModel', 'StringIndexer', 'StringIndexerModel
spark git commit: [SPARK-9000] [MLLIB] Support generic item types in PrefixSpan
Repository: spark Updated Branches: refs/heads/master 57084e0c7 - 28d944e86 [SPARK-9000] [MLLIB] Support generic item types in PrefixSpan mengxr Please review after #7818 merges and master is rebased. Continues work by rikima Closes #7400 Author: Feynman Liang fli...@databricks.com Author: masaki rikitoku rikima3...@gmail.com Closes #7837 from feynmanliang/SPARK-7400-genericItems and squashes the following commits: 8b2c756 [Feynman Liang] Remove orig 92443c8 [Feynman Liang] Style fixes 42c6349 [Feynman Liang] Style fix 14e67fc [Feynman Liang] Generic prefixSpan itemtypes b3b21e0 [Feynman Liang] Initial support for generic itemtype in public api b86e0d5 [masaki rikitoku] modify to support generic item type Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/28d944e8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/28d944e8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/28d944e8 Branch: refs/heads/master Commit: 28d944e86d066eb4c651dd803f0b022605ed644e Parents: 57084e0 Author: Feynman Liang fli...@databricks.com Authored: Sat Aug 1 23:11:25 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Sat Aug 1 23:11:25 2015 -0700 -- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 40 ++- .../spark/mllib/fpm/PrefixSpanSuite.scala | 104 +-- 2 files changed, 132 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/28d944e8/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala index 22b4ddb..c1761c3 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala @@ -18,6 +18,7 @@ package org.apache.spark.mllib.fpm import scala.collection.mutable.ArrayBuilder +import scala.reflect.ClassTag import 
org.apache.spark.Logging import org.apache.spark.annotation.Experimental @@ -90,15 +91,44 @@ class PrefixSpan private ( } /** - * Find the complete set of sequential patterns in the input sequences. + * Find the complete set of sequential patterns in the input sequences of itemsets. + * @param data ordered sequences of itemsets. + * @return (sequential itemset pattern, count) tuples + */ + def run[Item: ClassTag](data: RDD[Array[Array[Item]]]): RDD[(Array[Array[Item]], Long)] = { +val itemToInt = data.aggregate(Set[Item]())( + seqOp = { (uniqItems, item) = uniqItems ++ item.flatten.toSet }, + combOp = { _ ++ _ } +).zipWithIndex.toMap +val intToItem = Map() ++ (itemToInt.map { case (k, v) = (v, k) }) + +val dataInternalRepr = data.map { seq = + seq.map(itemset = itemset.map(itemToInt)).reduce((a, b) = a ++ (DELIMITER +: b)) +} +val results = run(dataInternalRepr) + +def toPublicRepr(pattern: Iterable[Int]): List[Array[Item]] = { + pattern.span(_ != DELIMITER) match { +case (x, xs) if xs.size 1 = x.map(intToItem).toArray :: toPublicRepr(xs.tail) +case (x, xs) = List(x.map(intToItem).toArray) + } +} +results.map { case (seq: Array[Int], count: Long) = + (toPublicRepr(seq).toArray, count) +} + } + + /** + * Find the complete set of sequential patterns in the input sequences. This method utilizes + * the internal representation of itemsets as Array[Int] where each itemset is represented by + * a contiguous sequence of non-negative integers and delimiters represented by [[DELIMITER]]. * @param data ordered sequences of itemsets. Items are represented by non-negative integers. - * Each itemset has one or more items and is delimited by [[DELIMITER]]. + * Each itemset has one or more items and is delimited by [[DELIMITER]]. * @return a set of sequential pattern pairs, * the key of pair is pattern (a list of elements), * the value of pair is the pattern's count. 
*/ - // TODO: generalize to arbitrary item-types and use mapping to Ints for internal algorithm - def run(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = { + private[fpm] def run(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = { val sc = data.sparkContext if (data.getStorageLevel == StorageLevel.NONE) { @@ -260,7 +290,7 @@ class PrefixSpan private ( private[fpm] object PrefixSpan { private[fpm] val DELIMITER = -1 - /** Splits a sequence of itemsets delimited by [[DELIMITER]]. */ + /** Splits an array of itemsets delimited by [[DELIMITER]]. */ private[fpm] def splitSequence(sequence: List[Int]): List[Set[Int]] = { sequence.span
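The commit above replaces the Int-only public `run` with a generic one: distinct items are mapped to Ints, each sequence of itemsets is flattened into a delimiter-separated Int array for the internal algorithm, and results are mapped back. A minimal single-machine Python stand-in for that encode/decode round trip (plain lists instead of RDDs; `encode`/`decode` are hypothetical names, not Spark API):

```python
DELIMITER = -1

def encode(sequences):
    """Map arbitrary hashable items to ints and flatten each sequence of
    itemsets into one delimiter-separated int list (mirrors the diff's
    itemToInt/intToItem maps; sorted() makes the mapping deterministic)."""
    items = sorted({it for seq in sequences for itemset in seq for it in itemset})
    item_to_int = {it: i for i, it in enumerate(items)}
    int_to_item = {i: it for it, i in item_to_int.items()}
    encoded = []
    for seq in sequences:
        flat = []
        for k, itemset in enumerate(seq):
            if k > 0:
                flat.append(DELIMITER)  # itemset boundary
            flat.extend(item_to_int[it] for it in itemset)
        encoded.append(flat)
    return encoded, int_to_item

def decode(flat, int_to_item):
    """Split a delimiter-separated int list back into itemsets of the
    original item type (the toPublicRepr direction)."""
    itemsets, current = [], []
    for x in flat:
        if x == DELIMITER:
            itemsets.append([int_to_item[i] for i in current])
            current = []
        else:
            current.append(x)
    itemsets.append([int_to_item[i] for i in current])
    return itemsets
```

For example, `[["a","b"],["c"]]` encodes to `[0, 1, -1, 2]` and decodes back unchanged.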
spark git commit: [SPARK-9527] [MLLIB] add PrefixSpanModel and make PrefixSpan Java friendly
Repository: spark Updated Branches: refs/heads/master 8eafa2aeb - 66924ffa6 [SPARK-9527] [MLLIB] add PrefixSpanModel and make PrefixSpan Java friendly 1. Use `PrefixSpanModel` to wrap the frequent sequences. 2. Define `FreqSequence` to wrap each frequent sequence, which contains a Java-friendly method `javaSequence` 3. Overload `run` for Java users. 4. Added a unit test in Java to check Java compatibility. zhangjiajin feynmanliang Author: Xiangrui Meng m...@databricks.com Closes #7869 from mengxr/SPARK-9527 and squashes the following commits: 4345594 [Xiangrui Meng] add PrefixSpanModel and make PrefixSpan Java friendly Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/66924ffa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/66924ffa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/66924ffa Branch: refs/heads/master Commit: 66924ffa6bdb8e0df1b90b789cb7ad443377e729 Parents: 8eafa2a Author: Xiangrui Meng m...@databricks.com Authored: Sun Aug 2 11:50:17 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Sun Aug 2 11:50:17 2015 -0700 -- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 52 +-- .../spark/mllib/fpm/JavaPrefixSpanSuite.java| 67 .../spark/mllib/fpm/PrefixSpanSuite.scala | 8 +-- 3 files changed, 118 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/66924ffa/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala index c1761c3..9eaf733 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala @@ -17,11 +17,16 @@ package org.apache.spark.mllib.fpm +import java.{lang = jl, util = ju} + +import scala.collection.JavaConverters._ import scala.collection.mutable.ArrayBuilder import 
scala.reflect.ClassTag import org.apache.spark.Logging import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.java.JavaSparkContext.fakeClassTag import org.apache.spark.rdd.RDD import org.apache.spark.storage.StorageLevel @@ -93,9 +98,9 @@ class PrefixSpan private ( /** * Find the complete set of sequential patterns in the input sequences of itemsets. * @param data ordered sequences of itemsets. - * @return (sequential itemset pattern, count) tuples + * @return a [[PrefixSpanModel]] that contains the frequent sequences */ - def run[Item: ClassTag](data: RDD[Array[Array[Item]]]): RDD[(Array[Array[Item]], Long)] = { + def run[Item: ClassTag](data: RDD[Array[Array[Item]]]): PrefixSpanModel[Item] = { val itemToInt = data.aggregate(Set[Item]())( seqOp = { (uniqItems, item) = uniqItems ++ item.flatten.toSet }, combOp = { _ ++ _ } @@ -113,9 +118,25 @@ class PrefixSpan private ( case (x, xs) = List(x.map(intToItem).toArray) } } -results.map { case (seq: Array[Int], count: Long) = - (toPublicRepr(seq).toArray, count) +val freqSequences = results.map { case (seq: Array[Int], count: Long) = + new FreqSequence[Item](toPublicRepr(seq).toArray, count) } +new PrefixSpanModel[Item](freqSequences) + } + + /** + * A Java-friendly version of [[run()]] that reads sequences from a [[JavaRDD]] and returns + * frequent sequences in a [[PrefixSpanModel]]. 
+ * @param data ordered sequences of itemsets stored as Java Iterable of Iterables + * @tparam Item item type + * @tparam Itemset itemset type, which is an Iterable of Items + * @tparam Sequence sequence type, which is an Iterable of Itemsets + * @return a [[PrefixSpanModel]] that contains the frequent sequences + */ + def run[Item, Itemset <: jl.Iterable[Item], Sequence <: jl.Iterable[Itemset]]( + data: JavaRDD[Sequence]): PrefixSpanModel[Item] = { +implicit val tag = fakeClassTag[Item] +run(data.rdd.map(_.asScala.map(_.asScala.toArray).toArray)) } /** @@ -287,7 +308,7 @@ class PrefixSpan private ( } -private[fpm] object PrefixSpan { +object PrefixSpan { private[fpm] val DELIMITER = -1 /** Splits an array of itemsets delimited by [[DELIMITER]]. */ @@ -313,4 +334,25 @@ private[fpm] object PrefixSpan { // TODO: improve complexity by using partial prefixes, considering one item at a time itemSet.subsets.filter(_ != Set.empty[Int]) } + + /** + * Represents a frequent sequence. + * @param sequence a sequence of itemsets stored as an Array of Arrays + * @param freq frequency + * @tparam Item item type + */ + class
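The SPARK-9527 commit wraps raw `(sequence, count)` tuples in `FreqSequence` objects held by a `PrefixSpanModel`, so Java callers get a typed result rather than Scala tuples. A minimal Python sketch of that wrapper shape (field names follow the Scala classes; the Java-friendly `javaSequence` accessor is omitted):

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

Item = TypeVar("Item")

@dataclass
class FreqSequence(Generic[Item]):
    # One frequent sequence: its itemsets plus how often it occurs.
    sequence: List[List[Item]]
    freq: int

class PrefixSpanModel(Generic[Item]):
    # Thin model wrapper over the frequent sequences, mirroring the Scala
    # PrefixSpanModel returned by run().
    def __init__(self, freq_sequences: List[FreqSequence[Item]]):
        self.freq_sequences = freq_sequences
```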
spark git commit: [SPARK-8999] [MLLIB] PrefixSpan non-temporal sequences
Repository: spark Updated Branches: refs/heads/master 65038973a - d2a9b66f6 [SPARK-8999] [MLLIB] PrefixSpan non-temporal sequences mengxr Extends PrefixSpan to non-temporal itemsets. Continues work by zhangjiajin * Internal API uses List[Set[Int]] which is likely not efficient; will need to refactor during QA Closes #7646 Author: zhangjiajin zhangjia...@huawei.com Author: Feynman Liang fli...@databricks.com Author: zhang jiajin zhangjia...@huawei.com Closes #7818 from feynmanliang/SPARK-8999-nonTemporal and squashes the following commits: 4ded81d [Feynman Liang] Replace all filters to filter nonempty 350e67e [Feynman Liang] Code review feedback 03156ca [Feynman Liang] Fix tests, drop delimiters at boundaries of sequences d1fe0ed [Feynman Liang] Remove comments 86ca4e5 [Feynman Liang] Fix style 7c7bf39 [Feynman Liang] Fixed itemSet sequences 6073b10 [Feynman Liang] Basic itemset functionality, failing test 1a7fb48 [Feynman Liang] Add delimiter to results 5db00aa [Feynman Liang] Working for items, not itemsets 6787716 [Feynman Liang] Working on temporal sequences f1114b9 [Feynman Liang] Add -1 delimiter 00fe756 [Feynman Liang] Reset base files for rebase f486dcd [zhangjiajin] change maxLocalProjDBSize and fix a bug (remove -3 from frequent items). 60a0b76 [zhangjiajin] fixed a scala style error. 740c203 [zhangjiajin] fixed a scala style error. 5785cb8 [zhangjiajin] support non-temporal sequence a5d649d [zhangjiajin] restore original version 09dc409 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into multiItems_2 ae8c02d [zhangjiajin] Fixed some Scala style errors. 216ab0c [zhangjiajin] Support non-temporal sequence in PrefixSpan b572f54 [zhangjiajin] initialize file before rebase. f06772f [zhangjiajin] fix a scala style error. a7e50d4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan. 
c1d13d0 [zhang jiajin] Delete PrefixspanSuite.scala d9d8137 [zhang jiajin] Delete Prefixspan.scala c6ceb63 [zhangjiajin] Add new algorithm PrefixSpan and test file. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2a9b66f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2a9b66f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2a9b66f Branch: refs/heads/master Commit: d2a9b66f6c0de89d6d16370af1c77c7f51b11d3e Parents: 6503897 Author: zhangjiajin zhangjia...@huawei.com Authored: Sat Aug 1 01:56:27 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Sat Aug 1 01:56:27 2015 -0700 -- .../spark/mllib/fpm/LocalPrefixSpan.scala | 46 ++-- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 111 ++--- .../spark/mllib/fpm/PrefixSpanSuite.scala | 237 --- 3 files changed, 302 insertions(+), 92 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2a9b66f/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala index 0ea7920..ccebf95 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala @@ -25,7 +25,7 @@ import org.apache.spark.Logging * Calculate all patterns of a projected database in local. */ private[fpm] object LocalPrefixSpan extends Logging with Serializable { - + import PrefixSpan._ /** * Calculate all patterns of a projected database. 
* @param minCount minimum count @@ -39,12 +39,19 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable { def run( minCount: Long, maxPatternLength: Int, - prefixes: List[Int], - database: Iterable[Array[Int]]): Iterator[(List[Int], Long)] = { -if (prefixes.length == maxPatternLength || database.isEmpty) return Iterator.empty -val frequentItemAndCounts = getFreqItemAndCounts(minCount, database) -val filteredDatabase = database.map(x = x.filter(frequentItemAndCounts.contains)) -frequentItemAndCounts.iterator.flatMap { case (item, count) = + prefixes: List[Set[Int]], + database: Iterable[List[Set[Int]]]): Iterator[(List[Set[Int]], Long)] = { +if (prefixes.length == maxPatternLength || database.isEmpty) { + return Iterator.empty +} +val freqItemSetsAndCounts = getFreqItemAndCounts(minCount, database) +val freqItems = freqItemSetsAndCounts.keys.flatten.toSet +val filteredDatabase = database.map { suffix = + suffix +.map(item = freqItems.intersect(item)) +.filter(_.nonEmpty) +} +freqItemSetsAndCounts.iterator.flatMap { case (item, count) = val
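The key change in the `LocalPrefixSpan.run` hunk above is that prefixes and database entries become sequences of item *sets*, and each projected suffix is filtered by intersecting every itemset with the frequent items, dropping itemsets that become empty. A small Python sketch of that filtering step (sets of ints stand in for the Scala `Set[Int]`; `filter_database` is a hypothetical helper name):

```python
def filter_database(database, freq_items):
    """Restrict every itemset in every suffix to the frequent items,
    dropping itemsets that become empty — mirrors the filteredDatabase
    step in the diff (intersect, then filter(_.nonEmpty))."""
    out = []
    for suffix in database:
        kept = [itemset & freq_items for itemset in suffix]
        out.append([s for s in kept if s])
    return out
```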
spark git commit: [SPARK-8169] [ML] Add StopWordsRemover as a transformer
Repository: spark Updated Branches: refs/heads/master d2a9b66f6 - 876566501 [SPARK-8169] [ML] Add StopWordsRemover as a transformer jira: https://issues.apache.org/jira/browse/SPARK-8169 stop words: http://en.wikipedia.org/wiki/Stop_words StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default. Currently I used a minimum stop words set since on some [case](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html), small set of stop words is preferred. ASCII char has been tested, Yet I cannot check it in due to style check. Further thought, 1. Maybe I should use OpenHashSet. Is it recommended? 2. Currently I leave the null in input array untouched, i.e. Array(null, null) = Array(null, null). 3. If the current stop words set looks too limited, any suggestion for replacement? We can have something similar to the one in [SKlearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py). 
Author: Yuhao Yang hhb...@gmail.com Closes #6742 from hhbyyh/stopwords and squashes the following commits: fa959d8 [Yuhao Yang] separating udf f190217 [Yuhao Yang] replace default list and other small fix 04403ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into stopwords b3aa957 [Yuhao Yang] add stopWordsRemover Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/87656650 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/87656650 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/87656650 Branch: refs/heads/master Commit: 8765665015ef47a23e00f7d01d4d280c31bb236d Parents: d2a9b66 Author: Yuhao Yang hhb...@gmail.com Authored: Sat Aug 1 02:31:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Sat Aug 1 02:31:28 2015 -0700 -- .../spark/ml/feature/StopWordsRemover.scala | 155 +++ .../ml/feature/StopWordsRemoverSuite.scala | 80 ++ 2 files changed, 235 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/87656650/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala new file mode 100644 index 000..3cc4142 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.{ParamMap, BooleanParam, Param} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.types.{StringType, StructField, ArrayType, StructType} +import org.apache.spark.sql.functions.{col, udf} + +/** + * stop words list + */ +private object StopWords { + + /** + * Use the same default stopwords list as scikit-learn. + * The original list can be found from Glasgow Information Retrieval Group + * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]] + */ + val EnglishStopWords = Array( a, about, above, across, after, afterwards, again, +against, all, almost, alone, along, already, also, although, always, +am, among, amongst, amoungst, amount, an, and, another, +any, anyhow, anyone, anything, anyway, anywhere, are, +around, as, at, back, be, became, because, become, +becomes, becoming, been, before, beforehand, behind, being, +below, beside, besides, between, beyond, bill, both, +bottom, but, by, call, can, cannot, cant, co, con, +could, couldnt, cry, de, describe, detail, do, done, +down, due, during, each
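The transformer described above takes a string-array column, filters out a configurable stop-word set, and (per the PR discussion) leaves nulls in the input array untouched. A minimal Python stand-in for that per-row behavior, with a tiny subset of the English list and a case-sensitivity flag analogous to the transformer's `BooleanParam` (function and constant names here are illustrative, not Spark API):

```python
# Tiny illustrative subset of the default English stop-word list.
ENGLISH_STOP_WORDS = {"a", "about", "above", "the", "is", "and"}

def remove_stop_words(tokens, stop_words=ENGLISH_STOP_WORDS, case_sensitive=False):
    """Filter stop words out of one token array; None entries pass
    through unchanged, matching the PR's stated null handling."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t is None or t.lower() not in lowered]
```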
spark git commit: [SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code uses deprecated print statement
Repository: spark Updated Branches: refs/heads/master 815c8245f - 873ab0f96 [SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code uses deprecated print statement Use print(x) not print x for Python 3 in eval examples CC sethah mengxr -- just wanted to close this out before 1.5 Author: Sean Owen so...@cloudera.com Closes #7822 from srowen/SPARK-9490 and squashes the following commits: 01abeba [Sean Owen] Change print x to print(x) in the rest of the docs too bd7f7fb [Sean Owen] Use print(x) not print x for Python 3 in eval examples Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/873ab0f9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/873ab0f9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/873ab0f9 Branch: refs/heads/master Commit: 873ab0f9692d8ea6220abdb8d9200041068372a8 Parents: 815c824 Author: Sean Owen so...@cloudera.com Authored: Fri Jul 31 13:45:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Jul 31 13:45:28 2015 -0700 -- docs/ml-guide.md| 2 +- docs/mllib-evaluation-metrics.md| 66 docs/mllib-feature-extraction.md| 2 +- docs/mllib-statistics.md| 20 +- docs/quick-start.md | 2 +- docs/sql-programming-guide.md | 6 +-- docs/streaming-programming-guide.md | 2 +- 7 files changed, 50 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/873ab0f9/docs/ml-guide.md -- diff --git a/docs/ml-guide.md b/docs/ml-guide.md index 8c46adf..b6ca50e 100644 --- a/docs/ml-guide.md +++ b/docs/ml-guide.md @@ -561,7 +561,7 @@ test = sc.parallelize([(4L, spark i j k), prediction = model.transform(test) selected = prediction.select(id, text, prediction) for row in selected.collect(): -print row +print(row) sc.stop() {% endhighlight %} http://git-wip-us.apache.org/repos/asf/spark/blob/873ab0f9/docs/mllib-evaluation-metrics.md -- diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md index 
4ca0bb0..7066d5c 100644 --- a/docs/mllib-evaluation-metrics.md +++ b/docs/mllib-evaluation-metrics.md @@ -302,10 +302,10 @@ predictionAndLabels = test.map(lambda lp: (float(model.predict(lp.features)), lp metrics = BinaryClassificationMetrics(predictionAndLabels) # Area under precision-recall curve -print Area under PR = %s % metrics.areaUnderPR +print(Area under PR = %s % metrics.areaUnderPR) # Area under ROC curve -print Area under ROC = %s % metrics.areaUnderROC +print(Area under ROC = %s % metrics.areaUnderROC) {% endhighlight %} @@ -606,24 +606,24 @@ metrics = MulticlassMetrics(predictionAndLabels) precision = metrics.precision() recall = metrics.recall() f1Score = metrics.fMeasure() -print Summary Stats -print Precision = %s % precision -print Recall = %s % recall -print F1 Score = %s % f1Score +print(Summary Stats) +print(Precision = %s % precision) +print(Recall = %s % recall) +print(F1 Score = %s % f1Score) # Statistics by class labels = data.map(lambda lp: lp.label).distinct().collect() for label in sorted(labels): -print Class %s precision = %s % (label, metrics.precision(label)) -print Class %s recall = %s % (label, metrics.recall(label)) -print Class %s F1 Measure = %s % (label, metrics.fMeasure(label, beta=1.0)) +print(Class %s precision = %s % (label, metrics.precision(label))) +print(Class %s recall = %s % (label, metrics.recall(label))) +print(Class %s F1 Measure = %s % (label, metrics.fMeasure(label, beta=1.0))) # Weighted stats -print Weighted recall = %s % metrics.weightedRecall -print Weighted precision = %s % metrics.weightedPrecision -print Weighted F(1) Score = %s % metrics.weightedFMeasure() -print Weighted F(0.5) Score = %s % metrics.weightedFMeasure(beta=0.5) -print Weighted false positive rate = %s % metrics.weightedFalsePositiveRate +print(Weighted recall = %s % metrics.weightedRecall) +print(Weighted precision = %s % metrics.weightedPrecision) +print(Weighted F(1) Score = %s % metrics.weightedFMeasure()) +print(Weighted F(0.5) Score = 
%s % metrics.weightedFMeasure(beta=0.5)) +print(Weighted false positive rate = %s % metrics.weightedFalsePositiveRate) {% endhighlight %} /div @@ -881,28 +881,28 @@ scoreAndLabels = sc.parallelize([ metrics = MultilabelMetrics(scoreAndLabels) # Summary stats -print Recall = %s % metrics.recall() -print Precision = %s % metrics.precision() -print F1 measure = %s % metrics.f1Measure() -print Accuracy = %s % metrics.accuracy +print(Recall = %s % metrics.recall()) +print(Precision = %s % metrics.precision()) +print(F1 measure = %s
spark git commit: [SPARK-8998] [MLLIB] Distribute PrefixSpan computation for large projected databases
Repository: spark Updated Branches: refs/heads/master c5815930b - d212a3142 [SPARK-8998] [MLLIB] Distribute PrefixSpan computation for large projected databases Continuation of work by zhangjiajin Closes #7412 Author: zhangjiajin zhangjia...@huawei.com Author: Feynman Liang fli...@databricks.com Author: zhang jiajin zhangjia...@huawei.com Closes #7783 from feynmanliang/SPARK-8998-improve-distributed and squashes the following commits: a61943d [Feynman Liang] Collect small patterns to local 4ddf479 [Feynman Liang] Parallelize freqItemCounts ad23aa9 [zhang jiajin] Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal 87fa021 [Feynman Liang] Improve extend prefix readability c2caa5c [Feynman Liang] Readability improvements and comments 1235cfc [Feynman Liang] Use Iterable[Array[_]] over Array[Array[_]] for database da0091b [Feynman Liang] Use lists for prefixes to reuse data cb2a4fc [Feynman Liang] Inline code for readability 01c9ae9 [Feynman Liang] Add getters 6e149fa [Feynman Liang] Fix splitPrefixSuffixPairs 64271b3 [zhangjiajin] Modified codes according to comments. d2250b7 [zhangjiajin] remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing. b07e20c [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes 095aa3a [zhangjiajin] Modified the code according to the review comments. baa2885 [zhangjiajin] Modified the code according to the review comments. 6560c69 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixeSpan a8fde87 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark 4dd1c8a [zhangjiajin] initialize file before rebase. 078d410 [zhangjiajin] fix a scala style error. 22b0ef4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan. ca9c4c8 [zhangjiajin] Modified the code according to the review comments. 574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization. 
ba5df34 [zhangjiajin] Fix a Scala style error. 4c60fb3 [zhangjiajin] Fix some Scala style errors. 1dd33ad [zhangjiajin] Modified the code according to the review comments. 89bc368 [zhangjiajin] Fixed a Scala style error. a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala 951fd42 [zhang jiajin] Delete Prefixspan.scala 575995f [zhangjiajin] Modified the code according to the review comments. 91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d212a314 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d212a314 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d212a314 Branch: refs/heads/master Commit: d212a314227dec26c0dbec8ed3422d0ec8f818f9 Parents: c581593 Author: zhangjiajin zhangjia...@huawei.com Authored: Thu Jul 30 08:14:09 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Jul 30 08:14:09 2015 -0700 -- .../spark/mllib/fpm/LocalPrefixSpan.scala | 6 +- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 203 ++- .../spark/mllib/fpm/PrefixSpanSuite.scala | 21 +- 3 files changed, 161 insertions(+), 69 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d212a314/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala index 7ead632..0ea7920 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala @@ -40,7 +40,7 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable { minCount: Long, maxPatternLength: Int, prefixes: List[Int], - database: Array[Array[Int]]): Iterator[(List[Int], Long)] = { + database: Iterable[Array[Int]]): Iterator[(List[Int], Long)] = { if (prefixes.length == maxPatternLength || 
database.isEmpty) return Iterator.empty val frequentItemAndCounts = getFreqItemAndCounts(minCount, database) val filteredDatabase = database.map(x = x.filter(frequentItemAndCounts.contains)) @@ -67,7 +67,7 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable { } } - def project(database: Array[Array[Int]], prefix: Int): Array[Array[Int]] = { + def project(database: Iterable[Array[Int]], prefix: Int): Iterable[Array[Int]] = { database .map(getSuffix(prefix, _)) .filter(_.nonEmpty) @@ -81,7 +81,7 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable { */ private def getFreqItemAndCounts( minCount: Long
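Central to the distributed PrefixSpan refactor above is database projection: for a candidate prefix item, each sequence is cut down to its suffix after that item, and empty suffixes are discarded (the `project`/`getSuffix` pair in `LocalPrefixSpan`). A compact Python sketch of that operation over plain Int sequences:

```python
def get_suffix(prefix_item, sequence):
    """Return the suffix of `sequence` after the first occurrence of
    `prefix_item`, or an empty list if the item never occurs."""
    try:
        i = sequence.index(prefix_item)
    except ValueError:
        return []
    return sequence[i + 1:]

def project(database, prefix_item):
    # Project every sequence and keep only the non-empty suffixes,
    # as in LocalPrefixSpan.project.
    return [s for s in (get_suffix(prefix_item, seq) for seq in database) if s]
```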
spark git commit: [SPARK-7368] [MLLIB] Add QR decomposition for RowMatrix
Repository: spark Updated Branches: refs/heads/master 6175d6cfe - d31c618e3 [SPARK-7368] [MLLIB] Add QR decomposition for RowMatrix jira: https://issues.apache.org/jira/browse/SPARK-7368 Add QR decomposition for RowMatrix. I'm not sure what's the blueprint about the distributed Matrix from community and whether this will be a desirable feature , so I sent a prototype for discussion. I'll go on polish the code and provide ut and performance statistics if it's acceptable. The implementation refers to the [paper: https://www.cs.purdue.edu/homes/dgleich/publications/Benson%202013%20-%20direct-tsqr.pdf] Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data, which is a stable algorithm with good scalability. Currently I tried it on a 40 * 500 rowMatrix (16 partitions) and it can bring down the computation time from 8.8 mins (using breeze.linalg.qr.reduced) to 2.6 mins on a 4 worker cluster. I think there will still be some room for performance improvement. Any trial and suggestion is welcome. 
Author: Yuhao Yang hhb...@gmail.com Closes #5909 from hhbyyh/qrDecomposition and squashes the following commits: cec797b [Yuhao Yang] remove unnecessary qr 0fb1012 [Yuhao Yang] hierarchy R computing 3fbdb61 [Yuhao Yang] update qr to indirect and add ut 0d913d3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition 39213c3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition c0fc0c7 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition 39b0b22 [Yuhao Yang] initial draft for discussion Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d31c618e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d31c618e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d31c618e Branch: refs/heads/master Commit: d31c618e3c8838f8198556876b9dcbbbf835f7b2 Parents: 6175d6c Author: Yuhao Yang hhb...@gmail.com Authored: Thu Jul 30 07:49:10 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Jul 30 07:49:10 2015 -0700 -- .../linalg/SingularValueDecomposition.scala | 8 .../mllib/linalg/distributed/RowMatrix.scala| 46 +++- .../linalg/distributed/RowMatrixSuite.scala | 17 3 files changed, 70 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d31c618e/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala index 9669c36..b416d50 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala @@ -25,3 +25,11 @@ import org.apache.spark.annotation.Experimental */ @Experimental case class SingularValueDecomposition[UType, VType](U: UType, s: Vector, V: VType) + 
+/** + * :: Experimental :: + * Represents QR factors. + */ +@Experimental +case class QRDecomposition[UType, VType](Q: UType, R: VType) + http://git-wip-us.apache.org/repos/asf/spark/blob/d31c618e/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala index 1626da9..bfc90c9 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala @@ -22,7 +22,7 @@ import java.util.Arrays import scala.collection.mutable.ListBuffer import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV, SparseVector = BSV, axpy = brzAxpy, - svd = brzSvd} + svd = brzSvd, MatrixSingularException, inv} import breeze.numerics.{sqrt = brzSqrt} import com.github.fommil.netlib.BLAS.{getInstance = blas} @@ -498,6 +498,50 @@ class RowMatrix( } /** + * Compute QR decomposition for [[RowMatrix]]. The implementation is designed to optimize the QR + * decomposition (factorization) for the [[RowMatrix]] of a tall and skinny shape. + * Reference: + * Paul G. Constantine, David F. Gleich. Tall and skinny QR factorizations in MapReduce + * architectures ([[http://dx.doi.org/10.1145/1996092.1996103]]) + * + * @param computeQ whether to computeQ + * @return QRDecomposition(Q
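The tall-and-skinny QR idea referenced in the commit (Constantine & Gleich) factors each row block independently, stacks the small R factors, and takes QR of the stack; the final R equals the R of the whole matrix. A pure-Python sketch of that reduction — modified Gram–Schmidt stands in for the per-block factorization, assuming full column rank; this is an illustration, not the RowMatrix implementation:

```python
def qr(a):
    """Reduced QR of an m x n matrix (m >= n, full column rank) via
    modified Gram-Schmidt. Returns (q, r) with a = q @ r and r upper
    triangular with non-negative diagonal."""
    m, n = len(a), len(a[0])
    q = [[0.0] * n for _ in range(m)]
    r = [[0.0] * n for _ in range(n)]
    v = [list(col) for col in zip(*a)]  # working copy of the columns
    for j in range(n):
        r[j][j] = sum(x * x for x in v[j]) ** 0.5
        qj = [x / r[j][j] for x in v[j]]
        for i in range(m):
            q[i][j] = qj[i]
        for k in range(j + 1, n):  # orthogonalize remaining columns
            r[j][k] = sum(qj[i] * v[k][i] for i in range(m))
            v[k] = [v[k][i] - r[j][k] * qj[i] for i in range(m)]
    return q, r

def tsqr_r(blocks):
    """R factor of a tall-and-skinny matrix given as row blocks:
    QR each block, stack the small R factors, QR the stack."""
    stacked = []
    for block in blocks:
        _, r = qr(block)
        stacked.extend(r)
    _, r = qr(stacked)
    return r
```

Because each block's R is only n x n, the second-stage QR is tiny regardless of how many rows (or partitions) the original matrix has — the property the commit exploits for RowMatrix.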
spark git commit: [SPARK-] [MLLIB] minor fix on tokenizer doc
Repository: spark Updated Branches: refs/heads/master d212a3142 - 9c0501c5d [SPARK-] [MLLIB] minor fix on tokenizer doc A trivial fix for the comments of RegexTokenizer. Maybe this is too small, yet I just noticed it and think it can be quite misleading. I can create a jira if necessary. Author: Yuhao Yang hhb...@gmail.com Closes #7791 from hhbyyh/docFix and squashes the following commits: cdf2542 [Yuhao Yang] minor fix on tokenizer doc Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9c0501c5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9c0501c5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9c0501c5 Branch: refs/heads/master Commit: 9c0501c5d04d83ca25ce433138bf64df6a14dc58 Parents: d212a31 Author: Yuhao Yang hhb...@gmail.com Authored: Thu Jul 30 08:20:52 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Jul 30 08:20:52 2015 -0700 -- mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9c0501c5/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala index 0b3af47..248288c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala @@ -50,7 +50,7 @@ class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[S /** * :: Experimental :: * A regex based tokenizer that extracts tokens either by using the provided regex pattern to split - * the text (default) or repeatedly matching the regex (if `gaps` is true). + * the text (default) or repeatedly matching the regex (if `gaps` is false). * Optional parameters also allow filtering tokens using a minimal length. 
* It returns an array of strings that can be empty. */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
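The doc fix above corrects an inverted flag description: with `gaps` true the regex describes the *separators* (split mode), and with `gaps` false it describes the *tokens* themselves (match mode). A short Python sketch of the two modes, with a minimum-length filter analogous to the tokenizer's optional parameter (function name is illustrative):

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True, min_token_length=1):
    """gaps=True: split the text on the pattern (pattern = separators).
    gaps=False: repeatedly match the pattern (pattern = tokens).
    Tokens shorter than min_token_length are dropped."""
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    return [t for t in tokens if len(t) >= min_token_length]
```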
spark git commit: [SPARK-] [MLLIB] minor fix on tokenizer doc
Repository: spark
Updated Branches: refs/heads/branch-1.4 8dfdca46d -> 020dd30e5

[SPARK-] [MLLIB] minor fix on tokenizer doc

A trivial fix for the comments of RegexTokenizer. Maybe this is too small, yet I just noticed it and think it can be quite misleading. I can create a jira if necessary.

Author: Yuhao Yang hhb...@gmail.com

Closes #7791 from hhbyyh/docFix and squashes the following commits:

cdf2542 [Yuhao Yang] minor fix on tokenizer doc

(cherry picked from commit 9c0501c5d04d83ca25ce433138bf64df6a14dc58)
Signed-off-by: Xiangrui Meng m...@databricks.com

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/020dd30e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/020dd30e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/020dd30e

Branch: refs/heads/branch-1.4
Commit: 020dd30e5173d534d1a2cd5934a66f70bc764459
Parents: 8dfdca4
Author: Yuhao Yang hhb...@gmail.com
Authored: Thu Jul 30 08:20:52 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 08:21:09 2015 -0700

--
 mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/020dd30e/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
index 5f9f57a..4b1700d 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
@@ -50,7 +50,7 @@ class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[S
 /**
  * :: Experimental ::
  * A regex based tokenizer that extracts tokens either by using the provided regex pattern to split
- * the text (default) or repeatedly matching the regex (if `gaps` is true).
+ * the text (default) or repeatedly matching the regex (if `gaps` is false).
  * Optional parameters also allow filtering tokens using a minimal length.
  * It returns an array of strings that can be empty.
  */

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
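The two `gaps` modes the corrected comment describes can be illustrated without Spark at all. Below is a plain-Python sketch using the standard `re` module; `regex_tokenize` is a hypothetical helper, not pyspark's `RegexTokenizer`, but it mirrors the documented semantics: with `gaps` true the pattern is used as a delimiter (split), with `gaps` false the pattern is matched repeatedly (findall).

```python
import re

def regex_tokenize(text, pattern, gaps=True, min_token_length=1):
    """Mimic RegexTokenizer semantics: split on the pattern when gaps is
    True, repeatedly match the pattern when gaps is False."""
    if gaps:
        tokens = re.split(pattern, text)
    else:
        tokens = re.findall(pattern, text)
    # Optional filtering of short tokens, as in the Scala doc comment.
    return [t for t in tokens if len(t) >= min_token_length]

print(regex_tokenize("Te,st. punct", r"\s+", gaps=True))   # split on whitespace
print(regex_tokenize("Te,st. punct", r"\w+", gaps=False))  # repeatedly match word chars
```

The misleading part the commit fixes is exactly this inversion: matching (not splitting) is the `gaps = false` case.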
spark git commit: [SPARK-5561] [MLLIB] Generalized PeriodicCheckpointer for RDDs and Graphs
Repository: spark
Updated Branches: refs/heads/master d31c618e3 -> c5815930b

[SPARK-5561] [MLLIB] Generalized PeriodicCheckpointer for RDDs and Graphs

PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs. This PR generalizes it.

For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects. I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684]. It should be useful for other iterative algorithms as well.

Changes I made:
* Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
* Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
* The subclasses for Graphs and RDDs implement those abstract methods.
* I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.

To review this PR, I recommend doing 2 diffs:
(1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
(2) diff between the 2 test suites

CCing andrewor14 in case there are relevant changes to checkpointing. CCing feynmanliang in case you're interested in learning about checkpointing. CCing mengxr for final OK. Thanks all!

Author: Joseph K. Bradley jos...@databricks.com

Closes #7728 from jkbradley/gbt-checkpoint and squashes the following commits:

d41902c [Joseph K. Bradley] Oops, forgot to update an extra time in the checkpointer tests, after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.
32b23b8 [Joseph K. Bradley] fixed usage of checkpointer in lda
0b3dbc0 [Joseph K. Bradley] Changed checkpointer constructor not to take initial data.
568918c [Joseph K. Bradley] Generalized PeriodicGraphCheckpointer to PeriodicCheckpointer, with subclasses for RDDs and Graphs.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c5815930
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c5815930
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c5815930

Branch: refs/heads/master
Commit: c5815930be46a89469440b7c61b59764fb67a54c
Parents: d31c618
Author: Joseph K. Bradley jos...@databricks.com
Authored: Thu Jul 30 07:56:15 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 07:56:15 2015 -0700

--
 .../spark/mllib/clustering/LDAOptimizer.scala   |   6 +-
 .../spark/mllib/impl/PeriodicCheckpointer.scala | 154 +
 .../mllib/impl/PeriodicGraphCheckpointer.scala  | 105 ++-
 .../mllib/impl/PeriodicRDDCheckpointer.scala    |  97 +++
 .../impl/PeriodicGraphCheckpointerSuite.scala   |  16 +-
 .../impl/PeriodicRDDCheckpointerSuite.scala     | 173 +++
 6 files changed, 452 insertions(+), 99 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/c5815930/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala
index 7e75e70..4b90fbd 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala
@@ -142,8 +142,8 @@ final class EMLDAOptimizer extends LDAOptimizer {
     this.k = k
     this.vocabSize = docs.take(1).head._2.size
     this.checkpointInterval = lda.getCheckpointInterval
-    this.graphCheckpointer = new
-      PeriodicGraphCheckpointer[TopicCounts, TokenCount](graph, checkpointInterval)
+    this.graphCheckpointer = new PeriodicGraphCheckpointer[TopicCounts, TokenCount](
+      checkpointInterval, graph.vertices.sparkContext)
     this.globalTopicTotals = computeGlobalTopicTotals()
     this
   }
@@ -188,7 +188,7 @@ final class EMLDAOptimizer extends LDAOptimizer {
     // Update the vertex descriptors with the new counts.
     val newGraph = GraphImpl.fromExistingRDDs(docTopicDistributions, graph.edges)
     graph = newGraph
-    graphCheckpointer.updateGraph(newGraph)
+    graphCheckpointer.update(newGraph)
     globalTopicTotals = computeGlobalTopicTotals()
     this
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/c5815930/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicCheckpointer.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/impl
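The generalization described above — an abstract base class that owns the checkpoint cadence, with subclasses supplying the data-structure-specific operations — can be sketched in a few lines. This is a plain-Python sketch with hypothetical names, not the Scala `PeriodicCheckpointer`; persist/unpersist handling and the actual RDD/Graph I/O are omitted.

```python
from abc import ABC, abstractmethod

class PeriodicCheckpointer(ABC):
    """Every `interval` calls to update(), checkpoint the newest data and
    drop all but the two most recent checkpoints, mirroring the pattern
    the commit message describes for RDD lineages."""

    def __init__(self, interval):
        self.interval = interval
        self.update_count = 0
        self.checkpoint_queue = []

    @abstractmethod
    def do_checkpoint(self, data): ...

    @abstractmethod
    def remove_checkpoint(self, data): ...

    def update(self, new_data):
        self.update_count += 1
        if self.update_count % self.interval == 0:
            self.do_checkpoint(new_data)
            self.checkpoint_queue.append(new_data)
            # Older checkpoints are no longer needed once two newer
            # ones exist, so their files can be removed.
            while len(self.checkpoint_queue) > 2:
                self.remove_checkpoint(self.checkpoint_queue.pop(0))

class ListCheckpointer(PeriodicCheckpointer):
    """Toy subclass standing in for the RDD/Graph subclasses."""
    def __init__(self, interval):
        super().__init__(interval)
        self.checkpointed = []

    def do_checkpoint(self, data):
        self.checkpointed.append(data)

    def remove_checkpoint(self, data):
        self.checkpointed.remove(data)
```

With `interval = 2` and ten updates, checkpoints fire at updates 2, 4, 6, 8, 10, and only the last two survive — which is why only the abstract checkpoint/persist hooks need to change between the Graph and RDD subclasses.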
spark git commit: [SPARK-8671] [ML] Added isotonic regression to the pipeline API.
Repository: spark
Updated Branches: refs/heads/master 0dbd6963d -> 7f7a319c4

[SPARK-8671] [ML] Added isotonic regression to the pipeline API.

Author: martinzapletal zapletal-mar...@email.cz

Closes #7517 from zapletal-martin/SPARK-8671-isotonic-regression-api and squashes the following commits:

8c435c1 [martinzapletal] Review https://github.com/apache/spark/pull/7517 feedback update.
bebbb86 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
b68efc0 [martinzapletal] Added tests for param validation.
07c12bd [martinzapletal] Comments and refactoring.
834fcf7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
b611fee [martinzapletal] SPARK-8671. Added first version of isotonic regression to pipeline API

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f7a319c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f7a319c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f7a319c

Branch: refs/heads/master
Commit: 7f7a319c4ce07f07a6bd68100cf0a4f1da66269e
Parents: 0dbd696
Author: martinzapletal zapletal-mar...@email.cz
Authored: Thu Jul 30 15:57:14 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 15:57:14 2015 -0700

--
 .../ml/regression/IsotonicRegression.scala      | 144 ++
 .../ml/regression/IsotonicRegressionSuite.scala | 148 +++
 2 files changed, 292 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/7f7a319c/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala
new file mode 100644
index 000..4ece8cf
--- /dev/null
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.param.{Param, ParamMap, BooleanParam}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.mllib.regression.{IsotonicRegression => MLlibIsotonicRegression}
+import org.apache.spark.mllib.regression.{IsotonicRegressionModel => MLlibIsotonicRegressionModel}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.types.{DoubleType, DataType}
+import org.apache.spark.sql.{Row, DataFrame}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for isotonic regression.
+ */
+private[regression] trait IsotonicRegressionParams extends PredictorParams {
+
+  /**
+   * Param for weight column name.
+   * TODO: Move weightCol to sharedParams.
+   *
+   * @group param
+   */
+  final val weightCol: Param[String] =
+    new Param[String](this, "weightCol", "weight column name")
+
+  /** @group getParam */
+  final def getWeightCol: String = $(weightCol)
+
+  /**
+   * Param for isotonic parameter.
+   * Isotonic (increasing) or antitonic (decreasing) sequence.
+   * @group param
+   */
+  final val isotonic: BooleanParam =
+    new BooleanParam(this, "isotonic", "isotonic (increasing) or antitonic (decreasing) sequence")
+
+  /** @group getParam */
+  final def getIsotonicParam: Boolean = $(isotonic)
+}
+
+/**
+ * :: Experimental ::
+ * Isotonic regression.
+ *
+ * Currently implemented using parallelized pool adjacent violators algorithm.
+ * Only univariate (single feature) algorithm supported.
+ *
+ * Uses [[org.apache.spark.mllib.regression.IsotonicRegression]].
+ */
+@Experimental
+class IsotonicRegression(override val uid: String)
+  extends Regressor[Double, IsotonicRegression, IsotonicRegressionModel]
+  with IsotonicRegressionParams {
+
+  def this() = this(Identifiable.randomUID("isoReg"))
+
+  /**
+   * Set the isotonic parameter
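The fit behind this estimator is the pool adjacent violators algorithm (PAVA); MLlib runs a parallelized variant, but the sequential idea is short enough to sketch. This is a plain-Python illustration, not the Spark implementation: scan left to right, and whenever a new point violates monotonicity, merge it with the preceding block into its weighted mean, repeating until the block sequence is non-decreasing.

```python
def pava(y, weights=None):
    """Sequential pool adjacent violators: return the non-decreasing
    sequence minimizing weighted squared error to y."""
    if weights is None:
        weights = [1.0] * len(y)
    blocks = []  # each block is [mean, total_weight, count]
    for yi, wi in zip(y, weights):
        blocks.append([yi, wi, 1])
        # Merge adjacent blocks while the last one violates monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])
    out = []
    for mean, _, count in blocks:
        out.extend([mean] * count)
    return out

print(pava([1.0, 3.0, 2.0, 4.0]))  # [1.0, 2.5, 2.5, 4.0]
```

For the antitonic case (`isotonic = false`), a standard trick is to run the same algorithm on the negated labels and negate the result; whether MLlib does exactly that internally is not shown in this diff.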
spark git commit: [SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula
Repository: spark
Updated Branches: refs/heads/master be7be6d4c -> e7905a939

[SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula

Preview:
```
summary(m)
            features coefficients
1        (Intercept)    1.6765001
2       Sepal_Length    0.3498801
3 Species.versicolor   -0.9833885
4  Species.virginica   -1.0075104
```

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

cc mengxr

Author: Eric Liang e...@databricks.com

Closes #7771 from ericl/summary and squashes the following commits:

ccd54c3 [Eric Liang] second pass
a5ca93b [Eric Liang] comments
2772111 [Eric Liang] clean up
70483ef [Eric Liang] fix test
7c247d4 [Eric Liang] Merge branch 'master' into summary
3c55024 [Eric Liang] working
8c539aa [Eric Liang] first pass

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e7905a93
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e7905a93
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e7905a93

Branch: refs/heads/master
Commit: e7905a9395c1a002f50bab29e16a729e14d4ed6f
Parents: be7be6d
Author: Eric Liang e...@databricks.com
Authored: Thu Jul 30 16:15:43 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 16:15:43 2015 -0700

--
 R/pkg/NAMESPACE                                 |  3 ++-
 R/pkg/R/mllib.R                                 | 26 +++
 R/pkg/inst/tests/test_mllib.R                   | 11 
 .../apache/spark/ml/feature/OneHotEncoder.scala | 12 -
 .../org/apache/spark/ml/feature/RFormula.scala  | 12 -
 .../org/apache/spark/ml/r/SparkRWrappers.scala  | 27 ++--
 .../spark/ml/regression/LinearRegression.scala  |  8 --
 .../spark/ml/feature/OneHotEncoderSuite.scala   |  8 +++---
 .../apache/spark/ml/feature/RFormulaSuite.scala | 18 +
 9 files changed, 108 insertions(+), 17 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/e7905a93/R/pkg/NAMESPACE

diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 7f7a8a2..a329e14 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -12,7 +12,8 @@ export(print.jobj)

 # MLlib integration
 exportMethods("glm",
-              "predict")
+              "predict",
+              "summary")

 # Job group lifecycle management methods
 export(setJobGroup,

http://git-wip-us.apache.org/repos/asf/spark/blob/e7905a93/R/pkg/R/mllib.R

diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 6a8baca..efddcc1 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -71,3 +71,29 @@ setMethod("predict", signature(object = "PipelineModel"),
           function(object, newData) {
             return(dataFrame(callJMethod(object@model, "transform", newData@sdf)))
           })
+
+#' Get the summary of a model
+#'
+#' Returns the summary of a model produced by glm(), similarly to R's summary().
+#'
+#' @param model A fitted MLlib model
+#' @return a list with a 'coefficient' component, which is the matrix of coefficients. See
+#'         summary.glm for more information.
+#' @rdname glm
+#' @export
+#' @examples
+#' \dontrun{
+#' model <- glm(y ~ x, trainingData)
+#' summary(model)
+#'}
+setMethod("summary", signature(object = "PipelineModel"),
+          function(object) {
+            features <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
+                                    "getModelFeatures", object@model)
+            weights <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
+                                   "getModelWeights", object@model)
+            coefficients <- as.matrix(unlist(weights))
+            colnames(coefficients) <- c("Estimate")
+            rownames(coefficients) <- unlist(features)
+            return(list(coefficients = coefficients))
+          })

http://git-wip-us.apache.org/repos/asf/spark/blob/e7905a93/R/pkg/inst/tests/test_mllib.R

diff --git a/R/pkg/inst/tests/test_mllib.R b/R/pkg/inst/tests/test_mllib.R
index 3bef693..f272de7 100644
--- a/R/pkg/inst/tests/test_mllib.R
+++ b/R/pkg/inst/tests/test_mllib.R
@@ -48,3 +48,14 @@ test_that("dot minus and intercept vs native glm", {
   rVals <- predict(glm(Sepal.Width ~ . - Species + 0, data = iris), iris)
   expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
 })
+
+test_that("summary coefficients match with native glm", {
+  training <- createDataFrame(sqlContext, iris)
+  stats <- summary(glm(Sepal_Width ~ Sepal_Length + Species, data = training))
+  coefs <- as.vector(stats$coefficients)
+  rCoefs <- as.vector(coef(glm(Sepal.Width ~ Sepal.Length + Species, data = iris)))
+  expect_true(all
spark git commit: [SPARK-9225] [MLLIB] LDASuite needs unit tests for empty documents
Repository: spark
Updated Branches: refs/heads/master 9c0501c5d -> a6e53a9c8

[SPARK-9225] [MLLIB] LDASuite needs unit tests for empty documents

Add unit tests for running LDA with empty documents. Both EMLDAOptimizer and OnlineLDAOptimizer are tested.

feynmanliang

Author: Meihua Wu meihu...@umich.edu

Closes #7620 from rotationsymmetry/SPARK-9225 and squashes the following commits:

3ed7c88 [Meihua Wu] Incorporate reviewer's further comments
f9432e8 [Meihua Wu] Incorporate reviewer's comments
8e1b9ec [Meihua Wu] Merge remote-tracking branch 'upstream/master' into SPARK-9225
ad55665 [Meihua Wu] Add unit tests for running LDA with empty documents

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6e53a9c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6e53a9c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6e53a9c

Branch: refs/heads/master
Commit: a6e53a9c8b24326d1b6dca7a0e36ce6c643daa77
Parents: 9c0501c
Author: Meihua Wu meihu...@umich.edu
Authored: Thu Jul 30 08:52:01 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 08:52:01 2015 -0700

--
 .../spark/mllib/clustering/LDASuite.scala | 40 
 1 file changed, 40 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/a6e53a9c/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala

diff --git a/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala
index b91c7ce..61d2edf 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala
@@ -390,6 +390,46 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext {
     }
   }

+  test("EMLDAOptimizer with empty docs") {
+    val vocabSize = 6
+    val emptyDocsArray = Array.fill(6)(Vectors.sparse(vocabSize, Array.empty, Array.empty))
+    val emptyDocs = emptyDocsArray
+      .zipWithIndex.map { case (wordCounts, docId) =>
+        (docId.toLong, wordCounts)
+    }
+    val distributedEmptyDocs = sc.parallelize(emptyDocs, 2)
+
+    val op = new EMLDAOptimizer()
+    val lda = new LDA()
+      .setK(3)
+      .setMaxIterations(5)
+      .setSeed(12345)
+      .setOptimizer(op)
+
+    val model = lda.run(distributedEmptyDocs)
+    assert(model.vocabSize === vocabSize)
+  }
+
+  test("OnlineLDAOptimizer with empty docs") {
+    val vocabSize = 6
+    val emptyDocsArray = Array.fill(6)(Vectors.sparse(vocabSize, Array.empty, Array.empty))
+    val emptyDocs = emptyDocsArray
+      .zipWithIndex.map { case (wordCounts, docId) =>
+        (docId.toLong, wordCounts)
+    }
+    val distributedEmptyDocs = sc.parallelize(emptyDocs, 2)
+
+    val op = new OnlineLDAOptimizer()
+    val lda = new LDA()
+      .setK(3)
+      .setMaxIterations(5)
+      .setSeed(12345)
+      .setOptimizer(op)
+
+    val model = lda.run(distributedEmptyDocs)
+    assert(model.vocabSize === vocabSize)
+  }
+
 }

 private[clustering] object LDASuite {
spark git commit: [SPARK-9277] [MLLIB] SparseVector constructor must throw an error when declared number of elements less than array length
Repository: spark
Updated Branches: refs/heads/master a6e53a9c8 -> ed3cb1d21

[SPARK-9277] [MLLIB] SparseVector constructor must throw an error when declared number of elements less than array length

Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.

CC MechCoder jkbradley -- I am not sure if a change needs to also happen in the Python API? I didn't see it had any similar checks to begin with, but I don't know it well.

Author: Sean Owen so...@cloudera.com

Closes #7794 from srowen/SPARK-9277 and squashes the following commits:

e8dc31e [Sean Owen] Fix scalastyle
6ffe34a [Sean Owen] Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ed3cb1d2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ed3cb1d2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ed3cb1d2

Branch: refs/heads/master
Commit: ed3cb1d21c73645c8f6e6ee08181f876fc192e41
Parents: a6e53a9
Author: Sean Owen so...@cloudera.com
Authored: Thu Jul 30 09:19:55 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 09:19:55 2015 -0700

--
 .../org/apache/spark/mllib/linalg/Vectors.scala      |  2 ++
 .../org/apache/spark/mllib/linalg/VectorsSuite.scala | 15 +++
 2 files changed, 17 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ed3cb1d2/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
index 0cb28d7..23c2c16 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
@@ -637,6 +637,8 @@ class SparseVector(
   require(indices.length == values.length, "Sparse vectors require that the dimension of the" +
     s" indices match the dimension of the values. You provided ${indices.length} indices and " +
     s" ${values.length} values.")
+  require(indices.length <= size, s"You provided ${indices.length} indices and values, " +
+    s"which exceeds the specified vector size ${size}.")

   override def toString: String =
     s"($size,${indices.mkString("[", ",", "]")},${values.mkString("[", ",", "]")})"

http://git-wip-us.apache.org/repos/asf/spark/blob/ed3cb1d2/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala

diff --git a/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala
index 03be411..1c37ea5 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala
@@ -57,6 +57,21 @@ class VectorsSuite extends SparkFunSuite with Logging {
     assert(vec.values === values)
   }

+  test("sparse vector construction with mismatched indices/values array") {
+    intercept[IllegalArgumentException] {
+      Vectors.sparse(4, Array(1, 2, 3), Array(3.0, 5.0, 7.0, 9.0))
+    }
+    intercept[IllegalArgumentException] {
+      Vectors.sparse(4, Array(1, 2, 3), Array(3.0, 5.0))
+    }
+  }
+
+  test("sparse vector construction with too many indices vs size") {
+    intercept[IllegalArgumentException] {
+      Vectors.sparse(3, Array(1, 2, 3, 4), Array(3.0, 5.0, 7.0, 9.0))
+    }
+  }
+
   test("dense to array") {
     val vec = Vectors.dense(arr).asInstanceOf[DenseVector]
     assert(vec.toArray.eq(arr))
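The two constructor checks this commit adds are easy to mirror in a small sketch. `SparseVec` below is a hypothetical plain-Python class, not pyspark's `SparseVector`; it only demonstrates the validation rules: indices and values must have equal length, and their length must not exceed the declared size.

```python
class SparseVec:
    """Minimal sparse vector with the validation described in SPARK-9277."""
    def __init__(self, size, indices, values):
        if len(indices) != len(values):
            raise ValueError(
                f"Got {len(indices)} indices but {len(values)} values; "
                "they must have the same length.")
        if len(indices) > size:
            raise ValueError(
                f"Got {len(indices)} indices/values, which exceeds the "
                f"declared vector size {size}.")
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

# Mirrors the new test cases: a valid vector, then one rejected for
# declaring fewer elements than it carries.
ok = SparseVec(4, [1, 2, 3], [3.0, 5.0, 7.0])
try:
    SparseVec(3, [1, 2, 3, 4], [3.0, 5.0, 7.0, 9.0])
except ValueError as e:
    print("rejected:", e)
```

Failing fast in the constructor is the point of the change: an inconsistent sparse vector would otherwise surface much later as an index-out-of-bounds deep inside a linear algebra routine.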
spark git commit: [MINOR] [MLLIB] fix doc for RegexTokenizer
Repository: spark
Updated Branches: refs/heads/master ed3cb1d21 -> 81464f2a8

[MINOR] [MLLIB] fix doc for RegexTokenizer

This is #7791 for Python. hhbyyh

Author: Xiangrui Meng m...@databricks.com

Closes #7798 from mengxr/regex-tok-py and squashes the following commits:

baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/81464f2a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/81464f2a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/81464f2a

Branch: refs/heads/master
Commit: 81464f2a8243c6ae2a39bac7ebdc50d4f60af451
Parents: ed3cb1d
Author: Xiangrui Meng m...@databricks.com
Authored: Thu Jul 30 09:45:17 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 09:45:17 2015 -0700

--
 python/pyspark/ml/feature.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/81464f2a/python/pyspark/ml/feature.py

diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 86e654d..015e7a9 100644
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -525,7 +525,7 @@ class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol):

     A regex based tokenizer that extracts tokens either by using the
     provided regex pattern (in Java dialect) to split the text
-    (default) or repeatedly matching the regex (if gaps is true).
+    (default) or repeatedly matching the regex (if gaps is false).
     Optional parameters also allow filtering tokens using a minimal
     length.
     It returns an array of strings that can be empty.
spark git commit: [MINOR] [MLLIB] fix doc for RegexTokenizer
Repository: spark
Updated Branches: refs/heads/branch-1.4 020dd30e5 -> 6e85064f4

[MINOR] [MLLIB] fix doc for RegexTokenizer

This is #7791 for Python. hhbyyh

Author: Xiangrui Meng m...@databricks.com

Closes #7798 from mengxr/regex-tok-py and squashes the following commits:

baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer

(cherry picked from commit 81464f2a8243c6ae2a39bac7ebdc50d4f60af451)
Signed-off-by: Xiangrui Meng m...@databricks.com

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6e85064f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6e85064f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6e85064f

Branch: refs/heads/branch-1.4
Commit: 6e85064f416bf647ea463bffa621367647862c61
Parents: 020dd30
Author: Xiangrui Meng m...@databricks.com
Authored: Thu Jul 30 09:45:17 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 09:45:41 2015 -0700

--
 python/pyspark/ml/feature.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/6e85064f/python/pyspark/ml/feature.py

diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index ddb33f4..7432108 100644
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -456,7 +456,7 @@ class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol):

     A regex based tokenizer that extracts tokens either by using the
     provided regex pattern (in Java dialect) to split the text
-    (default) or repeatedly matching the regex (if gaps is true).
+    (default) or repeatedly matching the regex (if gaps is false).
     Optional parameters also allow filtering tokens using a minimal
     length.
     It returns an array of strings that can be empty.
spark git commit: [SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg
Repository: spark
Updated Branches: refs/heads/master 1afdeb7b4 -> ca71cc8c8

[SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg

This is based on MechCoder 's PR https://github.com/apache/spark/pull/7731. Hopefully it could pass tests. MechCoder I tried to make minimal changes. If this passes Jenkins, we can merge this one first and then try to move `__init__.py` to `local.py` in a separate PR.

Closes #7731

Author: Xiangrui Meng m...@databricks.com

Closes #7746 from mengxr/SPARK-9408 and squashes the following commits:

0e05a3b [Xiangrui Meng] merge master
1135551 [Xiangrui Meng] add a comment for str(...)
c48cae0 [Xiangrui Meng] update tests
173a805 [Xiangrui Meng] move linalg.py to linalg/__init__.py

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca71cc8c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca71cc8c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca71cc8c

Branch: refs/heads/master
Commit: ca71cc8c8b2d64b7756ae697c06876cd18b536dc
Parents: 1afdeb7
Author: Xiangrui Meng m...@databricks.com
Authored: Thu Jul 30 16:57:38 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 16:57:38 2015 -0700

--
 dev/sparktestsupport/modules.py         |    2 +-
 python/pyspark/mllib/linalg.py          | 1162 --
 python/pyspark/mllib/linalg/__init__.py | 1162 ++
 python/pyspark/sql/types.py             |    2 +-
 4 files changed, 1164 insertions(+), 1164 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ca71cc8c/dev/sparktestsupport/modules.py

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 030d982..44600cb 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -323,7 +323,7 @@ pyspark_mllib = Module(
         "pyspark.mllib.evaluation",
         "pyspark.mllib.feature",
         "pyspark.mllib.fpm",
-        "pyspark.mllib.linalg",
+        "pyspark.mllib.linalg.__init__",
         "pyspark.mllib.random",
         "pyspark.mllib.recommendation",
         "pyspark.mllib.regression",

http://git-wip-us.apache.org/repos/asf/spark/blob/ca71cc8c/python/pyspark/mllib/linalg.py

diff --git a/python/pyspark/mllib/linalg.py b/python/pyspark/mllib/linalg.py
deleted file mode 100644
index 334dc8e..000
--- a/python/pyspark/mllib/linalg.py
+++ /dev/null
@@ -1,1162 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-MLlib utilities for linear algebra. For dense vectors, MLlib
-uses the NumPy C{array} type, so you can simply pass NumPy arrays
-around. For sparse vectors, users can construct a L{SparseVector}
-object from MLlib or pass SciPy C{scipy.sparse} column vectors if
-SciPy is available in their environment.
-"""
-
-import sys
-import array
-
-if sys.version >= '3':
-    basestring = str
-    xrange = range
-    import copyreg as copy_reg
-    long = int
-else:
-    from itertools import izip as zip
-    import copy_reg
-
-import numpy as np
-
-from pyspark.sql.types import UserDefinedType, StructField, StructType, ArrayType, DoubleType, \
-    IntegerType, ByteType, BooleanType
-
-
-__all__ = ['Vector', 'DenseVector', 'SparseVector', 'Vectors',
-           'Matrix', 'DenseMatrix', 'SparseMatrix', 'Matrices']
-
-
-if sys.version_info[:2] == (2, 7):
-    # speed up pickling array in Python 2.7
-    def fast_pickle_array(ar):
-        return array.array, (ar.typecode, ar.tostring())
-    copy_reg.pickle(array.array, fast_pickle_array)
-
-
-# Check whether we have SciPy. MLlib works without it too, but if we have it, some methods,
-# such as _dot and _serialize_double_vector, start to support scipy.sparse matrices.
-
-try:
-    import scipy.sparse
-    _have_scipy = True
-except:
-    # No SciPy in environment, but that's okay
-    _have_scipy = False
-
-
-def _convert_to_vector(l):
-    if isinstance(l, Vector):
-        return l
-    elif type(l