svn commit: r1696648 - in /spark: mllib/index.md site/mllib/index.html
Author: meng
Date: Wed Aug 19 19:11:08 2015
New Revision: 1696648

URL: http://svn.apache.org/r1696648
Log: update MLlib page for 1.5

Modified:
    spark/mllib/index.md
    spark/site/mllib/index.html

Modified: spark/mllib/index.md
URL: http://svn.apache.org/viewvc/spark/mllib/index.md?rev=1696648&r1=1696647&r2=1696648&view=diff
==============================================================================
--- spark/mllib/index.md (original)
+++ spark/mllib/index.md Wed Aug 19 19:11:08 2015
@@ -14,7 +14,7 @@ subproject: MLlib
   <div class="col-md-7 col-sm-7">
     <h2>Ease of Use</h2>
     <p class="lead">
-      Usable in Java, Scala and Python.
+      Usable in Java, Scala, Python, and SparkR.
     </p>
     <p>
       MLlib fits into <a href="{{site.url}}">Spark</a>'s
@@ -83,22 +83,25 @@ subproject: MLlib
   <div class="col-md-4 col-padded">
     <h3>Algorithms</h3>
     <p>
-      MLlib 1.3 contains the following algorithms:
+      MLlib contains the following algorithms and utilities:
     </p>
     <ul class="list-narrow">
-      <li>linear SVM and logistic regression</li>
+      <li>logistic regression and linear support vector machine (SVM)</li>
       <li>classification and regression tree</li>
       <li>random forest and gradient-boosted trees</li>
-      <li>recommendation via alternating least squares</li>
-      <li>clustering via k-means, Gaussian mixtures, and power iteration clustering</li>
-      <li>topic modeling via latent Dirichlet allocation</li>
-      <li>singular value decomposition</li>
-      <li>linear regression with L<sub>1</sub>- and L<sub>2</sub>-regularization</li>
+      <li>recommendation via alternating least squares (ALS)</li>
+      <li>clustering via k-means, Gaussian mixtures (GMM), and power iteration clustering</li>
+      <li>topic modeling via latent Dirichlet allocation (LDA)</li>
+      <li>singular value decomposition (SVD) and QR decomposition</li>
+      <li>principal component analysis (PCA)</li>
+      <li>linear regression with L<sub>1</sub>, L<sub>2</sub>, and elastic-net regularization</li>
       <li>isotonic regression</li>
-      <li>multinomial naive Bayes</li>
-      <li>frequent itemset mining via FP-growth</li>
-      <li>basic statistics</li>
+      <li>multinomial/binomial naive Bayes</li>
+      <li>frequent itemset mining via FP-growth and association rules</li>
+      <li>sequential pattern mining via PrefixSpan</li>
+      <li>summary statistics and hypothesis testing</li>
       <li>feature transformations</li>
+      <li>model evaluation and hyper-parameter tuning</li>
     </ul>
     <p>Refer to the <a href="{{site.url}}docs/latest/mllib-guide.html">MLlib guide</a> for usage examples.</p>
   </div>

Modified: spark/site/mllib/index.html
URL: http://svn.apache.org/viewvc/spark/site/mllib/index.html?rev=1696648&r1=1696647&r2=1696648&view=diff
==============================================================================
--- spark/site/mllib/index.html (original)
+++ spark/site/mllib/index.html Wed Aug 19 19:11:08 2015
@@ -178,7 +178,7 @@
   <div class="col-md-7 col-sm-7">
     <h2>Ease of Use</h2>
     <p class="lead">
-      Usable in Java, Scala and Python.
+      Usable in Java, Scala, Python, and SparkR.
     </p>
     <p>
       MLlib fits into <a href="/">Spark</a>'s
@@ -250,22 +250,25 @@
   <div class="col-md-4 col-padded">
     <h3>Algorithms</h3>
     <p>
-      MLlib 1.3 contains the following algorithms:
+      MLlib contains the following algorithms and utilities:
    </p>
    <ul class="list-narrow">
-      <li>linear SVM and logistic regression</li>
+      <li>logistic regression and linear support vector machine (SVM)</li>
      <li>classification and regression tree</li>
      <li>random forest and gradient-boosted trees</li>
-      <li>recommendation via alternating least squares</li>
-      <li>clustering via k-means, Gaussian mixtures, and power iteration clustering</li>
-      <li>topic modeling via latent Dirichlet allocation</li>
-      <li>singular value decomposition</li>
-      <li>linear regression with L<sub>1</sub>- and L<sub>2</sub>-regularization</li>
+      <li>recommendation via alternating least squares (ALS)</li>
+      <li>clustering via k-means, Gaussian mixtures (GMM), and power iteration clustering</li>
+      <li>topic modeling via latent Dirichlet allocation (LDA)</li>
+      <li>singular value decomposition (SVD) and QR decomposition</li>
+      <li>principal component analysis (PCA)</li>
+      <li>linear regression with L<sub>1</sub>, L<sub>2</sub>, and elastic-net regularization</li>
      <li>isotonic regression</li>
-      <li>multinomial naive Bayes</li>
-      <li>frequent itemset mining via FP-growth</li>
-      <li>basic statistics</li>
+      <li>multinomial/binomial naive Bayes</li>
+      <li>frequent itemset mining via FP-growth and association rules</li>
+      <li>sequential pattern mining via PrefixSpan</li>
+      <li>summary statistics and hypothesis testing</li>
      <li>feature transformations</li>
+      <li>model evaluation and hyper-parameter tuning</li>
    </ul>
    <p>Refer to the <a href="/docs/latest/mllib-guide.html">MLlib guide</a> for usage examples.</p>
  </div>

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr
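The updated list above mentions linear regression with L1, L2, and elastic-net regularization. As a quick illustration (plain Python, not MLlib code), the elastic-net penalty blends the two norms via a mixing parameter; the `reg_param`/`alpha` names below are ours, chosen to mirror MLlib's `regParam`/`elasticNetParam`:

```python
def elastic_net_penalty(weights, reg_param, alpha):
    """Elastic-net penalty: reg_param * (alpha * L1 + (1 - alpha) * L2/2).

    alpha = 1.0 recovers the pure L1 (lasso) penalty, alpha = 0.0 the pure
    L2 (ridge) penalty; values in between blend the two.
    """
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights) / 2.0
    return reg_param * (alpha * l1 + (1.0 - alpha) * l2)

w = [1.0, -2.0, 0.5]
print(elastic_net_penalty(w, reg_param=0.1, alpha=1.0))  # pure L1: 0.1 * 3.5 = 0.35
print(elastic_net_penalty(w, reg_param=0.1, alpha=0.0))  # pure L2: 0.1 * 2.625 = 0.2625
```

Any alpha strictly between 0 and 1 gives a penalty between these two extremes.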
spark git commit: [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering
Repository: spark
Updated Branches:
  refs/heads/master d898c33f7 -> 5b62bef8c

[SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering

This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659).

MechCoder

Closes #8256

Author: Xiangrui Meng <m...@databricks.com>
Author: Xiaoqing Wang <spark...@126.com>
Author: MechCoder <manojkumarsivaraj...@gmail.com>

Closes #8288 from mengxr/SPARK-8918.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5b62bef8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5b62bef8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5b62bef8

Branch: refs/heads/master
Commit: 5b62bef8cbf73f910513ef3b1f557aa94b384854
Parents: d898c33
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 19 13:17:26 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 19 13:17:26 2015 -0700

----------------------------------------------------------------------
 .../mllib/clustering/GaussianMixture.scala      | 56 +++
 .../mllib/clustering/GaussianMixtureModel.scala | 32 +++--
 .../apache/spark/mllib/clustering/KMeans.scala  | 36 +-
 .../spark/mllib/clustering/KMeansModel.scala    | 37 --
 .../org/apache/spark/mllib/clustering/LDA.scala | 71 +---
 .../spark/mllib/clustering/LDAModel.scala       | 64 --
 .../spark/mllib/clustering/LDAOptimizer.scala   | 12 +++-
 .../clustering/PowerIterationClustering.scala   | 29 +++-
 .../mllib/clustering/StreamingKMeans.scala      | 53 ---
 9 files changed, 338 insertions(+), 52 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/5b62bef8/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index e459367..bc27b1f 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -62,6 +62,7 @@ class GaussianMixture private (
   /**
    * Constructs a default instance. The default parameters are {k: 2, convergenceTol: 0.01,
    * maxIterations: 100, seed: random}.
+   * @since 1.3.0
    */
   def this() = this(2, 0.01, 100, Utils.random.nextLong())

@@ -72,9 +73,11 @@ class GaussianMixture private (
   // default random starting point
   private var initialModel: Option[GaussianMixtureModel] = None

-  /** Set the initial GMM starting point, bypassing the random initialization.
-   * You must call setK() prior to calling this method, and the condition
-   * (model.k == this.k) must be met; failure will result in an IllegalArgumentException
+  /**
+   * Set the initial GMM starting point, bypassing the random initialization.
+   * You must call setK() prior to calling this method, and the condition
+   * (model.k == this.k) must be met; failure will result in an IllegalArgumentException
+   * @since 1.3.0
    */
   def setInitialModel(model: GaussianMixtureModel): this.type = {
     if (model.k == k) {
@@ -85,30 +88,46 @@ class GaussianMixture private (
     this
   }

-  /** Return the user supplied initial GMM, if supplied */
+  /**
+   * Return the user supplied initial GMM, if supplied
+   * @since 1.3.0
+   */
   def getInitialModel: Option[GaussianMixtureModel] = initialModel

-  /** Set the number of Gaussians in the mixture model. Default: 2 */
+  /**
+   * Set the number of Gaussians in the mixture model. Default: 2
+   * @since 1.3.0
+   */
   def setK(k: Int): this.type = {
     this.k = k
     this
   }

-  /** Return the number of Gaussians in the mixture model */
+  /**
+   * Return the number of Gaussians in the mixture model
+   * @since 1.3.0
+   */
   def getK: Int = k

-  /** Set the maximum number of iterations to run. Default: 100 */
+  /**
+   * Set the maximum number of iterations to run. Default: 100
+   * @since 1.3.0
+   */
   def setMaxIterations(maxIterations: Int): this.type = {
     this.maxIterations = maxIterations
     this
   }

-  /** Return the maximum number of iterations to run */
+  /**
+   * Return the maximum number of iterations to run
+   * @since 1.3.0
+   */
   def getMaxIterations: Int = maxIterations

   /**
    * Set the largest change in log-likelihood at which convergence is
    * considered to have occurred.
+   * @since 1.3.0
    */
   def setConvergenceTol(convergenceTol: Double): this.type = {
     this.convergenceTol = convergenceTol
@@ -118,19 +137,29 @@ class GaussianMixture private (

   /**
    * Return the largest
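The `convergenceTol` parameter documented in the diff above controls when Gaussian mixture EM stops. A rough pure-Python sketch of that stopping rule (illustrative only, not the Scala implementation; the log-likelihood values below are made up):

```python
def has_converged(log_likelihood, prev_log_likelihood, convergence_tol=0.01):
    """Stop when the change in log-likelihood between EM iterations
    falls below the tolerance, in the style of setConvergenceTol."""
    return abs(log_likelihood - prev_log_likelihood) < convergence_tol

# Skeleton of an EM-style loop over a fabricated log-likelihood trace.
ll_trace = [-120.0, -80.0, -79.995]
prev = float("-inf")
iters = 0
for ll in ll_trace:
    if has_converged(ll, prev):
        break
    prev = ll
    iters += 1
print(iters)  # 2: the third value improves by only 0.005 < 0.01, so EM stops
```

A smaller tolerance trades extra iterations for a tighter fit; the default shown in the diff is 0.01.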
spark git commit: [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 ba369258d -> 8c0a5a248

[SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering

This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659).

MechCoder

Closes #8256

Author: Xiangrui Meng <m...@databricks.com>
Author: Xiaoqing Wang <spark...@126.com>
Author: MechCoder <manojkumarsivaraj...@gmail.com>

Closes #8288 from mengxr/SPARK-8918.

(cherry picked from commit 5b62bef8cbf73f910513ef3b1f557aa94b384854)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8c0a5a24
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8c0a5a24
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8c0a5a24

Branch: refs/heads/branch-1.5
Commit: 8c0a5a2485d899e9a58d431b395d2a3f3bf4c5c1
Parents: ba36925
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 19 13:17:26 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 19 13:17:34 2015 -0700

----------------------------------------------------------------------
 .../mllib/clustering/GaussianMixture.scala      | 56 +++
 .../mllib/clustering/GaussianMixtureModel.scala | 32 +++--
 .../apache/spark/mllib/clustering/KMeans.scala  | 36 +-
 .../spark/mllib/clustering/KMeansModel.scala    | 37 --
 .../org/apache/spark/mllib/clustering/LDA.scala | 71 +---
 .../spark/mllib/clustering/LDAModel.scala       | 64 --
 .../spark/mllib/clustering/LDAOptimizer.scala   | 12 +++-
 .../clustering/PowerIterationClustering.scala   | 29 +++-
 .../mllib/clustering/StreamingKMeans.scala      | 53 ---
 9 files changed, 338 insertions(+), 52 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/8c0a5a24/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
index e459367..bc27b1f 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala
@@ -62,6 +62,7 @@ class GaussianMixture private (
   /**
    * Constructs a default instance. The default parameters are {k: 2, convergenceTol: 0.01,
    * maxIterations: 100, seed: random}.
+   * @since 1.3.0
    */
   def this() = this(2, 0.01, 100, Utils.random.nextLong())

@@ -72,9 +73,11 @@ class GaussianMixture private (
   // default random starting point
   private var initialModel: Option[GaussianMixtureModel] = None

-  /** Set the initial GMM starting point, bypassing the random initialization.
-   * You must call setK() prior to calling this method, and the condition
-   * (model.k == this.k) must be met; failure will result in an IllegalArgumentException
+  /**
+   * Set the initial GMM starting point, bypassing the random initialization.
+   * You must call setK() prior to calling this method, and the condition
+   * (model.k == this.k) must be met; failure will result in an IllegalArgumentException
+   * @since 1.3.0
    */
   def setInitialModel(model: GaussianMixtureModel): this.type = {
     if (model.k == k) {
@@ -85,30 +88,46 @@ class GaussianMixture private (
     this
   }

-  /** Return the user supplied initial GMM, if supplied */
+  /**
+   * Return the user supplied initial GMM, if supplied
+   * @since 1.3.0
+   */
   def getInitialModel: Option[GaussianMixtureModel] = initialModel

-  /** Set the number of Gaussians in the mixture model. Default: 2 */
+  /**
+   * Set the number of Gaussians in the mixture model. Default: 2
+   * @since 1.3.0
+   */
   def setK(k: Int): this.type = {
     this.k = k
     this
   }

-  /** Return the number of Gaussians in the mixture model */
+  /**
+   * Return the number of Gaussians in the mixture model
+   * @since 1.3.0
+   */
   def getK: Int = k

-  /** Set the maximum number of iterations to run. Default: 100 */
+  /**
+   * Set the maximum number of iterations to run. Default: 100
+   * @since 1.3.0
+   */
   def setMaxIterations(maxIterations: Int): this.type = {
     this.maxIterations = maxIterations
     this
   }

-  /** Return the maximum number of iterations to run */
+  /**
+   * Return the maximum number of iterations to run
+   * @since 1.3.0
+   */
   def getMaxIterations: Int = maxIterations

   /**
    * Set the largest change in log-likelihood at which convergence is
    * considered to have occurred.
+   * @since 1.3.0
    */
   def setConvergenceTol(convergenceTol: Double): this.type
spark git commit: [SPARK-9895] User Guide for RFormula Feature Transformer
Repository: spark
Updated Branches:
  refs/heads/master b0dbaec4f -> 8e0a072f7

[SPARK-9895] User Guide for RFormula Feature Transformer

mengxr

Author: Eric Liang <e...@databricks.com>

Closes #8293 from ericl/docs-2.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8e0a072f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8e0a072f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8e0a072f

Branch: refs/heads/master
Commit: 8e0a072f78b4902d5f7ccc6b15232ed202a117f9
Parents: b0dbaec
Author: Eric Liang <e...@databricks.com>
Authored: Wed Aug 19 15:43:08 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 19 15:43:08 2015 -0700

----------------------------------------------------------------------
 docs/ml-features.md                             | 108 +++
 .../org/apache/spark/ml/feature/RFormula.scala  |   4 +-
 2 files changed, 110 insertions(+), 2 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/8e0a072f/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index d0e8eeb..6309db9 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1477,3 +1477,111 @@ print(output.select("features", "clicked").first())
 </div>
 </div>

+## RFormula
+
+`RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html). It produces a vector column of features and a double column of labels. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
+
+**Examples**
+
+Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`:
+
+~~~
+id | country | hour | clicked
+---|---------|------|---------
+ 7 | "US"    | 18   | 1.0
+ 8 | "CA"    | 12   | 0.0
+ 9 | "NZ"    | 15   | 0.0
+~~~
+
+If we use `RFormula` with a formula string of `clicked ~ country + hour`, which indicates that we want to
+predict `clicked` based on `country` and `hour`, after transformation we should get the following DataFrame:
+
+~~~
+id | country | hour | clicked | features         | label
+---|---------|------|---------|------------------|-------
+ 7 | "US"    | 18   | 1.0     | [0.0, 0.0, 18.0] | 1.0
+ 8 | "CA"    | 12   | 0.0     | [0.0, 1.0, 12.0] | 0.0
+ 9 | "NZ"    | 15   | 0.0     | [1.0, 0.0, 15.0] | 0.0
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`RFormula`](api/scala/index.html#org.apache.spark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight scala %}
+import org.apache.spark.ml.feature.RFormula
+
+val dataset = sqlContext.createDataFrame(Seq(
+  (7, "US", 18, 1.0),
+  (8, "CA", 12, 0.0),
+  (9, "NZ", 15, 0.0)
+)).toDF("id", "country", "hour", "clicked")
+val formula = new RFormula()
+  .setFormula("clicked ~ country + hour")
+  .setFeaturesCol("features")
+  .setLabelCol("label")
+val output = formula.fit(dataset).transform(dataset)
+output.select("features", "label").show()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+[`RFormula`](api/java/org/apache/spark/ml/feature/RFormula.html) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.RFormula;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.*;
+import static org.apache.spark.sql.types.DataTypes.*;
+
+StructType schema = createStructType(new StructField[] {
+  createStructField("id", IntegerType, false),
+  createStructField("country", StringType, false),
+  createStructField("hour", IntegerType, false),
+  createStructField("clicked", DoubleType, false)
+});
+JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(7, "US", 18, 1.0),
+  RowFactory.create(8, "CA", 12, 0.0),
+  RowFactory.create(9, "NZ", 15, 0.0)
+));
+DataFrame dataset = sqlContext.createDataFrame(rdd, schema);
+
+RFormula formula = new RFormula()
+  .setFormula("clicked ~ country + hour")
+  .setFeaturesCol("features")
+  .setLabelCol("label");
+
+DataFrame output = formula.fit(dataset).transform(dataset);
+output.select("features", "label").show();
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+[`RFormula`](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight python %}
+from pyspark.ml.feature import RFormula
+
+dataset = sqlContext.createDataFrame(
+    [(7
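The transformation the guide above describes can be approximated outside Spark. The sketch below is a hypothetical helper, not the RFormula API: it reproduces the example table by one-hot encoding `country` against an assumed level ordering (with `"US"` as the dropped reference level), casting `hour` to a double, and copying `clicked` into the label slot:

```python
def r_formula_transform(rows, factor_levels):
    """Rough sketch of what `clicked ~ country + hour` produces: R-style
    dummy coding of the string column (the reference level gets all zeros),
    the numeric column cast to float, and the response copied to the label.
    Illustrative only -- the real RFormula derives levels from the data.
    """
    out = []
    for _id, country, hour, clicked in rows:
        features = [1.0 if country == lvl else 0.0 for lvl in factor_levels]
        features.append(float(hour))
        out.append((_id, features, clicked))
    return out

rows = [(7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0)]
# With "US" as the dropped reference level and the ordering ["NZ", "CA"],
# this matches the features/label columns in the table above.
for row in r_formula_transform(rows, factor_levels=["NZ", "CA"]):
    print(row)
```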
spark git commit: [SPARK-9895] User Guide for RFormula Feature Transformer
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 5c749c82c -> 56a37b01f

[SPARK-9895] User Guide for RFormula Feature Transformer

mengxr

Author: Eric Liang <e...@databricks.com>

Closes #8293 from ericl/docs-2.

(cherry picked from commit 8e0a072f78b4902d5f7ccc6b15232ed202a117f9)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56a37b01
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56a37b01
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56a37b01

Branch: refs/heads/branch-1.5
Commit: 56a37b01fd07f4f1a8cb4e07b55e1a02cf23a5f7
Parents: 5c749c8
Author: Eric Liang <e...@databricks.com>
Authored: Wed Aug 19 15:43:08 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 19 15:43:15 2015 -0700

----------------------------------------------------------------------
 docs/ml-features.md                             | 108 +++
 .../org/apache/spark/ml/feature/RFormula.scala  |   4 +-
 2 files changed, 110 insertions(+), 2 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/56a37b01/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index d0e8eeb..6309db9 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1477,3 +1477,111 @@ print(output.select("features", "clicked").first())
 </div>
 </div>

+## RFormula
+
+`RFormula` selects columns specified by an [R model formula](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html). It produces a vector column of features and a double column of labels. Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles. If not already present in the DataFrame, the output label column will be created from the specified response variable in the formula.
+
+**Examples**
+
+Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`:
+
+~~~
+id | country | hour | clicked
+---|---------|------|---------
+ 7 | "US"    | 18   | 1.0
+ 8 | "CA"    | 12   | 0.0
+ 9 | "NZ"    | 15   | 0.0
+~~~
+
+If we use `RFormula` with a formula string of `clicked ~ country + hour`, which indicates that we want to
+predict `clicked` based on `country` and `hour`, after transformation we should get the following DataFrame:
+
+~~~
+id | country | hour | clicked | features         | label
+---|---------|------|---------|------------------|-------
+ 7 | "US"    | 18   | 1.0     | [0.0, 0.0, 18.0] | 1.0
+ 8 | "CA"    | 12   | 0.0     | [0.0, 1.0, 12.0] | 0.0
+ 9 | "NZ"    | 15   | 0.0     | [1.0, 0.0, 15.0] | 0.0
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`RFormula`](api/scala/index.html#org.apache.spark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight scala %}
+import org.apache.spark.ml.feature.RFormula
+
+val dataset = sqlContext.createDataFrame(Seq(
+  (7, "US", 18, 1.0),
+  (8, "CA", 12, 0.0),
+  (9, "NZ", 15, 0.0)
+)).toDF("id", "country", "hour", "clicked")
+val formula = new RFormula()
+  .setFormula("clicked ~ country + hour")
+  .setFeaturesCol("features")
+  .setLabelCol("label")
+val output = formula.fit(dataset).transform(dataset)
+output.select("features", "label").show()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+[`RFormula`](api/java/org/apache/spark/ml/feature/RFormula.html) takes an R formula string, and optional parameters for the names of its output columns.
+
+{% highlight java %}
+import java.util.Arrays;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.RFormula;
+import org.apache.spark.sql.DataFrame;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.types.*;
+import static org.apache.spark.sql.types.DataTypes.*;
+
+StructType schema = createStructType(new StructField[] {
+  createStructField("id", IntegerType, false),
+  createStructField("country", StringType, false),
+  createStructField("hour", IntegerType, false),
+  createStructField("clicked", DoubleType, false)
+});
+JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
+  RowFactory.create(7, "US", 18, 1.0),
+  RowFactory.create(8, "CA", 12, 0.0),
+  RowFactory.create(9, "NZ", 15, 0.0)
+));
+DataFrame dataset = sqlContext.createDataFrame(rdd, schema);
+
+RFormula formula = new RFormula()
+  .setFormula("clicked ~ country + hour")
+  .setFeaturesCol("features")
+  .setLabelCol("label");
+
+DataFrame output = formula.fit(dataset).transform(dataset);
+output.select("features", "label").show();
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+[`RFormula`](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output
spark git commit: [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public
Repository: spark
Updated Branches:
  refs/heads/master 5af3838d2 -> dd0614fd6

[SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public

Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel` currently.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #8263 from yanboliang/mlp-public.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dd0614fd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dd0614fd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dd0614fd

Branch: refs/heads/master
Commit: dd0614fd618ad28cb77aecfbd49bb319b98fdba0
Parents: 5af3838
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Mon Aug 17 23:57:02 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Mon Aug 17 23:57:02 2015 -0700

----------------------------------------------------------------------
 .../spark/ml/classification/MultilayerPerceptronClassifier.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/dd0614fd/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
index c154561..ccca4ec 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
@@ -172,8 +172,8 @@ class MultilayerPerceptronClassifier(override val uid: String)
 @Experimental
 class MultilayerPerceptronClassificationModel private[ml] (
     override val uid: String,
-    layers: Array[Int],
-    weights: Vector)
+    val layers: Array[Int],
+    val weights: Vector)
   extends PredictionModel[Vector, MultilayerPerceptronClassificationModel]
   with Serializable {
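With `layers` public, users can reason about model size. As a back-of-the-envelope illustration (plain Python; it assumes one bias term per non-input unit, which may not match Spark's exact network topology), the number of entries one would expect in a `weights` vector for a given `layers` array:

```python
def mlp_weight_count(layers):
    """Connection weights plus biases in a fully connected feed-forward
    network described by a layers array, e.g. [4, 5, 4, 3]: 4 inputs,
    two hidden layers of 5 and 4 units, and 3 output classes."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

print(mlp_weight_count([4, 5, 4, 3]))  # (4+1)*5 + (5+1)*4 + (4+1)*3 = 64
```

Exposing both fields lets callers sanity-check that a loaded model's `weights` length is consistent with its `layers`.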
spark git commit: [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 e5fbe4f24 -> 40b89c38a

[SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public

Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel` currently.

Author: Yanbo Liang <yblia...@gmail.com>

Closes #8263 from yanboliang/mlp-public.

(cherry picked from commit dd0614fd618ad28cb77aecfbd49bb319b98fdba0)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/40b89c38
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/40b89c38
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/40b89c38

Branch: refs/heads/branch-1.5
Commit: 40b89c38ada5edfdd1478dc8f3c983ebcbcc56d5
Parents: e5fbe4f
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Mon Aug 17 23:57:02 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Mon Aug 17 23:57:14 2015 -0700

----------------------------------------------------------------------
 .../spark/ml/classification/MultilayerPerceptronClassifier.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/40b89c38/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
index c154561..ccca4ec 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala
@@ -172,8 +172,8 @@ class MultilayerPerceptronClassifier(override val uid: String)
 @Experimental
 class MultilayerPerceptronClassificationModel private[ml] (
     override val uid: String,
-    layers: Array[Int],
-    weights: Vector)
+    val layers: Array[Int],
+    val weights: Vector)
   extends PredictionModel[Vector, MultilayerPerceptronClassificationModel]
   with Serializable {
spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules
Repository: spark
Updated Branches:
  refs/heads/branch-1.5 b86378cf2 -> 7ff0e5d2f

[SPARK-9900] [MLLIB] User guide for Association Rules

Updates FPM user guide to include Association Rules.

Author: Feynman Liang <fli...@databricks.com>

Closes #8207 from feynmanliang/SPARK-9900-arules.

(cherry picked from commit f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7ff0e5d2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7ff0e5d2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7ff0e5d2

Branch: refs/heads/branch-1.5
Commit: 7ff0e5d2fe07d4a9518ade26b09bcc32f418ca1b
Parents: b86378c
Author: Feynman Liang <fli...@databricks.com>
Authored: Tue Aug 18 12:53:57 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Tue Aug 18 12:54:05 2015 -0700

----------------------------------------------------------------------
 docs/mllib-frequent-pattern-mining.md           | 130 +--
 docs/mllib-guide.md                             |   1 +
 .../mllib/fpm/JavaAssociationRulesSuite.java    |   2 +-
 3 files changed, 118 insertions(+), 15 deletions(-)

----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/7ff0e5d2/docs/mllib-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index 8ea4389..6c06550 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following (hyper-)parameters:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the
-FP-growth algorithm.
-It take a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type.
+[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth)
+implements the FP-growth algorithm. It take an `RDD` of transactions,
+where each transaction is an `Iterable` of items of a generic type.
 Calling `FPGrowth.run` with transactions returns an
 [`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies. The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.
+
 {% highlight scala %}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel}

-val transactions: RDD[Array[String]] = ...
+val transactions: RDD[Array[String]] = sc.parallelize(Seq(
+  "r z h k p",
+  "z y x w v u t s",
+  "s x o n r",
+  "x z y m t s q e",
+  "z",
+  "x z y r q t p")
+  .map(_.split(" ")))

 val fpg = new FPGrowth()
   .setMinSupport(0.2)
@@ -60,29 +72,48 @@ val model = fpg.run(transactions)
 model.freqItemsets.collect().foreach { itemset =>
   println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
 }
+
+val minConfidence = 0.8
+model.generateAssociationRules(minConfidence).collect().foreach { rule =>
+  println(
+    rule.antecedent.mkString("[", ",", "]")
+      + " => " + rule.consequent.mkString("[", ",", "]")
+      + ", " + rule.confidence)
+}
 {% endhighlight %}

 </div>
 <div data-lang="java" markdown="1">

-[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
-FP-growth algorithm.
-It take an `RDD` of transactions, where each transaction is an `Array` of items of a generic type.
-Calling `FPGrowth.run` with transactions returns an
+[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html)
+implements the FP-growth algorithm. It take a `JavaRDD` of
+transactions, where each transaction is an `Array` of items of a generic
+type. Calling `FPGrowth.run` with transactions returns an
 [`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html)
-that stores the frequent itemsets with their frequencies.
+that stores the frequent itemsets with their frequencies. The following
+example illustrates how to mine frequent itemsets and association rules
+(see [Association
+Rules](mllib-frequent-pattern-mining.html#association-rules) for
+details) from `transactions`.

 {% highlight java %}
+import java.util.Arrays;
 import java.util.List;

-import com.google.common.base.Joiner;
-
 import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.mllib.fpm.AssociationRules;
 import org.apache.spark.mllib.fpm.FPGrowth;
 import org.apache.spark.mllib.fpm.FPGrowthModel;

-JavaRDD<List<String>> transactions = ...
+JavaRDD<List<String>> transactions = sc.parallelize(Arrays.asList(
+  Arrays.asList("r z h
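The `generateAssociationRules` call added in the example above derives rule confidence from itemset frequencies: confidence(X => y) = freq(X ∪ {y}) / freq(X). A minimal pure-Python sketch of that relationship (illustrative only, not MLlib's distributed implementation):

```python
def association_rules(freq_itemsets, min_confidence):
    """Generate single-consequent association rules from frequent itemsets:
    for each frequent itemset, split off one item as the consequent and
    keep the rule if freq(itemset) / freq(antecedent) meets the threshold."""
    freq = {frozenset(items): count for items, count in freq_itemsets}
    rules = []
    for itemset, count in freq.items():
        if len(itemset) < 2:
            continue
        for item in itemset:
            antecedent = itemset - {item}
            if antecedent in freq:
                confidence = count / float(freq[antecedent])
                if confidence >= min_confidence:
                    rules.append((sorted(antecedent), item, confidence))
    return rules

# Toy frequencies: {t} and {s} each appear 3 times, and always together.
itemsets = [(["t"], 3), (["s"], 3), (["t", "s"], 3)]
for rule in association_rules(itemsets, min_confidence=0.8):
    print(rule)
```

Both [s] => t and [t] => s come out with confidence 1.0 here, since the items co-occur in every transaction that contains either.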
spark git commit: [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import
Repository: spark Updated Branches: refs/heads/branch-1.5 ec7079f9c - 9bd2e6f7c [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal pmig...@gmail.com Closes #8284 from stared/spark-10085. (cherry picked from commit 8bae9015b7e7b4528ca2bc5180771cb95d2aac13) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9bd2e6f7 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9bd2e6f7 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9bd2e6f7 Branch: refs/heads/branch-1.5 Commit: 9bd2e6f7cbff1835f9abefe26dbe445eaa5b004b Parents: ec7079f Author: Piotr Migdal pmig...@gmail.com Authored: Tue Aug 18 12:59:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:59:36 2015 -0700 -- docs/mllib-linear-methods.md | 2 -- 1 file changed, 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9bd2e6f7/docs/mllib-linear-methods.md -- diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index 07655ba..e9b2d27 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -504,7 +504,6 @@ will in the future. {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel from pyspark.mllib.regression import LabeledPoint -from numpy import array # Load and parse the data def parsePoint(line): @@ -676,7 +675,6 @@ Note that the Python API does not yet support model save/load but will in the fu {% highlight python %} from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel -from numpy import array # Load and parse the data def parsePoint(line): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import
Repository: spark Updated Branches: refs/heads/master 747c2ba80 - 8bae9015b [SPARK-10085] [MLLIB] [DOCS] removed unnecessary numpy array import See https://issues.apache.org/jira/browse/SPARK-10085 Author: Piotr Migdal pmig...@gmail.com Closes #8284 from stared/spark-10085. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8bae9015 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8bae9015 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8bae9015 Branch: refs/heads/master Commit: 8bae9015b7e7b4528ca2bc5180771cb95d2aac13 Parents: 747c2ba Author: Piotr Migdal pmig...@gmail.com Authored: Tue Aug 18 12:59:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:59:28 2015 -0700 -- docs/mllib-linear-methods.md | 2 -- 1 file changed, 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8bae9015/docs/mllib-linear-methods.md -- diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index 07655ba..e9b2d27 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -504,7 +504,6 @@ will in the future. {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel from pyspark.mllib.regression import LabeledPoint -from numpy import array # Load and parse the data def parsePoint(line): @@ -676,7 +675,6 @@ Note that the Python API does not yet support model save/load but will in the fu {% highlight python %} from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel -from numpy import array # Load and parse the data def parsePoint(line): - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide
Repository: spark Updated Branches: refs/heads/master f5ea39129 -> f4fa61eff [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide Add Python examples for mllib IsotonicRegression user guide Author: Yanbo Liang yblia...@gmail.com Closes #8225 from yanboliang/spark-10029. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4fa61ef Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4fa61ef Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4fa61ef Branch: refs/heads/master Commit: f4fa61effe34dae2f0eab0bef57b2dee220cf92f Parents: f5ea391 Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:55:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:55:36 2015 -0700 -- docs/mllib-isotonic-regression.md | 35 ++ 1 file changed, 35 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f4fa61ef/docs/mllib-isotonic-regression.md -- diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md index 5732bc4..6aa881f 100644 --- a/docs/mllib-isotonic-regression.md +++ b/docs/mllib-isotonic-regression.md @@ -160,4 +160,39 @@ model.save(sc.sc(), "myModelPath"); IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(sc.sc(), "myModelPath"); {% endhighlight %} </div> + +<div data-lang="python" markdown="1"> +Data are read from a file where each line has the format "label,feature", +e.g. "4710.28,500.00". The data are split into training and test sets. +A model is created using the training set, and a mean squared error is calculated from the predicted +labels and real labels in the test set. + +{% highlight python %} +import math +from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel + +data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt") + +# Create label, feature, weight tuples from input data with weight set to default value 1.0. +parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) + (1.0,)) + +# Split data into training (60%) and test (40%) sets. +training, test = parsedData.randomSplit([0.6, 0.4], 11) + +# Create isotonic regression model from training data. +# Isotonic parameter defaults to true so it is only shown for demonstration +model = IsotonicRegression.train(training) + +# Create tuples of predicted and real labels. +predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0])) + +# Calculate mean squared error between predicted and real labels. +meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 2)).mean() +print("Mean Squared Error = " + str(meanSquaredError)) + +# Save and load model +model.save(sc, "myModelPath") +sameModel = IsotonicRegressionModel.load(sc, "myModelPath") +{% endhighlight %} +</div> </div>
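`IsotonicRegression.train` above fits a non-decreasing function to the (label, feature, weight) tuples. The underlying idea can be sketched in plain Python with the pool-adjacent-violators algorithm (PAVA) — a sequential simplification for illustration; MLlib's actual implementation is parallelized and keys on the feature values.

```python
def isotonic_fit(y, weights=None):
    """Least-squares non-decreasing fit to y via pool-adjacent-violators."""
    if weights is None:
        weights = [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for value, w in zip(y, weights):
        blocks.append([value, w, 1])
        # Pool while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, n1 + n2])
    fitted = []
    for mean, _, n in blocks:
        fitted.extend([mean] * n)
    return fitted

fitted = isotonic_fit([1, 3, 2, 4])  # -> [1, 2.5, 2.5, 4]
```

The violating pair (3, 2) is pooled into its weighted mean 2.5, which no longer violates the block before it, so the result is monotone.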
spark git commit: [SPARK-9900] [MLLIB] User guide for Association Rules
Repository: spark Updated Branches: refs/heads/master c1840a862 - f5ea39129 [SPARK-9900] [MLLIB] User guide for Association Rules Updates FPM user guide to include Association Rules. Author: Feynman Liang fli...@databricks.com Closes #8207 from feynmanliang/SPARK-9900-arules. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f5ea3912 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f5ea3912 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f5ea3912 Branch: refs/heads/master Commit: f5ea3912900ccdf23e2eb419a342bfe3c0c0b61b Parents: c1840a8 Author: Feynman Liang fli...@databricks.com Authored: Tue Aug 18 12:53:57 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:53:57 2015 -0700 -- docs/mllib-frequent-pattern-mining.md | 130 +-- docs/mllib-guide.md | 1 + .../mllib/fpm/JavaAssociationRulesSuite.java| 2 +- 3 files changed, 118 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f5ea3912/docs/mllib-frequent-pattern-mining.md -- diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index 8ea4389..6c06550 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -39,18 +39,30 @@ MLlib's FP-growth implementation takes the following (hyper-)parameters: div class=codetabs div data-lang=scala markdown=1 -[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the -FP-growth algorithm. -It take a `JavaRDD` of transactions, where each transaction is an `Iterable` of items of a generic type. +[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) +implements the FP-growth algorithm. It take an `RDD` of transactions, +where each transaction is an `Iterable` of items of a generic type. 
Calling `FPGrowth.run` with transactions returns an [`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel) -that stores the frequent itemsets with their frequencies. +that stores the frequent itemsets with their frequencies. The following +example illustrates how to mine frequent itemsets and association rules +(see [Association +Rules](mllib-frequent-pattern-mining.html#association-rules) for +details) from `transactions`. + {% highlight scala %} import org.apache.spark.rdd.RDD import org.apache.spark.mllib.fpm.{FPGrowth, FPGrowthModel} -val transactions: RDD[Array[String]] = ... +val transactions: RDD[Array[String]] = sc.parallelize(Seq( + "r z h k p", + "z y x w v u t s", + "s x o n r", + "x z y m t s q e", + "z", + "x z y r q t p") + .map(_.split(" "))) val fpg = new FPGrowth() .setMinSupport(0.2) @@ -60,29 +72,48 @@ val model = fpg.run(transactions) model.freqItemsets.collect().foreach { itemset => println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq) } + +val minConfidence = 0.8 +model.generateAssociationRules(minConfidence).collect().foreach { rule => + println( + rule.antecedent.mkString("[", ",", "]") + + " => " + rule.consequent.mkString("[", ",", "]") + + ", " + rule.confidence) +} {% endhighlight %} </div> <div data-lang="java" markdown="1"> -[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) implements the -FP-growth algorithm. -It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type. -Calling `FPGrowth.run` with transactions returns an +[`FPGrowth`](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) +implements the FP-growth algorithm. It takes a `JavaRDD` of +transactions, where each transaction is an `Array` of items of a generic +type. Calling `FPGrowth.run` with transactions returns an +[`FPGrowthModel`](api/java/org/apache/spark/mllib/fpm/FPGrowthModel.html) -that stores the frequent itemsets with their frequencies.
The following +example illustrates how to mine frequent itemsets and association rules +(see [Association +Rules](mllib-frequent-pattern-mining.html#association-rules) for +details) from `transactions`. {% highlight java %} +import java.util.Arrays; import java.util.List; -import com.google.common.base.Joiner; - import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.fpm.AssociationRules; import org.apache.spark.mllib.fpm.FPGrowth; import org.apache.spark.mllib.fpm.FPGrowthModel; -JavaRDD<List<String>> transactions = ... +JavaRDD<List<String>> transactions = sc.parallelize(Arrays.asList( + Arrays.asList("r z h k p".split(" ")), + Arrays.asList("z y x w v u t s".split(" ")), + Arrays.asList("s x o n r".split(" ")), + Arrays.asList("z".split(" ")), + Arrays.asList("x z y m t s q e".split(" ")), + Arrays.asList("x z y r q t p".split(" "))));
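The support/confidence semantics used in the examples above can be illustrated without Spark. The sketch below is *not* Spark's FP-growth implementation (which builds an FP-tree to avoid candidate generation); it is a brute-force enumeration over the same six example transactions, with the same `minSupport = 0.2` and `minConfidence = 0.8`, and — like Spark's `AssociationRules` — it only emits rules with a single-item consequent.

```python
from itertools import combinations

# The six transactions from the example above (space-separated items).
transactions = [t.split(" ") for t in [
    "r z h k p", "z y x w v u t s", "s x o n r",
    "x z y m t s q e", "z", "x z y r q t p"]]

def frequent_itemsets(transactions, min_support):
    """Brute-force: count every candidate itemset, keep those whose
    support (fraction of transactions containing it) >= min_support."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted(frozenset().union(*sets))
    freq = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            cand = frozenset(cand)
            count = sum(1 for s in sets if cand <= s)
            if count / n >= min_support:
                freq[cand] = count
                found = True
        if not found:
            break  # Apriori property: no frequent k-itemset => none larger.
    return freq

def association_rules(freq, min_confidence):
    """Emit (antecedent, consequent_item, confidence) for rules whose
    confidence = support(itemset) / support(antecedent) >= min_confidence."""
    rules = []
    for itemset, count in freq.items():
        for item in itemset:
            antecedent = itemset - {item}
            if antecedent and antecedent in freq:
                confidence = count / freq[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, item, confidence))
    return rules

freq = frequent_itemsets(transactions, min_support=0.2)
rules = association_rules(freq, min_confidence=0.8)
```

For instance, the rule `{t} => z` comes out with confidence 1.0, because all three transactions containing `t` also contain `z`.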
spark git commit: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree
Repository: spark Updated Branches: refs/heads/branch-1.5 8b0df5a5e - 56f4da263 [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree Added since tags to mllib.tree Author: Bryan Cutler bjcut...@us.ibm.com Closes #7380 from BryanCutler/sinceTag-mllibTree-8924. (cherry picked from commit 1dbffba37a84c62202befd3911d25888f958191d) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/56f4da26 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/56f4da26 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/56f4da26 Branch: refs/heads/branch-1.5 Commit: 56f4da2633aab6d1f25c03b1cf567c2c68374fb5 Parents: 8b0df5a Author: Bryan Cutler bjcut...@us.ibm.com Authored: Tue Aug 18 14:58:30 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 14:58:37 2015 -0700 -- .../apache/spark/mllib/tree/DecisionTree.scala | 13 +++ .../spark/mllib/tree/GradientBoostedTrees.scala | 10 ++ .../apache/spark/mllib/tree/RandomForest.scala | 10 ++ .../spark/mllib/tree/configuration/Algo.scala | 1 + .../tree/configuration/BoostingStrategy.scala | 6 .../mllib/tree/configuration/FeatureType.scala | 1 + .../tree/configuration/QuantileStrategy.scala | 1 + .../mllib/tree/configuration/Strategy.scala | 20 ++- .../spark/mllib/tree/impurity/Entropy.scala | 4 +++ .../apache/spark/mllib/tree/impurity/Gini.scala | 4 +++ .../spark/mllib/tree/impurity/Impurity.scala| 3 ++ .../spark/mllib/tree/impurity/Variance.scala| 4 +++ .../spark/mllib/tree/loss/AbsoluteError.scala | 2 ++ .../apache/spark/mllib/tree/loss/LogLoss.scala | 2 ++ .../org/apache/spark/mllib/tree/loss/Loss.scala | 3 ++ .../apache/spark/mllib/tree/loss/Losses.scala | 6 .../spark/mllib/tree/loss/SquaredError.scala| 2 ++ .../mllib/tree/model/DecisionTreeModel.scala| 22 .../mllib/tree/model/InformationGainStats.scala | 1 + .../apache/spark/mllib/tree/model/Node.scala| 3 ++ 
.../apache/spark/mllib/tree/model/Predict.scala | 1 + .../apache/spark/mllib/tree/model/Split.scala | 1 + .../mllib/tree/model/treeEnsembleModels.scala | 37 .../org/apache/spark/mllib/tree/package.scala | 1 + 24 files changed, 157 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/56f4da26/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala index cecd1fe..e5200b8 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala @@ -43,6 +43,7 @@ import org.apache.spark.util.random.XORShiftRandom * @param strategy The configuration parameters for the tree algorithm which specify the type * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. + * @since 1.0.0 */ @Experimental class DecisionTree (private val strategy: Strategy) extends Serializable with Logging { @@ -53,6 +54,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * Method to train a decision tree model over an RDD * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] * @return DecisionTreeModel that can be used for prediction + * @since 1.2.0 */ def run(input: RDD[LabeledPoint]): DecisionTreeModel = { // Note: random seed will not be used since numTrees = 1. @@ -62,6 +64,9 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo } } +/** + * @since 1.0.0 + */ object DecisionTree extends Serializable with Logging { /** @@ -79,6 +84,7 @@ object DecisionTree extends Serializable with Logging { * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. 
* @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = { new DecisionTree(strategy).run(input) @@ -100,6 +106,7 @@ object DecisionTree extends Serializable with Logging { * @param maxDepth Maximum depth of the tree. * E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. * @return DecisionTreeModel that can be used for prediction + * @since
spark git commit: [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree
Repository: spark Updated Branches: refs/heads/master 492ac1fac - 1dbffba37 [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree Added since tags to mllib.tree Author: Bryan Cutler bjcut...@us.ibm.com Closes #7380 from BryanCutler/sinceTag-mllibTree-8924. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1dbffba3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1dbffba3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1dbffba3 Branch: refs/heads/master Commit: 1dbffba37a84c62202befd3911d25888f958191d Parents: 492ac1f Author: Bryan Cutler bjcut...@us.ibm.com Authored: Tue Aug 18 14:58:30 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 14:58:30 2015 -0700 -- .../apache/spark/mllib/tree/DecisionTree.scala | 13 +++ .../spark/mllib/tree/GradientBoostedTrees.scala | 10 ++ .../apache/spark/mllib/tree/RandomForest.scala | 10 ++ .../spark/mllib/tree/configuration/Algo.scala | 1 + .../tree/configuration/BoostingStrategy.scala | 6 .../mllib/tree/configuration/FeatureType.scala | 1 + .../tree/configuration/QuantileStrategy.scala | 1 + .../mllib/tree/configuration/Strategy.scala | 20 ++- .../spark/mllib/tree/impurity/Entropy.scala | 4 +++ .../apache/spark/mllib/tree/impurity/Gini.scala | 4 +++ .../spark/mllib/tree/impurity/Impurity.scala| 3 ++ .../spark/mllib/tree/impurity/Variance.scala| 4 +++ .../spark/mllib/tree/loss/AbsoluteError.scala | 2 ++ .../apache/spark/mllib/tree/loss/LogLoss.scala | 2 ++ .../org/apache/spark/mllib/tree/loss/Loss.scala | 3 ++ .../apache/spark/mllib/tree/loss/Losses.scala | 6 .../spark/mllib/tree/loss/SquaredError.scala| 2 ++ .../mllib/tree/model/DecisionTreeModel.scala| 22 .../mllib/tree/model/InformationGainStats.scala | 1 + .../apache/spark/mllib/tree/model/Node.scala| 3 ++ .../apache/spark/mllib/tree/model/Predict.scala | 1 + .../apache/spark/mllib/tree/model/Split.scala | 1 + .../mllib/tree/model/treeEnsembleModels.scala 
| 37 .../org/apache/spark/mllib/tree/package.scala | 1 + 24 files changed, 157 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/1dbffba3/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala index cecd1fe..e5200b8 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala @@ -43,6 +43,7 @@ import org.apache.spark.util.random.XORShiftRandom * @param strategy The configuration parameters for the tree algorithm which specify the type * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. + * @since 1.0.0 */ @Experimental class DecisionTree (private val strategy: Strategy) extends Serializable with Logging { @@ -53,6 +54,7 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo * Method to train a decision tree model over an RDD * @param input Training data: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]] * @return DecisionTreeModel that can be used for prediction + * @since 1.2.0 */ def run(input: RDD[LabeledPoint]): DecisionTreeModel = { // Note: random seed will not be used since numTrees = 1. @@ -62,6 +64,9 @@ class DecisionTree (private val strategy: Strategy) extends Serializable with Lo } } +/** + * @since 1.0.0 + */ object DecisionTree extends Serializable with Logging { /** @@ -79,6 +84,7 @@ object DecisionTree extends Serializable with Logging { * of algorithm (classification, regression, etc.), feature type (continuous, * categorical), depth of the tree, quantile calculation strategy, etc. 
* @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train(input: RDD[LabeledPoint], strategy: Strategy): DecisionTreeModel = { new DecisionTree(strategy).run(input) @@ -100,6 +106,7 @@ object DecisionTree extends Serializable with Logging { * @param maxDepth Maximum depth of the tree. * E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. * @return DecisionTreeModel that can be used for prediction + * @since 1.0.0 */ def train( input: RDD[LabeledPoint], @@ -127,6 +134,7 @@ object DecisionTree extends Serializable
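The `Entropy` and `Gini` classes annotated in this commit are the impurity measures a decision tree minimizes when choosing splits. As a reminder of what they compute, here is a plain-Python restatement of the two formulas over per-class counts (a sketch only; MLlib's `Entropy` uses log base 2, which this follows):

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum_i p_i^2 over class probabilities p_i."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy impurity: -sum_i p_i * log2(p_i), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

impurity = gini([5, 5])  # -> 0.5: a perfectly mixed binary node
```

A pure node (all samples in one class) scores 0 under both measures; splits are chosen to maximize the impurity decrease from parent to children.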
spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide
Repository: spark Updated Branches: refs/heads/master f4fa61eff -> 747c2ba80 [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide Add Python example for mllib LDAModel user guide Author: Yanbo Liang yblia...@gmail.com Closes #8227 from yanboliang/spark-10032. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/747c2ba8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/747c2ba8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/747c2ba8 Branch: refs/heads/master Commit: 747c2ba8006d5b86f3be8dfa9ace639042a35628 Parents: f4fa61e Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:56:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:56:36 2015 -0700 -- docs/mllib-clustering.md | 28 1 file changed, 28 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/747c2ba8/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index bb875ae..fd9ab25 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -564,6 +564,34 @@ public class JavaLDAExample { {% endhighlight %} </div> +<div data-lang="python" markdown="1"> +{% highlight python %} +from pyspark.mllib.clustering import LDA, LDAModel +from pyspark.mllib.linalg import Vectors + +# Load and parse the data +data = sc.textFile("data/mllib/sample_lda_data.txt") +parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')])) +# Index documents with unique IDs +corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache() + +# Cluster the documents into three topics using LDA +ldaModel = LDA.train(corpus, k=3) + +# Output topics. Each is a distribution over words (matching word count vectors) +print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):") +topics = ldaModel.topicsMatrix() +for topic in range(3): +print("Topic " + str(topic) + ":") +for word in range(0, ldaModel.vocabSize()): +print(" " + str(topics[word][topic])) + +# Save and load model +ldaModel.save(sc, "myModelPath") +sameModel = LDAModel.load(sc, "myModelPath") +{% endhighlight %} +</div> + </div> ## Streaming k-means
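`topicsMatrix()` in the example above returns a vocabSize × k matrix whose columns are unnormalized topic weights. To read each topic as a probability distribution over the vocabulary, each column can be normalized — sketched here in plain Python on a hypothetical 4-word vocabulary with 2 topics (the matrix values below are made up, not the output of the example):

```python
def topic_distributions(topics_matrix):
    """Normalize each column of a vocabSize x k matrix to sum to 1."""
    vocab_size = len(topics_matrix)
    k = len(topics_matrix[0])
    dists = []
    for topic in range(k):
        column = [topics_matrix[word][topic] for word in range(vocab_size)]
        total = sum(column)
        dists.append([weight / total for weight in column])
    return dists

# Hypothetical topic matrix: rows are words, columns are topics.
topics = [[10.0, 0.0],
          [5.0, 1.0],
          [0.0, 6.0],
          [5.0, 3.0]]
dists = topic_distributions(topics)  # dists[0] == [0.5, 0.25, 0.0, 0.25]
```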
spark git commit: [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide
Repository: spark Updated Branches: refs/heads/branch-1.5 80debff12 -> ec7079f9c [SPARK-10032] [PYSPARK] [DOC] Add Python example for mllib LDAModel user guide Add Python example for mllib LDAModel user guide Author: Yanbo Liang yblia...@gmail.com Closes #8227 from yanboliang/spark-10032. (cherry picked from commit 747c2ba8006d5b86f3be8dfa9ace639042a35628) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ec7079f9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ec7079f9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ec7079f9 Branch: refs/heads/branch-1.5 Commit: ec7079f9c94cb98efdac6f92b7c85efb0e67492e Parents: 80debff Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:56:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:56:43 2015 -0700 -- docs/mllib-clustering.md | 28 1 file changed, 28 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ec7079f9/docs/mllib-clustering.md -- diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index bb875ae..fd9ab25 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -564,6 +564,34 @@ public class JavaLDAExample { {% endhighlight %} </div> +<div data-lang="python" markdown="1"> +{% highlight python %} +from pyspark.mllib.clustering import LDA, LDAModel +from pyspark.mllib.linalg import Vectors + +# Load and parse the data +data = sc.textFile("data/mllib/sample_lda_data.txt") +parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')])) +# Index documents with unique IDs +corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache() + +# Cluster the documents into three topics using LDA +ldaModel = LDA.train(corpus, k=3) + +# Output topics. Each is a distribution over words (matching word count vectors) +print("Learned topics (as distributions over vocab of " + str(ldaModel.vocabSize()) + " words):") +topics = ldaModel.topicsMatrix() +for topic in range(3): +print("Topic " + str(topic) + ":") +for word in range(0, ldaModel.vocabSize()): +print(" " + str(topics[word][topic])) + +# Save and load model +ldaModel.save(sc, "myModelPath") +sameModel = LDAModel.load(sc, "myModelPath") +{% endhighlight %} +</div> + </div> ## Streaming k-means
spark git commit: [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide
Repository: spark Updated Branches: refs/heads/branch-1.5 7ff0e5d2f -> 80debff12 [SPARK-10029] [MLLIB] [DOC] Add Python examples for mllib IsotonicRegression user guide Add Python examples for mllib IsotonicRegression user guide Author: Yanbo Liang yblia...@gmail.com Closes #8225 from yanboliang/spark-10029. (cherry picked from commit f4fa61effe34dae2f0eab0bef57b2dee220cf92f) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/80debff1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/80debff1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/80debff1 Branch: refs/heads/branch-1.5 Commit: 80debff123e0b5dcc4e6f5899753a736de2c8e75 Parents: 7ff0e5d Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 18 12:55:36 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 18 12:55:42 2015 -0700 -- docs/mllib-isotonic-regression.md | 35 ++ 1 file changed, 35 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/80debff1/docs/mllib-isotonic-regression.md -- diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md index 5732bc4..6aa881f 100644 --- a/docs/mllib-isotonic-regression.md +++ b/docs/mllib-isotonic-regression.md @@ -160,4 +160,39 @@ model.save(sc.sc(), "myModelPath"); IsotonicRegressionModel sameModel = IsotonicRegressionModel.load(sc.sc(), "myModelPath"); {% endhighlight %} </div> + +<div data-lang="python" markdown="1"> +Data are read from a file where each line has the format "label,feature", +e.g. "4710.28,500.00". The data are split into training and test sets. +A model is created using the training set, and a mean squared error is calculated from the predicted +labels and real labels in the test set. + +{% highlight python %} +import math +from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel + +data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt") + +# Create label, feature, weight tuples from input data with weight set to default value 1.0. +parsedData = data.map(lambda line: tuple([float(x) for x in line.split(',')]) + (1.0,)) + +# Split data into training (60%) and test (40%) sets. +training, test = parsedData.randomSplit([0.6, 0.4], 11) + +# Create isotonic regression model from training data. +# Isotonic parameter defaults to true so it is only shown for demonstration +model = IsotonicRegression.train(training) + +# Create tuples of predicted and real labels. +predictionAndLabel = test.map(lambda p: (model.predict(p[1]), p[0])) + +# Calculate mean squared error between predicted and real labels. +meanSquaredError = predictionAndLabel.map(lambda pl: math.pow((pl[0] - pl[1]), 2)).mean() +print("Mean Squared Error = " + str(meanSquaredError)) + +# Save and load model +model.save(sc, "myModelPath") +sameModel = IsotonicRegressionModel.load(sc, "myModelPath") +{% endhighlight %} +</div> </div>
spark git commit: [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg
Repository: spark Updated Branches: refs/heads/master fdaf17f63 - 088b11ec5 [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.Samavihome Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.local Closes #7729 from sabhyankar/branch_8920. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/088b11ec Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/088b11ec Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/088b11ec Branch: refs/heads/master Commit: 088b11ec5949e135cb3db2a1ce136837e046c288 Parents: fdaf17f Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.Samavihome Authored: Mon Aug 17 16:00:23 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 16:00:23 2015 -0700 -- .../apache/spark/mllib/linalg/Matrices.scala| 63 .../linalg/SingularValueDecomposition.scala | 1 + .../org/apache/spark/mllib/linalg/Vectors.scala | 60 +++ .../mllib/linalg/distributed/BlockMatrix.scala | 43 +++-- .../linalg/distributed/CoordinateMatrix.scala | 28 +++-- .../linalg/distributed/DistributedMatrix.scala | 1 + .../linalg/distributed/IndexedRowMatrix.scala | 24 +++- .../mllib/linalg/distributed/RowMatrix.scala| 24 +++- 8 files changed, 227 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/088b11ec/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala index 1139ce3..dfa8910 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala @@ -227,6 +227,7 @@ private[spark] class MatrixUDT extends UserDefinedType[Matrix] { * @param values matrix entries in column major if not transposed or in row major otherwise * @param isTransposed whether the matrix is 
transposed. If true, `values` stores the matrix in * row major. + * @since 1.0.0 */ @SQLUserDefinedType(udt = classOf[MatrixUDT]) class DenseMatrix( @@ -252,6 +253,7 @@ class DenseMatrix( * @param numRows number of rows * @param numCols number of columns * @param values matrix entries in column major + * @since 1.3.0 */ def this(numRows: Int, numCols: Int, values: Array[Double]) = this(numRows, numCols, values, false) @@ -276,6 +278,9 @@ class DenseMatrix( private[mllib] def apply(i: Int): Double = values(i) + /** + * @since 1.3.0 + */ override def apply(i: Int, j: Int): Double = values(index(i, j)) private[mllib] def index(i: Int, j: Int): Int = { @@ -286,6 +291,9 @@ class DenseMatrix( values(index(i, j)) = v } + /** + * @since 1.4.0 + */ override def copy: DenseMatrix = new DenseMatrix(numRows, numCols, values.clone()) private[spark] def map(f: Double = Double) = new DenseMatrix(numRows, numCols, values.map(f), @@ -301,6 +309,9 @@ class DenseMatrix( this } + /** + * @since 1.3.0 + */ override def transpose: DenseMatrix = new DenseMatrix(numCols, numRows, values, !isTransposed) private[spark] override def foreachActive(f: (Int, Int, Double) = Unit): Unit = { @@ -331,13 +342,20 @@ class DenseMatrix( } } + /** + * @since 1.5.0 + */ override def numNonzeros: Int = values.count(_ != 0) + /** + * @since 1.5.0 + */ override def numActives: Int = values.length /** * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed * set to false. + * @since 1.3.0 */ def toSparse: SparseMatrix = { val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble @@ -365,6 +383,7 @@ class DenseMatrix( /** * Factory methods for [[org.apache.spark.mllib.linalg.DenseMatrix]]. 
+ * @since 1.3.0 */ object DenseMatrix { @@ -373,6 +392,7 @@ object DenseMatrix { * @param numRows number of rows of the matrix * @param numCols number of columns of the matrix * @return `DenseMatrix` with size `numRows` x `numCols` and values of zeros + * @since 1.3.0 */ def zeros(numRows: Int, numCols: Int): DenseMatrix = { require(numRows.toLong * numCols <= Int.MaxValue, @@ -385,6 +405,7 @@ object DenseMatrix { * @param numRows number of rows of the matrix * @param numCols number of columns of the matrix * @return `DenseMatrix` with size `numRows` x `numCols` and values of ones + * @since 1.3.0 */ def ones(numRows: Int, numCols: Int): DenseMatrix = { require(numRows.toLong * numCols <= Int.MaxValue, @@ -396,6 +417,7 @@ object
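The `@since` tags being added above are Scaladoc markers recording the release in which each public API first appeared. PySpark carries the same idea as a `since` decorator that appends a `versionadded` note to a function's docstring; a rough plain-Python sketch of that pattern (illustrative only, not the actual pyspark implementation):

```python
def since(version):
    # Sketch of a version-tagging decorator in the spirit of PySpark's
    # `since`: append a ".. versionadded::" note to the docstring so the
    # generated API docs show when the member was introduced.
    def decorator(f):
        f.__doc__ = (f.__doc__ or "") + "\n\n.. versionadded:: " + version
        return f
    return decorator

@since("1.3.0")
def zeros(num_rows, num_cols):
    """Return a num_rows x num_cols matrix of zeros."""
    return [[0.0] * num_cols for _ in range(num_rows)]
```

The decorator leaves the function's behavior untouched; only the docstring gains the version note.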
spark git commit: [SPARK-7707] User guide and example code for KernelDensity
Repository: spark Updated Branches: refs/heads/branch-1.5 18b3d11f7 -> 5de0ffbd0 [SPARK-7707] User guide and example code for KernelDensity Author: Sandy Ryza sa...@cloudera.com Closes #8230 from sryza/sandy-spark-7707. (cherry picked from commit f9d1a92aa1bac4494022d78559b871149579e6e8) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5de0ffbd Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5de0ffbd Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5de0ffbd Branch: refs/heads/branch-1.5 Commit: 5de0ffbd0e0aef170171cec8808eb4ec1ba79b0f Parents: 18b3d11 Author: Sandy Ryza sa...@cloudera.com Authored: Mon Aug 17 17:57:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:58:06 2015 -0700 -- docs/mllib-statistics.md | 77 +++ 1 file changed, 77 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/5de0ffbd/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index be04d0b..80a9d06 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 100L, 10) v = u.map(lambda x: 1.0 + 2.0 * x) {% endhighlight %} /div +/div + +## Kernel density estimation + +[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique +useful for visualizing empirical probability distributions without requiring assumptions about the +particular distribution that the observed samples are drawn from. It computes an estimate of the +probability density function of a random variable, evaluated at a given set of points. It achieves +this estimate by expressing the PDF of the empirical distribution at a particular point as the +mean of PDFs of normal distributions centered around each of the samples.
+ +div class=codetabs + +div data-lang=scala markdown=1 +[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight scala %} +import org.apache.spark.mllib.stat.KernelDensity +import org.apache.spark.rdd.RDD + +val data: RDD[Double] = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +val kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0) + +// Find density estimates for the given values +val densities = kd.estimate(Array(-1.0, 2.0, 5.0)) +{% endhighlight %} +/div + +div data-lang=java markdown=1 +[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight java %} +import org.apache.spark.mllib.stat.KernelDensity; +import org.apache.spark.rdd.RDD; + +RDD<Double> data = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +KernelDensity kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0); + +// Find density estimates for the given values +double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0}); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight python %} +from pyspark.mllib.stat import KernelDensity + +data = ...
# an RDD of sample data + +# Construct the density estimator with the sample data and a standard deviation for the Gaussian +# kernels +kd = KernelDensity() +kd.setSample(data) +kd.setBandwidth(3.0) + +# Find density estimates for the given values +densities = kd.estimate([-1.0, 2.0, 5.0]) +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
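The description in the guide translates directly into a few lines of plain Python: the density estimate at each query point is just the mean of Gaussian PDFs centered on the samples, with the bandwidth playing the role of their standard deviation. A minimal sketch under that reading (not the Spark implementation, which distributes the computation over an RDD; function names here are illustrative):

```python
import math

def normal_pdf(x, mean, std):
    # PDF of a normal distribution with the given mean and standard deviation
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def kde_estimate(samples, points, bandwidth):
    # Density at each query point = mean of the Gaussian PDFs centered on
    # each sample, using the bandwidth as the common standard deviation
    return [sum(normal_pdf(p, s, bandwidth) for s in samples) / len(samples)
            for p in points]

samples = [1.0, 2.0, 3.0]
densities = kde_estimate(samples, [-1.0, 2.0, 5.0], bandwidth=3.0)
```

With samples clustered around 2.0, the estimate at 2.0 exceeds the estimates at -1.0 and 5.0, which are equal by symmetry.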
spark git commit: [SPARK-7808] [ML] add package doc for ml.feature
Repository: spark Updated Branches: refs/heads/branch-1.5 bfb4c8425 - 35542504c [SPARK-7808] [ML] add package doc for ml.feature This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng m...@databricks.com Closes #8260 from mengxr/SPARK-7808. (cherry picked from commit e290029a356222bddf4da1be0525a221a5a1630b) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/35542504 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/35542504 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/35542504 Branch: refs/heads/branch-1.5 Commit: 35542504c51c5754db7812cf7bec674a957e66ad Parents: bfb4c84 Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 17 19:40:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 19:40:58 2015 -0700 -- .../org/apache/spark/ml/feature/package.scala | 89 1 file changed, 89 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/35542504/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala new file mode 100644 index 000..4571ab2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel, VectorAssembler} +import org.apache.spark.sql.DataFrame + +/** + * == Feature transformers == + * + * The `ml.feature` package provides common feature transformers that help convert raw data or + * features into more suitable forms for model fitting. + * Most feature transformers are implemented as [[Transformer]]s, which transform one [[DataFrame]] + * into another, e.g., [[HashingTF]]. + * Some feature transformers are implemented as [[Estimator]]s, because the transformation requires + * some aggregated information of the dataset, e.g., document frequencies in [[IDF]]. + * For those feature transformers, calling [[Estimator!.fit]] is required to obtain the model first, + * e.g., [[IDFModel]], in order to apply transformation. + * The transformation is usually done by appending new columns to the input [[DataFrame]], so all + * input columns are carried over. + * + * We try to make each transformer minimal, so it becomes flexible to assemble feature + * transformation pipelines. + * [[Pipeline]] can be used to chain feature transformers, and [[VectorAssembler]] can be used to + * combine multiple feature transformations, for example: + * + * {{{ + * import org.apache.spark.ml.feature._ + * import org.apache.spark.ml.Pipeline + * + * // a DataFrame with three columns: id (integer), text (string), and rating (double). 
+ * val df = sqlContext.createDataFrame(Seq( + * (0, "Hi I heard about Spark", 3.0), + * (1, "I wish Java could use case classes", 4.0), + * (2, "Logistic regression models are neat", 4.0) + * )).toDF("id", "text", "rating") + * + * // define feature transformers + * val tok = new RegexTokenizer() + * .setInputCol("text") + * .setOutputCol("words") + * val sw = new StopWordsRemover() + * .setInputCol("words") + * .setOutputCol("filtered_words") + * val tf = new HashingTF() + * .setInputCol("filtered_words") + * .setOutputCol("tf") + * .setNumFeatures(1) + * val idf = new IDF() + * .setInputCol("tf") + * .setOutputCol("tf_idf") + * val assembler = new VectorAssembler() + * .setInputCols(Array("tf_idf", "rating")) + * .setOutputCol("features") + * + * // assemble and fit the feature transformation pipeline + * val pipeline = new Pipeline() + * .setStages(Array(tok, sw, tf, idf, assembler)) + * val model = pipeline.fit(df) + * + * // save
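The Transformer-versus-Estimator distinction described in the package doc can be illustrated without Spark: a stateless transformer maps rows directly, while an estimator such as IDF must first aggregate statistics (document frequencies) over the whole dataset before it can transform anything. A hedged plain-Python sketch (class names are hypothetical, and the hashing step of HashingTF is replaced by plain term counts):

```python
import math

class TermCounter:
    """Transformer-style: stateless, maps each tokenized document to term counts."""
    def transform(self, docs):
        return [{t: doc.count(t) for t in set(doc)} for doc in docs]

class Idf:
    """Estimator-style: fit() scans the whole dataset for document
    frequencies, then returns a model that applies the weighting."""
    def fit(self, term_counts):
        n = len(term_counts)
        df = {}
        for tc in term_counts:
            for t in tc:
                df[t] = df.get(t, 0) + 1
        # Smoothed IDF, as in Spark's IDF: log((n + 1) / (df + 1))
        return IdfModel({t: math.log((n + 1.0) / (d + 1.0)) for t, d in df.items()})

class IdfModel:
    def __init__(self, idf):
        self.idf = idf
    def transform(self, term_counts):
        return [{t: c * self.idf[t] for t, c in tc.items()} for tc in term_counts]

docs = [["spark", "is", "fast"], ["spark", "mllib"]]
tf = TermCounter().transform(docs)      # no fitting needed
tf_idf = Idf().fit(tf).transform(tf)    # fit first, then transform
```

A term that occurs in every document ("spark") gets IDF weight zero, while a rarer term ("mllib") keeps a positive weight — exactly the aggregated information a stateless transformer could not compute.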
spark git commit: [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test
Repository: spark Updated Branches: refs/heads/branch-1.5 5de0ffbd0 - 9740d43d3 [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test added doc examples for python. Author: jose.cambronero jose.cambron...@cloudera.com Closes #8154 from josepablocam/spark_9902. (cherry picked from commit c90c605dc6a876aef3cc204ac15cd65bab9743ad) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9740d43d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9740d43d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9740d43d Branch: refs/heads/branch-1.5 Commit: 9740d43d3b5e1ca64f39515612e937f640eb436e Parents: 5de0ffb Author: jose.cambronero jose.cambron...@cloudera.com Authored: Mon Aug 17 19:09:45 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 19:09:51 2015 -0700 -- docs/mllib-statistics.md | 51 +++ 1 file changed, 47 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9740d43d/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index 80a9d06..6acfc71 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -438,22 +438,65 @@ run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstra and interpret the hypothesis tests. {% highlight scala %} -import org.apache.spark.SparkContext -import org.apache.spark.mllib.stat.Statistics._ +import org.apache.spark.mllib.stat.Statistics val data: RDD[Double] = ... 
// an RDD of sample data // run a KS test for the sample versus a standard normal distribution val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1) println(testResult) // summary of the test including the p-value, test statistic, - // and null hypothesis - // if our p-value indicates significance, we can reject the null hypothesis +// and null hypothesis +// if our p-value indicates significance, we can reject the null hypothesis // perform a KS test using a cumulative distribution function of our making val myCDF: Double => Double = ... val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF) {% endhighlight %} /div + +div data-lang=java markdown=1 +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to +run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run +and interpret the hypothesis tests. + +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaDoubleRDD; +import org.apache.spark.api.java.JavaSparkContext; + +import org.apache.spark.mllib.stat.Statistics; +import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult; + +JavaSparkContext jsc = ... +JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...)); +KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0); +// summary of the test including the p-value, test statistic, +// and null hypothesis +// if our p-value indicates significance, we can reject the null hypothesis +System.out.println(testResult); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to +run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run +and interpret the hypothesis tests. + +{% highlight python %} +from pyspark.mllib.stat import Statistics + +parallelData = sc.parallelize([1.0, 2.0, ...
]) + +# run a KS test for the sample versus a standard normal distribution +testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1) +print(testResult) # summary of the test including the p-value, test statistic, + # and null hypothesis + # if our p-value indicates significance, we can reject the null hypothesis +# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with +# a lambda to calculate the CDF is not made available in the Python API +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
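The statistic behind the test in these examples is simple to state: the one-sample KS statistic is the largest absolute gap between the empirical CDF of the sample and the hypothesized CDF. A minimal plain-Python sketch against a standard normal (illustrative only; the real implementation also derives a p-value, which this sketch omits):

```python
import math

def std_normal_cdf(x):
    # CDF of the standard normal distribution, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(samples, cdf):
    # Largest gap between the empirical CDF and the hypothesized CDF.
    # The empirical CDF jumps at each sorted sample, so the supremum is
    # attained just before or just after one of the jumps.
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = cdf(x)
        d = max(d, abs((i + 1) / n - c), abs(i / n - c))
    return d

data = [0.2, -0.5, 1.0, 0.3, -1.2]
d_stat = ks_statistic(data, std_normal_cdf)
```

Data far from the hypothesized distribution drives the statistic toward 1; data drawn from it keeps the statistic small, which is what the p-value then quantifies.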
spark git commit: [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test
Repository: spark Updated Branches: refs/heads/master f9d1a92aa - c90c605dc [SPARK-9902] [MLLIB] Add Java and Python examples to user guide for 1-sample KS test added doc examples for python. Author: jose.cambronero jose.cambron...@cloudera.com Closes #8154 from josepablocam/spark_9902. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c90c605d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c90c605d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c90c605d Branch: refs/heads/master Commit: c90c605dc6a876aef3cc204ac15cd65bab9743ad Parents: f9d1a92 Author: jose.cambronero jose.cambron...@cloudera.com Authored: Mon Aug 17 19:09:45 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 19:09:45 2015 -0700 -- docs/mllib-statistics.md | 51 +++ 1 file changed, 47 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/c90c605d/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index 80a9d06..6acfc71 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -438,22 +438,65 @@ run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstra and interpret the hypothesis tests. {% highlight scala %} -import org.apache.spark.SparkContext -import org.apache.spark.mllib.stat.Statistics._ +import org.apache.spark.mllib.stat.Statistics val data: RDD[Double] = ... 
// an RDD of sample data // run a KS test for the sample versus a standard normal distribution val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1) println(testResult) // summary of the test including the p-value, test statistic, - // and null hypothesis - // if our p-value indicates significance, we can reject the null hypothesis +// and null hypothesis +// if our p-value indicates significance, we can reject the null hypothesis // perform a KS test using a cumulative distribution function of our making val myCDF: Double => Double = ... val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF) {% endhighlight %} /div + +div data-lang=java markdown=1 +[`Statistics`](api/java/org/apache/spark/mllib/stat/Statistics.html) provides methods to +run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run +and interpret the hypothesis tests. + +{% highlight java %} +import java.util.Arrays; + +import org.apache.spark.api.java.JavaDoubleRDD; +import org.apache.spark.api.java.JavaSparkContext; + +import org.apache.spark.mllib.stat.Statistics; +import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult; + +JavaSparkContext jsc = ... +JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(0.2, 1.0, ...)); +KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0); +// summary of the test including the p-value, test statistic, +// and null hypothesis +// if our p-value indicates significance, we can reject the null hypothesis +System.out.println(testResult); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`Statistics`](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) provides methods to +run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run +and interpret the hypothesis tests. + +{% highlight python %} +from pyspark.mllib.stat import Statistics + +parallelData = sc.parallelize([1.0, 2.0, ...
]) + +# run a KS test for the sample versus a standard normal distribution +testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1) +print(testResult) # summary of the test including the p-value, test statistic, + # and null hypothesis + # if our p-value indicates significance, we can reject the null hypothesis +# Note that the Scala functionality of calling Statistics.kolmogorovSmirnovTest with +# a lambda to calculate the CDF is not made available in the Python API +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing
Repository: spark Updated Branches: refs/heads/master 772e7c18f - fdaf17f63 [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing mengxr jkbradley Author: Feynman Liang fli...@databricks.com Closes #8255 from feynmanliang/SPARK-10068. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fdaf17f6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fdaf17f6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fdaf17f6 Branch: refs/heads/master Commit: fdaf17f63f751f02623414fbc7d0a2f545364050 Parents: 772e7c1 Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 15:42:14 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 15:42:14 2015 -0700 -- docs/mllib-guide.md | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fdaf17f6/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index eea864e..e8000ff 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -23,19 +23,19 @@ This lists functionality included in `spark.mllib`, the main MLlib API. 
* [Data types](mllib-data-types.html) * [Basic statistics](mllib-statistics.html) - * summary statistics - * correlations - * stratified sampling - * hypothesis testing - * random data generation + * [summary statistics](mllib-statistics.html#summary-statistics) + * [correlations](mllib-statistics.html#correlations) + * [stratified sampling](mllib-statistics.html#stratified-sampling) + * [hypothesis testing](mllib-statistics.html#hypothesis-testing) + * [random data generation](mllib-statistics.html#random-data-generation) * [Classification and regression](mllib-classification-regression.html) * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html) * [naive Bayes](mllib-naive-bayes.html) * [decision trees](mllib-decision-tree.html) - * [ensembles of trees](mllib-ensembles.html) (Random Forests and Gradient-Boosted Trees) + * [ensembles of trees (Random Forests and Gradient-Boosted Trees)](mllib-ensembles.html) * [isotonic regression](mllib-isotonic-regression.html) * [Collaborative filtering](mllib-collaborative-filtering.html) - * alternating least squares (ALS) + * [alternating least squares (ALS)](mllib-collaborative-filtering.html#collaborative-filtering) * [Clustering](mllib-clustering.html) * [k-means](mllib-clustering.html#k-means) * [Gaussian mixture](mllib-clustering.html#gaussian-mixture) @@ -43,19 +43,19 @@ This lists functionality included in `spark.mllib`, the main MLlib API. 
* [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda) * [streaming k-means](mllib-clustering.html#streaming-k-means) * [Dimensionality reduction](mllib-dimensionality-reduction.html) - * singular value decomposition (SVD) - * principal component analysis (PCA) + * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd) + * [principal component analysis (PCA)](mllib-dimensionality-reduction.html#principal-component-analysis-pca) * [Feature extraction and transformation](mllib-feature-extraction.html) * [Frequent pattern mining](mllib-frequent-pattern-mining.html) - * FP-growth + * [FP-growth](mllib-frequent-pattern-mining.html#fp-growth) * [Evaluation Metrics](mllib-evaluation-metrics.html) * [Optimization (developer)](mllib-optimization.html) - * stochastic gradient descent - * limited-memory BFGS (L-BFGS) + * [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd) + * [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs) * [PMML model export](mllib-pmml-model-export.html) MLlib is under active development. -The APIs marked `Experimental`/`DeveloperApi` may change in future releases, +The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. # spark.ml: high-level APIs for ML pipelines - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing
Repository: spark Updated Branches: refs/heads/branch-1.5 f77eaaf34 - bb3bb2a48 [SPARK-10068] [MLLIB] Adds links to MLlib types, algos, utilities listing mengxr jkbradley Author: Feynman Liang fli...@databricks.com Closes #8255 from feynmanliang/SPARK-10068. (cherry picked from commit fdaf17f63f751f02623414fbc7d0a2f545364050) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bb3bb2a4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bb3bb2a4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bb3bb2a4 Branch: refs/heads/branch-1.5 Commit: bb3bb2a48ee32a5de4637a73dd11930c72f9c77e Parents: f77eaaf Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 15:42:14 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 15:42:21 2015 -0700 -- docs/mllib-guide.md | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bb3bb2a4/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index eea864e..e8000ff 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -23,19 +23,19 @@ This lists functionality included in `spark.mllib`, the main MLlib API. 
* [Data types](mllib-data-types.html) * [Basic statistics](mllib-statistics.html) - * summary statistics - * correlations - * stratified sampling - * hypothesis testing - * random data generation + * [summary statistics](mllib-statistics.html#summary-statistics) + * [correlations](mllib-statistics.html#correlations) + * [stratified sampling](mllib-statistics.html#stratified-sampling) + * [hypothesis testing](mllib-statistics.html#hypothesis-testing) + * [random data generation](mllib-statistics.html#random-data-generation) * [Classification and regression](mllib-classification-regression.html) * [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html) * [naive Bayes](mllib-naive-bayes.html) * [decision trees](mllib-decision-tree.html) - * [ensembles of trees](mllib-ensembles.html) (Random Forests and Gradient-Boosted Trees) + * [ensembles of trees (Random Forests and Gradient-Boosted Trees)](mllib-ensembles.html) * [isotonic regression](mllib-isotonic-regression.html) * [Collaborative filtering](mllib-collaborative-filtering.html) - * alternating least squares (ALS) + * [alternating least squares (ALS)](mllib-collaborative-filtering.html#collaborative-filtering) * [Clustering](mllib-clustering.html) * [k-means](mllib-clustering.html#k-means) * [Gaussian mixture](mllib-clustering.html#gaussian-mixture) @@ -43,19 +43,19 @@ This lists functionality included in `spark.mllib`, the main MLlib API. 
* [latent Dirichlet allocation (LDA)](mllib-clustering.html#latent-dirichlet-allocation-lda) * [streaming k-means](mllib-clustering.html#streaming-k-means) * [Dimensionality reduction](mllib-dimensionality-reduction.html) - * singular value decomposition (SVD) - * principal component analysis (PCA) + * [singular value decomposition (SVD)](mllib-dimensionality-reduction.html#singular-value-decomposition-svd) + * [principal component analysis (PCA)](mllib-dimensionality-reduction.html#principal-component-analysis-pca) * [Feature extraction and transformation](mllib-feature-extraction.html) * [Frequent pattern mining](mllib-frequent-pattern-mining.html) - * FP-growth + * [FP-growth](mllib-frequent-pattern-mining.html#fp-growth) * [Evaluation Metrics](mllib-evaluation-metrics.html) * [Optimization (developer)](mllib-optimization.html) - * stochastic gradient descent - * limited-memory BFGS (L-BFGS) + * [stochastic gradient descent](mllib-optimization.html#stochastic-gradient-descent-sgd) + * [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs) * [PMML model export](mllib-pmml-model-export.html) MLlib is under active development. -The APIs marked `Experimental`/`DeveloperApi` may change in future releases, +The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. # spark.ml: high-level APIs for ML pipelines - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-7707] User guide and example code for KernelDensity
Repository: spark Updated Branches: refs/heads/master 0b6b01761 -> f9d1a92aa [SPARK-7707] User guide and example code for KernelDensity Author: Sandy Ryza sa...@cloudera.com Closes #8230 from sryza/sandy-spark-7707. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f9d1a92a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f9d1a92a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f9d1a92a Branch: refs/heads/master Commit: f9d1a92aa1bac4494022d78559b871149579e6e8 Parents: 0b6b017 Author: Sandy Ryza sa...@cloudera.com Authored: Mon Aug 17 17:57:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:57:51 2015 -0700 -- docs/mllib-statistics.md | 77 +++ 1 file changed, 77 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f9d1a92a/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index be04d0b..80a9d06 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -528,5 +528,82 @@ u = RandomRDDs.uniformRDD(sc, 100L, 10) v = u.map(lambda x: 1.0 + 2.0 * x) {% endhighlight %} /div +/div + +## Kernel density estimation + +[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique +useful for visualizing empirical probability distributions without requiring assumptions about the +particular distribution that the observed samples are drawn from. It computes an estimate of the +probability density function of a random variable, evaluated at a given set of points. It achieves +this estimate by expressing the PDF of the empirical distribution at a particular point as the +mean of PDFs of normal distributions centered around each of the samples. + +div class=codetabs + +div data-lang=scala markdown=1 +[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples.
The following example demonstrates how +to do so. + +{% highlight scala %} +import org.apache.spark.mllib.stat.KernelDensity +import org.apache.spark.rdd.RDD + +val data: RDD[Double] = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +val kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0) + +// Find density estimates for the given values +val densities = kd.estimate(Array(-1.0, 2.0, 5.0)) +{% endhighlight %} +/div + +div data-lang=java markdown=1 +[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight java %} +import org.apache.spark.mllib.stat.KernelDensity; +import org.apache.spark.rdd.RDD; + +RDD<Double> data = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +KernelDensity kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0); + +// Find density estimates for the given values +double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0}); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight python %} +from pyspark.mllib.stat import KernelDensity + +data = ...
# an RDD of sample data + +# Construct the density estimator with the sample data and a standard deviation for the Gaussian +# kernels +kd = KernelDensity() +kd.setSample(data) +kd.setBandwidth(3.0) + +# Find density estimates for the given values +densities = kd.estimate([-1.0, 2.0, 5.0]) +{% endhighlight %} +/div /div - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-7707] User guide and example code for KernelDensity
Repository: spark Updated Branches: refs/heads/branch-1.4 4fc3b8cd2 -> f7f2ac69d [SPARK-7707] User guide and example code for KernelDensity Author: Sandy Ryza sa...@cloudera.com Closes #8230 from sryza/sandy-spark-7707. (cherry picked from commit f9d1a92aa1bac4494022d78559b871149579e6e8) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f7f2ac69 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f7f2ac69 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f7f2ac69 Branch: refs/heads/branch-1.4 Commit: f7f2ac69d7298a7eb4a89e94d1efddd97e036a2e Parents: 4fc3b8c Author: Sandy Ryza sa...@cloudera.com Authored: Mon Aug 17 17:57:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:58:43 2015 -0700 -- docs/mllib-statistics.md | 77 +++ 1 file changed, 77 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f7f2ac69/docs/mllib-statistics.md -- diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index 887eae7..6b1b860 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -493,5 +493,82 @@ u = RandomRDDs.uniformRDD(sc, 100L, 10) v = u.map(lambda x: 1.0 + 2.0 * x) {% endhighlight %} /div +/div + +## Kernel density estimation + +[Kernel density estimation](https://en.wikipedia.org/wiki/Kernel_density_estimation) is a technique +useful for visualizing empirical probability distributions without requiring assumptions about the +particular distribution that the observed samples are drawn from. It computes an estimate of the +probability density function of a random variable, evaluated at a given set of points. It achieves +this estimate by expressing the PDF of the empirical distribution at a particular point as the +mean of PDFs of normal distributions centered around each of the samples.
+ +div class=codetabs + +div data-lang=scala markdown=1 +[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight scala %} +import org.apache.spark.mllib.stat.KernelDensity +import org.apache.spark.rdd.RDD + +val data: RDD[Double] = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +val kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0) + +// Find density estimates for the given values +val densities = kd.estimate(Array(-1.0, 2.0, 5.0)) +{% endhighlight %} +/div + +div data-lang=java markdown=1 +[`KernelDensity`](api/java/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight java %} +import org.apache.spark.mllib.stat.KernelDensity; +import org.apache.spark.rdd.RDD; + +RDD<Double> data = ... // an RDD of sample data + +// Construct the density estimator with the sample data and a standard deviation for the Gaussian +// kernels +KernelDensity kd = new KernelDensity() + .setSample(data) + .setBandwidth(3.0); + +// Find density estimates for the given values +double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0}); +{% endhighlight %} +/div + +div data-lang=python markdown=1 +[`KernelDensity`](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) provides methods +to compute kernel density estimates from an RDD of samples. The following example demonstrates how +to do so. + +{% highlight python %} +from pyspark.mllib.stat import KernelDensity + +data = ...
# an RDD of sample data + +# Construct the density estimator with the sample data and a standard deviation for the Gaussian +# kernels +kd = KernelDensity() +kd.setSample(data) +kd.setBandwidth(3.0) + +# Find density estimates for the given values +densities = kd.estimate([-1.0, 2.0, 5.0]) +{% endhighlight %} +/div /div
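The guide text above describes the estimate as the mean of Gaussian PDFs centered at each observed sample, with the bandwidth as their standard deviation. A minimal, self-contained NumPy sketch of that computation follows; the helper name `kernel_density_estimate` is ours for illustration and is not part of the MLlib API.

```python
import numpy as np

def kernel_density_estimate(sample, bandwidth, points):
    """Estimate the PDF at each of `points` as the mean of Gaussian
    PDFs centered at each observed sample (std. dev. = bandwidth)."""
    sample = np.asarray(sample, dtype=float)
    points = np.asarray(points, dtype=float)
    # diffs[i, j] = distance from evaluation point i to sample j
    diffs = points[:, None] - sample[None, :]
    norm = 1.0 / (bandwidth * np.sqrt(2.0 * np.pi))
    pdfs = norm * np.exp(-0.5 * (diffs / bandwidth) ** 2)
    return pdfs.mean(axis=1)

# With a single sample at 1.0 and bandwidth 1.0, the estimate at 1.0 is
# the peak of a standard normal density, 1/sqrt(2*pi) (about 0.3989).
print(kernel_density_estimate([1.0], 1.0, [1.0])[0])
```

This mirrors what `KernelDensity.estimate` returns for the same sample, bandwidth, and evaluation points, computed locally instead of over an RDD.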
spark git commit: [SPARK-7808] [ML] add package doc for ml.feature
Repository: spark Updated Branches: refs/heads/master ee093c8b9 - e290029a3 [SPARK-7808] [ML] add package doc for ml.feature This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng m...@databricks.com Closes #8260 from mengxr/SPARK-7808. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e290029a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e290029a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e290029a Branch: refs/heads/master Commit: e290029a356222bddf4da1be0525a221a5a1630b Parents: ee093c8 Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 17 19:40:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 19:40:51 2015 -0700 -- .../org/apache/spark/ml/feature/package.scala | 89 1 file changed, 89 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e290029a/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala new file mode 100644 index 000..4571ab2 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/package.scala @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml + +import org.apache.spark.ml.feature.{HashingTF, IDF, IDFModel, VectorAssembler} +import org.apache.spark.sql.DataFrame + +/** + * == Feature transformers == + * + * The `ml.feature` package provides common feature transformers that help convert raw data or + * features into more suitable forms for model fitting. + * Most feature transformers are implemented as [[Transformer]]s, which transform one [[DataFrame]] + * into another, e.g., [[HashingTF]]. + * Some feature transformers are implemented as [[Estimator]]s, because the transformation requires + * some aggregated information of the dataset, e.g., document frequencies in [[IDF]]. + * For those feature transformers, calling [[Estimator!.fit]] is required to obtain the model first, + * e.g., [[IDFModel]], in order to apply transformation. + * The transformation is usually done by appending new columns to the input [[DataFrame]], so all + * input columns are carried over. + * + * We try to make each transformer minimal, so it becomes flexible to assemble feature + * transformation pipelines. + * [[Pipeline]] can be used to chain feature transformers, and [[VectorAssembler]] can be used to + * combine multiple feature transformations, for example: + * + * {{{ + * import org.apache.spark.ml.feature._ + * import org.apache.spark.ml.Pipeline + * + * // a DataFrame with three columns: id (integer), text (string), and rating (double). 
+ * val df = sqlContext.createDataFrame(Seq( + * (0, Hi I heard about Spark, 3.0), + * (1, I wish Java could use case classes, 4.0), + * (2, Logistic regression models are neat, 4.0) + * )).toDF(id, text, rating) + * + * // define feature transformers + * val tok = new RegexTokenizer() + * .setInputCol(text) + * .setOutputCol(words) + * val sw = new StopWordsRemover() + * .setInputCol(words) + * .setOutputCol(filtered_words) + * val tf = new HashingTF() + * .setInputCol(filtered_words) + * .setOutputCol(tf) + * .setNumFeatures(1) + * val idf = new IDF() + * .setInputCol(tf) + * .setOutputCol(tf_idf) + * val assembler = new VectorAssembler() + * .setInputCols(Array(tf_idf, rating)) + * .setOutputCol(features) + * + * // assemble and fit the feature transformation pipeline + * val pipeline = new Pipeline() + * .setStages(Array(tok, sw, tf, idf, assembler)) + * val model = pipeline.fit(df) + * + * // save transformed features with raw data + * model.transform(df) + * .select(id, text, rating, features) + * .write.format
spark git commit: [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg
Repository: spark Updated Branches: refs/heads/branch-1.5 bb3bb2a48 - 0f1417b6f [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.Samavihome Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.local Closes #7729 from sabhyankar/branch_8920. (cherry picked from commit 088b11ec5949e135cb3db2a1ce136837e046c288) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f1417b6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f1417b6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f1417b6 Branch: refs/heads/branch-1.5 Commit: 0f1417b6f31e53dd78aae2a0a661d9ba32dce5b7 Parents: bb3bb2a Author: Sameer Abhyankar sabhyankar@sabhyankar-MBP.Samavihome Authored: Mon Aug 17 16:00:23 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 16:00:31 2015 -0700 -- .../apache/spark/mllib/linalg/Matrices.scala| 63 .../linalg/SingularValueDecomposition.scala | 1 + .../org/apache/spark/mllib/linalg/Vectors.scala | 60 +++ .../mllib/linalg/distributed/BlockMatrix.scala | 43 +++-- .../linalg/distributed/CoordinateMatrix.scala | 28 +++-- .../linalg/distributed/DistributedMatrix.scala | 1 + .../linalg/distributed/IndexedRowMatrix.scala | 24 +++- .../mllib/linalg/distributed/RowMatrix.scala| 24 +++- 8 files changed, 227 insertions(+), 17 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f1417b6/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala index 1139ce3..dfa8910 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala @@ -227,6 +227,7 @@ private[spark] class MatrixUDT extends UserDefinedType[Matrix] { * @param values 
matrix entries in column major if not transposed or in row major otherwise * @param isTransposed whether the matrix is transposed. If true, `values` stores the matrix in * row major. + * @since 1.0.0 */ @SQLUserDefinedType(udt = classOf[MatrixUDT]) class DenseMatrix( @@ -252,6 +253,7 @@ class DenseMatrix( * @param numRows number of rows * @param numCols number of columns * @param values matrix entries in column major + * @since 1.3.0 */ def this(numRows: Int, numCols: Int, values: Array[Double]) = this(numRows, numCols, values, false) @@ -276,6 +278,9 @@ class DenseMatrix( private[mllib] def apply(i: Int): Double = values(i) + /** + * @since 1.3.0 + */ override def apply(i: Int, j: Int): Double = values(index(i, j)) private[mllib] def index(i: Int, j: Int): Int = { @@ -286,6 +291,9 @@ class DenseMatrix( values(index(i, j)) = v } + /** + * @since 1.4.0 + */ override def copy: DenseMatrix = new DenseMatrix(numRows, numCols, values.clone()) private[spark] def map(f: Double = Double) = new DenseMatrix(numRows, numCols, values.map(f), @@ -301,6 +309,9 @@ class DenseMatrix( this } + /** + * @since 1.3.0 + */ override def transpose: DenseMatrix = new DenseMatrix(numCols, numRows, values, !isTransposed) private[spark] override def foreachActive(f: (Int, Int, Double) = Unit): Unit = { @@ -331,13 +342,20 @@ class DenseMatrix( } } + /** + * @since 1.5.0 + */ override def numNonzeros: Int = values.count(_ != 0) + /** + * @since 1.5.0 + */ override def numActives: Int = values.length /** * Generate a `SparseMatrix` from the given `DenseMatrix`. The new matrix will have isTransposed * set to false. + * @since 1.3.0 */ def toSparse: SparseMatrix = { val spVals: MArrayBuilder[Double] = new MArrayBuilder.ofDouble @@ -365,6 +383,7 @@ class DenseMatrix( /** * Factory methods for [[org.apache.spark.mllib.linalg.DenseMatrix]]. 
+ * @since 1.3.0 */ object DenseMatrix { @@ -373,6 +392,7 @@ object DenseMatrix { * @param numRows number of rows of the matrix * @param numCols number of columns of the matrix * @return `DenseMatrix` with size `numRows` x `numCols` and values of zeros + * @since 1.3.0 */ def zeros(numRows: Int, numCols: Int): DenseMatrix = { require(numRows.toLong * numCols = Int.MaxValue, @@ -385,6 +405,7 @@ object DenseMatrix { * @param numRows number of rows of the matrix * @param numCols number of columns of the matrix * @return `DenseMatrix` with size `numRows` x `numCols` and values of ones + * @since 1.3.0 */ def ones
spark git commit: [SPARK-9898] [MLLIB] Prefix Span user guide
Repository: spark Updated Branches: refs/heads/branch-1.5 f5ed9ede9 - 18b3d11f7 [SPARK-9898] [MLLIB] Prefix Span user guide Adds user guide for `PrefixSpan`, including Scala and Java example code. mengxr zhangjiajin Author: Feynman Liang fli...@databricks.com Closes #8253 from feynmanliang/SPARK-9898. (cherry picked from commit 0b6b01761370629ce387c143a25d41f3a334ff28) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/18b3d11f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/18b3d11f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/18b3d11f Branch: refs/heads/branch-1.5 Commit: 18b3d11f787c48b429ffdef0075d398d7a0ab1a1 Parents: f5ed9ed Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 17:53:24 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:53:31 2015 -0700 -- docs/mllib-frequent-pattern-mining.md | 96 ++ docs/mllib-guide.md | 1 + 2 files changed, 97 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/18b3d11f/docs/mllib-frequent-pattern-mining.md -- diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index bcc066a..8ea4389 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -96,3 +96,99 @@ for (FPGrowth.FreqItemsetString itemset: model.freqItemsets().toJavaRDD().coll /div /div + +## PrefixSpan + +PrefixSpan is a sequential pattern mining algorithm described in +[Pei et al., Mining Sequential Patterns by Pattern-Growth: The +PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer +the reader to the referenced paper for formalizing the sequential +pattern mining problem. + +MLlib's PrefixSpan implementation takes the following parameters: + +* `minSupport`: the minimum support required to be considered a frequent + sequential pattern. 
+* `maxPatternLength`: the maximum length of a frequent sequential + pattern. Any frequent pattern exceeding this length will not be + included in the results. +* `maxLocalProjDBSize`: the maximum number of items allowed in a + prefix-projected database before local iterative processing of the + projected database begins. This parameter should be tuned with respect + to the size of your executors. + +**Examples** + +The following example illustrates PrefixSpan running on the sequences +(using the same notation as Pei et al.): + +~~~ + <(12)3> + <1(32)(12)> + <(12)5> + <6> +~~~ + +div class=codetabs +div data-lang=scala markdown=1 + +[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the +PrefixSpan algorithm. +Calling `PrefixSpan.run` returns a +[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) +that stores the frequent sequences with their frequencies. + +{% highlight scala %} +import org.apache.spark.mllib.fpm.PrefixSpan + +val sequences = sc.parallelize(Seq( +Array(Array(1, 2), Array(3)), +Array(Array(1), Array(3, 2), Array(1, 2)), +Array(Array(1, 2), Array(5)), +Array(Array(6)) + ), 2).cache() +val prefixSpan = new PrefixSpan() + .setMinSupport(0.5) + .setMaxPatternLength(5) +val model = prefixSpan.run(sequences) +model.freqSequences.collect().foreach { freqSequence => +println( + freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + freqSequence.freq) +} +{% endhighlight %} + +/div + +div data-lang=java markdown=1 + +[`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the +PrefixSpan algorithm. +Calling `PrefixSpan.run` returns a +[`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html) +that stores the frequent sequences with their frequencies.
+ +{% highlight java %} +import java.util.Arrays; +import java.util.List; + +import org.apache.spark.mllib.fpm.PrefixSpan; +import org.apache.spark.mllib.fpm.PrefixSpanModel; + +JavaRDD<List<List<Integer>>> sequences = sc.parallelize(Arrays.asList( + Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)), + Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)), + Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)), + Arrays.asList(Arrays.asList(6)) +), 2); +PrefixSpan prefixSpan = new PrefixSpan() + .setMinSupport(0.5) + .setMaxPatternLength(5); +PrefixSpanModel<Integer> model = prefixSpan.run(sequences); +for (PrefixSpan.FreqSequence<Integer> freqSeq: model.freqSequences().toJavaRDD().collect()) { + System.out.println(freqSeq.javaSequence() + ", " + freqSeq.freq()); +} +{% endhighlight %} + +/div +/div + http://git-wip-us.apache.org/repos/asf/spark/blob/18b3d11f/docs/mllib-guide.md
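To make the `minSupport` parameter concrete, the following self-contained Python sketch counts the support of a candidate sequential pattern over the four example sequences from the guide text. This is only an illustration of what "frequent" means (a pattern is a list of itemsets that must match ever-later itemsets of a sequence, in order); it is not the PrefixSpan algorithm itself, which avoids enumerating candidates by recursively projecting the database on frequent prefixes.

```python
def contains(sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs in `sequence`:
    each pattern itemset must be a subset of some itemset of the
    sequence, with matches appearing in increasing positions."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and set(pattern[i]) <= set(itemset):
            i += 1  # greedy matching is safe for subsequence containment
    return i == len(pattern)

# The example database <(12)3>, <1(32)(12)>, <(12)5>, <6>
sequences = [
    [{1, 2}, {3}],
    [{1}, {3, 2}, {1, 2}],
    [{1, 2}, {5}],
    [{6}],
]

def support(pattern):
    """Fraction of sequences containing the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

print(support([{1, 2}]))    # itemset (12) occurs in 3 of 4 sequences -> 0.75
print(support([{1}, {3}]))  # 1 followed by 3 occurs in 2 of 4 -> 0.5
```

With `minSupport = 0.5` as in the examples above, both of these patterns would be reported as frequent sequential patterns.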
spark git commit: [SPARK-9898] [MLLIB] Prefix Span user guide
Repository: spark Updated Branches: refs/heads/master 18523c130 - 0b6b01761 [SPARK-9898] [MLLIB] Prefix Span user guide Adds user guide for `PrefixSpan`, including Scala and Java example code. mengxr zhangjiajin Author: Feynman Liang fli...@databricks.com Closes #8253 from feynmanliang/SPARK-9898. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0b6b0176 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0b6b0176 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0b6b0176 Branch: refs/heads/master Commit: 0b6b01761370629ce387c143a25d41f3a334ff28 Parents: 18523c1 Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 17:53:24 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 17:53:24 2015 -0700 -- docs/mllib-frequent-pattern-mining.md | 96 ++ docs/mllib-guide.md | 1 + 2 files changed, 97 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0b6b0176/docs/mllib-frequent-pattern-mining.md -- diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index bcc066a..8ea4389 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -96,3 +96,99 @@ for (FPGrowth.FreqItemsetString itemset: model.freqItemsets().toJavaRDD().coll /div /div + +## PrefixSpan + +PrefixSpan is a sequential pattern mining algorithm described in +[Pei et al., Mining Sequential Patterns by Pattern-Growth: The +PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer +the reader to the referenced paper for formalizing the sequential +pattern mining problem. + +MLlib's PrefixSpan implementation takes the following parameters: + +* `minSupport`: the minimum support required to be considered a frequent + sequential pattern. +* `maxPatternLength`: the maximum length of a frequent sequential + pattern. Any frequent pattern exceeding this length will not be + included in the results. 
+* `maxLocalProjDBSize`: the maximum number of items allowed in a + prefix-projected database before local iterative processing of the + projected database begins. This parameter should be tuned with respect + to the size of your executors. + +**Examples** + +The following example illustrates PrefixSpan running on the sequences +(using the same notation as Pei et al.): + +~~~ + <(12)3> + <1(32)(12)> + <(12)5> + <6> +~~~ + +div class=codetabs +div data-lang=scala markdown=1 + +[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the +PrefixSpan algorithm. +Calling `PrefixSpan.run` returns a +[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) +that stores the frequent sequences with their frequencies. + +{% highlight scala %} +import org.apache.spark.mllib.fpm.PrefixSpan + +val sequences = sc.parallelize(Seq( +Array(Array(1, 2), Array(3)), +Array(Array(1), Array(3, 2), Array(1, 2)), +Array(Array(1, 2), Array(5)), +Array(Array(6)) + ), 2).cache() +val prefixSpan = new PrefixSpan() + .setMinSupport(0.5) + .setMaxPatternLength(5) +val model = prefixSpan.run(sequences) +model.freqSequences.collect().foreach { freqSequence => +println( + freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + freqSequence.freq) +} +{% endhighlight %} + +/div + +div data-lang=java markdown=1 + +[`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the +PrefixSpan algorithm. +Calling `PrefixSpan.run` returns a +[`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html) +that stores the frequent sequences with their frequencies.
+ +{% highlight java %} +import java.util.Arrays; +import java.util.List; + +import org.apache.spark.mllib.fpm.PrefixSpan; +import org.apache.spark.mllib.fpm.PrefixSpanModel; + +JavaRDD<List<List<Integer>>> sequences = sc.parallelize(Arrays.asList( + Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)), + Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)), + Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)), + Arrays.asList(Arrays.asList(6)) +), 2); +PrefixSpan prefixSpan = new PrefixSpan() + .setMinSupport(0.5) + .setMaxPatternLength(5); +PrefixSpanModel<Integer> model = prefixSpan.run(sequences); +for (PrefixSpan.FreqSequence<Integer> freqSeq: model.freqSequences().toJavaRDD().collect()) { + System.out.println(freqSeq.javaSequence() + ", " + freqSeq.freq()); +} +{% endhighlight %} + +/div +/div + http://git-wip-us.apache.org/repos/asf/spark/blob/0b6b0176/docs/mllib-guide.md -- diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index e8000ff
spark git commit: [SPARK-9959] [MLLIB] Association Rules Java Compatibility
Repository: spark Updated Branches: refs/heads/master 3ff81ad2d - f7efda397 [SPARK-9959] [MLLIB] Association Rules Java Compatibility mengxr Author: Feynman Liang fli...@databricks.com Closes #8206 from feynmanliang/SPARK-9959-arules-java. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f7efda39 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f7efda39 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f7efda39 Branch: refs/heads/master Commit: f7efda3975d46a8ce4fd720b3730127ea482560b Parents: 3ff81ad Author: Feynman Liang fli...@databricks.com Authored: Mon Aug 17 09:58:34 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 17 09:58:34 2015 -0700 -- .../spark/mllib/fpm/AssociationRules.scala | 30 ++-- 1 file changed, 28 insertions(+), 2 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/f7efda39/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala index 72d0ea0..7f4de77 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/AssociationRules.scala @@ -16,6 +16,7 @@ */ package org.apache.spark.mllib.fpm +import scala.collection.JavaConverters._ import scala.reflect.ClassTag import org.apache.spark.Logging @@ -95,8 +96,10 @@ object AssociationRules { * :: Experimental :: * * An association rule between sets of items. - * @param antecedent hypotheses of the rule - * @param consequent conclusion of the rule + * @param antecedent hypotheses of the rule. Java users should call [[Rule#javaAntecedent]] + * instead. + * @param consequent conclusion of the rule. Java users should call [[Rule#javaConsequent]] + * instead. 
* @tparam Item item type * * @since 1.5.0 */ @@ -108,6 +111,11 @@ object AssociationRules { freqUnion: Double, freqAntecedent: Double) extends Serializable { +/** + * Returns the confidence of the rule. + * + * @since 1.5.0 + */ def confidence: Double = freqUnion.toDouble / freqAntecedent require(antecedent.toSet.intersect(consequent.toSet).isEmpty, { @@ -115,5 +123,23 @@ object AssociationRules { sA valid association rule must have disjoint antecedent and + sconsequent but ${sharedItems} is present in both. }) + +/** + * Returns antecedent in a Java List. + * + * @since 1.5.0 + */ +def javaAntecedent: java.util.List[Item] = { + antecedent.toList.asJava +} + +/** + * Returns consequent in a Java List. + * + * @since 1.5.0 + */ +def javaConsequent: java.util.List[Item] = { + consequent.toList.asJava +} } }
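The `confidence` definition visible in the diff (`freqUnion.toDouble / freqAntecedent`) is easy to check standalone. The helper below is an illustration of that formula only, not part of the Spark API; the counts are hypothetical.

```python
def rule_confidence(freq_union, freq_antecedent):
    """Confidence of the rule antecedent => consequent: the fraction of
    transactions containing the antecedent that also contain the union
    of antecedent and consequent."""
    return freq_union / freq_antecedent

# If the antecedent {a} appears in 4 transactions and {a, b} in 3,
# then confidence({a} => {b}) = 3/4 = 0.75
print(rule_confidence(3, 4))
```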
spark git commit: [SPARK-9828] [PYSPARK] Mutable values should not be default arguments
Repository: spark Updated Branches: refs/heads/master ece00566e - ffa05c84f [SPARK-9828] [PYSPARK] Mutable values should not be default arguments Author: MechCoder manojkumarsivaraj...@gmail.com Closes #8110 from MechCoder/spark-9828. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ffa05c84 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ffa05c84 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ffa05c84 Branch: refs/heads/master Commit: ffa05c84fe75663fc33f3d954d1cb1e084ab3280 Parents: ece0056 Author: MechCoder manojkumarsivaraj...@gmail.com Authored: Fri Aug 14 12:46:05 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 14 12:46:05 2015 -0700 -- python/pyspark/ml/evaluation.py | 4 +++- python/pyspark/ml/param/__init__.py | 26 +- python/pyspark/ml/pipeline.py | 4 ++-- python/pyspark/ml/tuning.py | 8 ++-- python/pyspark/rdd.py | 5 - python/pyspark/sql/readwriter.py| 8 ++-- python/pyspark/statcounter.py | 4 +++- python/pyspark/streaming/kafka.py | 12 +--- 8 files changed, 50 insertions(+), 21 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ffa05c84/python/pyspark/ml/evaluation.py -- diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 2734092..e23ce05 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -46,7 +46,7 @@ class Evaluator(Params): raise NotImplementedError() -def evaluate(self, dataset, params={}): +def evaluate(self, dataset, params=None): Evaluates the output with optional parameters. 
@@ -56,6 +56,8 @@ class Evaluator(Params): params :return: metric +if params is None: +params = dict() if isinstance(params, dict): if params: return self.copy(params)._evaluate(dataset) http://git-wip-us.apache.org/repos/asf/spark/blob/ffa05c84/python/pyspark/ml/param/__init__.py -- diff --git a/python/pyspark/ml/param/__init__.py b/python/pyspark/ml/param/__init__.py index 7845536..eeeac49 100644 --- a/python/pyspark/ml/param/__init__.py +++ b/python/pyspark/ml/param/__init__.py @@ -60,14 +60,16 @@ class Params(Identifiable): __metaclass__ = ABCMeta -#: internal param map for user-supplied values param map -_paramMap = {} +def __init__(self): +super(Params, self).__init__() +#: internal param map for user-supplied values param map +self._paramMap = {} -#: internal param map for default values -_defaultParamMap = {} +#: internal param map for default values +self._defaultParamMap = {} -#: value returned by :py:func:`params` -_params = None +#: value returned by :py:func:`params` +self._params = None @property def params(self): @@ -155,7 +157,7 @@ class Params(Identifiable): else: return self._defaultParamMap[param] -def extractParamMap(self, extra={}): +def extractParamMap(self, extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into @@ -165,12 +167,14 @@ class Params(Identifiable): :param extra: extra param values :return: merged param map +if extra is None: +extra = dict() paramMap = self._defaultParamMap.copy() paramMap.update(self._paramMap) paramMap.update(extra) return paramMap -def copy(self, extra={}): +def copy(self, extra=None): Creates a copy of this instance with the same uid and some extra params. 
The default implementation creates a @@ -181,6 +185,8 @@ class Params(Identifiable): :param extra: Extra parameters to copy to the new instance :return: Copy of this instance +if extra is None: +extra = dict() that = copy.copy(self) that._paramMap = self.extractParamMap(extra) return that @@ -233,7 +239,7 @@ class Params(Identifiable): self._defaultParamMap[getattr(self, param)] = value return self -def _copyValues(self, to, extra={}): +def _copyValues(self, to, extra=None): Copies param values from this instance to another instance for params shared by them. @@ -241,6 +247,8 @@ class Params(Identifiable): :param extra: extra params to be copied :return: the target
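The patch above replaces every `params={}` default with `params=None` plus an in-body `params = dict()`. The reason is a standard Python pitfall: a default argument expression is evaluated once, at function definition time, so a mutable default is shared across all calls that omit the argument. A self-contained sketch (the `evaluate_bad`/`evaluate_good` names are ours, loosely mirroring the `evaluate(self, dataset, params=None)` signature in the diff):

```python
def evaluate_bad(dataset, params={}):
    # The SAME dict object is reused on every call that omits `params`,
    # so state silently leaks between calls.
    params.setdefault("calls", 0)
    params["calls"] += 1
    return params["calls"]

print(evaluate_bad("d1"))  # 1
print(evaluate_bad("d2"))  # 2 -- surprising: the counter survived the first call

def evaluate_good(dataset, params=None):
    # A fresh dict is created on every call that omits `params`.
    if params is None:
        params = {}
    params.setdefault("calls", 0)
    params["calls"] += 1
    return params["calls"]

print(evaluate_good("d1"))  # 1
print(evaluate_good("d2"))  # 1
```

This is why the fix touches default values only and leaves the rest of each function's behavior unchanged.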
spark git commit: [SPARK-9828] [PYSPARK] Mutable values should not be default arguments
Repository: spark Updated Branches: refs/heads/branch-1.4 db71ea482 - 969e8b31b [SPARK-9828] [PYSPARK] Mutable values should not be default arguments Author: MechCoder manojkumarsivaraj...@gmail.com Closes #8110 from MechCoder/spark-9828. (cherry picked from commit ffa05c84fe75663fc33f3d954d1cb1e084ab3280) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/969e8b31 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/969e8b31 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/969e8b31 Branch: refs/heads/branch-1.4 Commit: 969e8b31b48fe1b26fcc667b46ba97a538b1e382 Parents: db71ea4 Author: MechCoder manojkumarsivaraj...@gmail.com Authored: Fri Aug 14 12:46:05 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 14 12:50:46 2015 -0700 -- python/pyspark/ml/evaluation.py | 4 +++- python/pyspark/ml/param/__init__.py | 26 +- python/pyspark/ml/pipeline.py | 4 ++-- python/pyspark/ml/tuning.py | 8 ++-- python/pyspark/rdd.py | 5 - python/pyspark/sql/readwriter.py| 8 ++-- python/pyspark/statcounter.py | 4 +++- python/pyspark/streaming/kafka.py | 12 +--- 8 files changed, 50 insertions(+), 21 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/969e8b31/python/pyspark/ml/evaluation.py -- diff --git a/python/pyspark/ml/evaluation.py b/python/pyspark/ml/evaluation.py index 595593a..7af447c 100644 --- a/python/pyspark/ml/evaluation.py +++ b/python/pyspark/ml/evaluation.py @@ -45,7 +45,7 @@ class Evaluator(Params): raise NotImplementedError() -def evaluate(self, dataset, params={}): +def evaluate(self, dataset, params=None): Evaluates the output with optional parameters. 
@@ -55,6 +55,8 @@ class Evaluator(Params): params :return: metric +if params is None: +params = dict() if isinstance(params, dict): if params: return self.copy(params)._evaluate(dataset) http://git-wip-us.apache.org/repos/asf/spark/blob/969e8b31/python/pyspark/ml/param/__init__.py -- diff --git a/python/pyspark/ml/param/__init__.py b/python/pyspark/ml/param/__init__.py index 7845536..eeeac49 100644 --- a/python/pyspark/ml/param/__init__.py +++ b/python/pyspark/ml/param/__init__.py @@ -60,14 +60,16 @@ class Params(Identifiable): __metaclass__ = ABCMeta -#: internal param map for user-supplied values param map -_paramMap = {} +def __init__(self): +super(Params, self).__init__() +#: internal param map for user-supplied values param map +self._paramMap = {} -#: internal param map for default values -_defaultParamMap = {} +#: internal param map for default values +self._defaultParamMap = {} -#: value returned by :py:func:`params` -_params = None +#: value returned by :py:func:`params` +self._params = None @property def params(self): @@ -155,7 +157,7 @@ class Params(Identifiable): else: return self._defaultParamMap[param] -def extractParamMap(self, extra={}): +def extractParamMap(self, extra=None): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into @@ -165,12 +167,14 @@ class Params(Identifiable): :param extra: extra param values :return: merged param map +if extra is None: +extra = dict() paramMap = self._defaultParamMap.copy() paramMap.update(self._paramMap) paramMap.update(extra) return paramMap -def copy(self, extra={}): +def copy(self, extra=None): Creates a copy of this instance with the same uid and some extra params. 
The default implementation creates a @@ -181,6 +185,8 @@ class Params(Identifiable): :param extra: Extra parameters to copy to the new instance :return: Copy of this instance +if extra is None: +extra = dict() that = copy.copy(self) that._paramMap = self.extractParamMap(extra) return that @@ -233,7 +239,7 @@ class Params(Identifiable): self._defaultParamMap[getattr(self, param)] = value return self -def _copyValues(self, to, extra={}): +def _copyValues(self, to, extra=None): Copies param values from this instance to another instance for params shared by them
spark git commit: [SPARK-9981] [ML] Made labels public for StringIndexerModel
Repository: spark Updated Branches: refs/heads/master 11ed2b180 - 2a6590e51 [SPARK-9981] [ML] Made labels public for StringIndexerModel Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #8211 from jkbradley/stridx-labels. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2a6590e5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2a6590e5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2a6590e5 Branch: refs/heads/master Commit: 2a6590e510aba3bfc6603d280023128b3f5ac702 Parents: 11ed2b1 Author: Joseph K. Bradley jos...@databricks.com Authored: Fri Aug 14 14:05:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 14 14:05:03 2015 -0700 -- .../apache/spark/ml/feature/StringIndexer.scala | 5 - .../spark/ml/feature/StringIndexerSuite.scala | 18 ++ 2 files changed, 22 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2a6590e5/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index 6347578..24250e4 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala @@ -97,14 +97,17 @@ class StringIndexer(override val uid: String) extends Estimator[StringIndexerMod /** * :: Experimental :: * Model fitted by [[StringIndexer]]. + * * NOTE: During transformation, if the input column does not exist, * [[StringIndexerModel.transform]] would return the input dataset unmodified. 
* This is a temporary fix for the case when target labels do not exist during prediction. + * + * @param labels Ordered list of labels, corresponding to indices to be assigned */ @Experimental class StringIndexerModel ( override val uid: String, -labels: Array[String]) extends Model[StringIndexerModel] with StringIndexerBase { +val labels: Array[String]) extends Model[StringIndexerModel] with StringIndexerBase { def this(labels: Array[String]) = this(Identifiable.randomUID(strIdx), labels) http://git-wip-us.apache.org/repos/asf/spark/blob/2a6590e5/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala index 0b4c8ba..05e05bd 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala @@ -147,4 +147,22 @@ class StringIndexerSuite extends SparkFunSuite with MLlibTestSparkContext { assert(actual === expected) } } + + test(StringIndexer, IndexToString are inverses) { +val data = sc.parallelize(Seq((0, a), (1, b), (2, c), (3, a), (4, a), (5, c)), 2) +val df = sqlContext.createDataFrame(data).toDF(id, label) +val indexer = new StringIndexer() + .setInputCol(label) + .setOutputCol(labelIndex) + .fit(df) +val transformed = indexer.transform(df) +val idx2str = new IndexToString() + .setInputCol(labelIndex) + .setOutputCol(sameLabel) + .setLabels(indexer.labels) +idx2str.transform(transformed).select(label, sameLabel).collect().foreach { + case Row(a: String, b: String) = +assert(a === b) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
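The new test asserts that `StringIndexer` and `IndexToString` are inverses once the model's `labels` are public. A minimal plain-Python sketch of the same round trip, assuming labels are ordered by descending frequency (which is what `StringIndexer` does); the helper names are illustrative, not Spark APIs:

```python
from collections import Counter

def fit_string_indexer(values):
    """Order labels by descending frequency, like StringIndexer.fit."""
    return [label for label, _ in Counter(values).most_common()]

def transform(labels, values):
    """Map each string to its label index, like StringIndexerModel.transform."""
    index = {label: i for i, label in enumerate(labels)}
    return [index[v] for v in values]

def index_to_string(labels, indices):
    """The inverse mapping that IndexToString performs via setLabels."""
    return [labels[i] for i in indices]

data = ["a", "b", "c", "a", "a", "c"]      # same strings as the Scala test
labels = fit_string_indexer(data)           # ['a', 'c', 'b'] by frequency
indexed = transform(labels, data)           # [0, 2, 1, 0, 0, 1]
assert index_to_string(labels, indexed) == data   # round trip, as the test asserts
```

Exposing `val labels` is what makes `.setLabels(indexer.labels)` possible from user code, replacing the removed `inverse()` method.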
spark git commit: [SPARK-9981] [ML] Made labels public for StringIndexerModel
Repository: spark Updated Branches: refs/heads/branch-1.5 59cdcc079 - 0f4ccdc4c [SPARK-9981] [ML] Made labels public for StringIndexerModel Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #8211 from jkbradley/stridx-labels. (cherry picked from commit 2a6590e510aba3bfc6603d280023128b3f5ac702) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0f4ccdc4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0f4ccdc4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0f4ccdc4 Branch: refs/heads/branch-1.5 Commit: 0f4ccdc4cfa02ad78f2c4949ddb3822d07d65104 Parents: 59cdcc0 Author: Joseph K. Bradley jos...@databricks.com Authored: Fri Aug 14 14:05:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Aug 14 14:11:26 2015 -0700 -- .../apache/spark/ml/feature/StringIndexer.scala | 5 - .../spark/ml/feature/StringIndexerSuite.scala | 18 ++ 2 files changed, 22 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/0f4ccdc4/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index f5dfba1..76f017d 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala @@ -93,14 +93,17 @@ class StringIndexer(override val uid: String) extends Estimator[StringIndexerMod /** * :: Experimental :: * Model fitted by [[StringIndexer]]. 
+ * * NOTE: During transformation, if the input column does not exist, * [[StringIndexerModel.transform]] would return the input dataset unmodified. * This is a temporary fix for the case when target labels do not exist during prediction. + * + * @param labels Ordered list of labels, corresponding to indices to be assigned */ @Experimental class StringIndexerModel ( override val uid: String, -labels: Array[String]) extends Model[StringIndexerModel] with StringIndexerBase { +val labels: Array[String]) extends Model[StringIndexerModel] with StringIndexerBase { def this(labels: Array[String]) = this(Identifiable.randomUID(strIdx), labels) http://git-wip-us.apache.org/repos/asf/spark/blob/0f4ccdc4/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala index d960861..5fe66a3 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala @@ -116,4 +116,22 @@ class StringIndexerSuite extends SparkFunSuite with MLlibTestSparkContext { assert(actual === expected) } } + + test(StringIndexer, IndexToString are inverses) { +val data = sc.parallelize(Seq((0, a), (1, b), (2, c), (3, a), (4, a), (5, c)), 2) +val df = sqlContext.createDataFrame(data).toDF(id, label) +val indexer = new StringIndexer() + .setInputCol(label) + .setOutputCol(labelIndex) + .fit(df) +val transformed = indexer.transform(df) +val idx2str = new IndexToString() + .setInputCol(labelIndex) + .setOutputCol(sameLabel) + .setLabels(indexer.labels) +idx2str.transform(transformed).select(label, sameLabel).collect().foreach { + case Row(a: String, b: String) = +assert(a === b) +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol
Repository: spark Updated Branches: refs/heads/branch-1.5 d213aa77c - ae18342a5 [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues. This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters. jkbradley yu-iskw Author: Xiangrui Meng m...@databricks.com Closes #8148 from mengxr/SPARK-9918 and squashes the following commits: 149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol 3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API (cherry picked from commit 68f99571492f67596b3656e9f076deeb96616f4a) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ae18342a Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ae18342a Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ae18342a Branch: refs/heads/branch-1.5 Commit: ae18342a5d54a4f13d88579aac45ca4544268112 Parents: d213aa7 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 23:04:59 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 23:05:06 2015 -0700 -- .../org/apache/spark/ml/clustering/KMeans.scala | 51 .../spark/ml/clustering/KMeansSuite.scala | 12 +--- python/pyspark/ml/clustering.py | 63 3 files changed, 26 insertions(+), 100 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ae18342a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala -- diff --git 
a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala index dc192ad..47a18cd 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala @@ -18,8 +18,8 @@ package org.apache.spark.ml.clustering import org.apache.spark.annotation.Experimental -import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, ParamMap} -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, HasPredictionCol, HasSeed} +import org.apache.spark.ml.param.{Param, Params, IntParam, ParamMap} +import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.mllib.clustering.{KMeans = MLlibKMeans, KMeansModel = MLlibKMeansModel} @@ -27,14 +27,13 @@ import org.apache.spark.mllib.linalg.{Vector, VectorUDT} import org.apache.spark.sql.functions.{col, udf} import org.apache.spark.sql.types.{IntegerType, StructType} import org.apache.spark.sql.{DataFrame, Row} -import org.apache.spark.util.Utils /** * Common params for KMeans and KMeansModel */ -private[clustering] trait KMeansParams -extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol { +private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol + with HasSeed with HasPredictionCol with HasTol { /** * Set the number of clusters to create (k). Must be 1. Default: 2. @@ -46,31 +45,6 @@ private[clustering] trait KMeansParams def getK: Int = $(k) /** - * Param the number of runs of the algorithm to execute in parallel. We initialize the algorithm - * this many times with random starting conditions (configured by the initialization mode), then - * return the best clustering found over any run. Must be = 1. Default: 1. 
- * @group param - */ - final val runs = new IntParam(this, runs, -number of runs of the algorithm to execute in parallel, (value: Int) = value = 1) - - /** @group getParam */ - def getRuns: Int = $(runs) - - /** - * Param the distance threshold within which we've consider centers to have converged. - * If all centers move less than this Euclidean distance, we stop iterating one run. - * Must be = 0.0. Default: 1e-4 - * @group param - */ - final val epsilon = new DoubleParam(this, epsilon, -distance threshold within which we've consider centers to have converge, -(value: Double) = value = 0.0) - - /** @group getParam */ - def getEpsilon: Double = $(epsilon) - - /** * Param
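The removed `epsilon` param (renamed to the shared `tol`) is the distance threshold within which centers are considered converged: iteration stops when every center moves less than `tol` in Euclidean distance. A rough sketch of that stopping test (illustrative only, not Spark's implementation):

```python
import math

def converged(old_centers, new_centers, tol=1e-4):
    """Return True when every center moved less than `tol` (Euclidean)."""
    for old, new in zip(old_centers, new_centers):
        dist = math.sqrt(sum((o - n) ** 2 for o, n in zip(old, new)))
        if dist >= tol:
            return False   # at least one center is still moving
    return True

assert converged([(0.0, 0.0)], [(0.0, 5e-5)])        # tiny move: stop
assert not converged([(0.0, 0.0)], [(0.1, 0.0)])     # still moving: iterate
```

Renaming to `tol` lets k-means reuse the shared `HasTol` trait instead of defining its own `DoubleParam`, which is the consistency goal stated in the commit message.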
spark git commit: [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol
Repository: spark Updated Branches: refs/heads/master d0b18919d - 68f995714 [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues. This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters. jkbradley yu-iskw Author: Xiangrui Meng m...@databricks.com Closes #8148 from mengxr/SPARK-9918 and squashes the following commits: 149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol 3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/68f99571 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/68f99571 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/68f99571 Branch: refs/heads/master Commit: 68f99571492f67596b3656e9f076deeb96616f4a Parents: d0b1891 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 23:04:59 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 23:04:59 2015 -0700 -- .../org/apache/spark/ml/clustering/KMeans.scala | 51 .../spark/ml/clustering/KMeansSuite.scala | 12 +--- python/pyspark/ml/clustering.py | 63 3 files changed, 26 insertions(+), 100 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/68f99571/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
index dc192ad..47a18cd 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala @@ -18,8 +18,8 @@ package org.apache.spark.ml.clustering import org.apache.spark.annotation.Experimental -import org.apache.spark.ml.param.{Param, Params, IntParam, DoubleParam, ParamMap} -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasMaxIter, HasPredictionCol, HasSeed} +import org.apache.spark.ml.param.{Param, Params, IntParam, ParamMap} +import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.mllib.clustering.{KMeans = MLlibKMeans, KMeansModel = MLlibKMeansModel} @@ -27,14 +27,13 @@ import org.apache.spark.mllib.linalg.{Vector, VectorUDT} import org.apache.spark.sql.functions.{col, udf} import org.apache.spark.sql.types.{IntegerType, StructType} import org.apache.spark.sql.{DataFrame, Row} -import org.apache.spark.util.Utils /** * Common params for KMeans and KMeansModel */ -private[clustering] trait KMeansParams -extends Params with HasMaxIter with HasFeaturesCol with HasSeed with HasPredictionCol { +private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFeaturesCol + with HasSeed with HasPredictionCol with HasTol { /** * Set the number of clusters to create (k). Must be 1. Default: 2. @@ -46,31 +45,6 @@ private[clustering] trait KMeansParams def getK: Int = $(k) /** - * Param the number of runs of the algorithm to execute in parallel. We initialize the algorithm - * this many times with random starting conditions (configured by the initialization mode), then - * return the best clustering found over any run. Must be = 1. Default: 1. 
- * @group param - */ - final val runs = new IntParam(this, runs, -number of runs of the algorithm to execute in parallel, (value: Int) = value = 1) - - /** @group getParam */ - def getRuns: Int = $(runs) - - /** - * Param the distance threshold within which we've consider centers to have converged. - * If all centers move less than this Euclidean distance, we stop iterating one run. - * Must be = 0.0. Default: 1e-4 - * @group param - */ - final val epsilon = new DoubleParam(this, epsilon, -distance threshold within which we've consider centers to have converge, -(value: Double) = value = 0.0) - - /** @group getParam */ - def getEpsilon: Double = $(epsilon) - - /** * Param for the initialization algorithm. This can be either random to choose random points as * initial cluster centers, or k-means|| to use
spark git commit: [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel
Repository: spark Updated Branches: refs/heads/branch-1.5 49085b56c - 2b1353249 [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on. Author: Yanbo Liang yblia...@gmail.com Closes #8164 from yanboliang/mlp-name. (cherry picked from commit 4b70798c96b0a784b85fda461426ec60f609be12) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2b135324 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2b135324 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2b135324 Branch: refs/heads/branch-1.5 Commit: 2b13532497b23eb6e02e4b0ef7503e73242f932d Parents: 49085b5 Author: Yanbo Liang yblia...@gmail.com Authored: Thu Aug 13 09:31:14 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 09:31:24 2015 -0700 -- .../MultilayerPerceptronClassifier.scala| 16 1 file changed, 8 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2b135324/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index 8cd2103..c154561 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -131,7 +131,7 @@ private object LabelConverter { */ @Experimental class MultilayerPerceptronClassifier(override val uid: String) - extends Predictor[Vector, MultilayerPerceptronClassifier, 
MultilayerPerceptronClassifierModel] + extends Predictor[Vector, MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel] with MultilayerPerceptronParams { def this() = this(Identifiable.randomUID(mlpc)) @@ -146,7 +146,7 @@ class MultilayerPerceptronClassifier(override val uid: String) * @param dataset Training dataset * @return Fitted model */ - override protected def train(dataset: DataFrame): MultilayerPerceptronClassifierModel = { + override protected def train(dataset: DataFrame): MultilayerPerceptronClassificationModel = { val myLayers = $(layers) val labels = myLayers.last val lpData = extractLabeledPoints(dataset) @@ -156,13 +156,13 @@ class MultilayerPerceptronClassifier(override val uid: String) FeedForwardTrainer.LBFGSOptimizer.setConvergenceTol($(tol)).setNumIterations($(maxIter)) FeedForwardTrainer.setStackSize($(blockSize)) val mlpModel = FeedForwardTrainer.train(data) -new MultilayerPerceptronClassifierModel(uid, myLayers, mlpModel.weights()) +new MultilayerPerceptronClassificationModel(uid, myLayers, mlpModel.weights()) } } /** * :: Experimental :: - * Classifier model based on the Multilayer Perceptron. + * Classification model based on the Multilayer Perceptron. * Each layer has sigmoid activation function, output layer has softmax. 
* @param uid uid * @param layers array of layer sizes including input and output layers @@ -170,11 +170,11 @@ class MultilayerPerceptronClassifier(override val uid: String) * @return prediction model */ @Experimental -class MultilayerPerceptronClassifierModel private[ml] ( +class MultilayerPerceptronClassificationModel private[ml] ( override val uid: String, layers: Array[Int], weights: Vector) - extends PredictionModel[Vector, MultilayerPerceptronClassifierModel] + extends PredictionModel[Vector, MultilayerPerceptronClassificationModel] with Serializable { private val mlpModel = FeedForwardTopology.multiLayerPerceptron(layers, true).getInstance(weights) @@ -187,7 +187,7 @@ class MultilayerPerceptronClassifierModel private[ml] ( LabelConverter.decodeLabel(mlpModel.predict(features)) } - override def copy(extra: ParamMap): MultilayerPerceptronClassifierModel = { -copyValues(new MultilayerPerceptronClassifierModel(uid, layers, weights), extra) + override def copy(extra: ParamMap): MultilayerPerceptronClassificationModel = { +copyValues(new MultilayerPerceptronClassificationModel(uid, layers, weights), extra) } }
spark git commit: [MINOR] [DOC] fix mllib pydoc warnings
Repository: spark Updated Branches: refs/heads/branch-1.5 2b1353249 - 883c7d35f [MINOR] [DOC] fix mllib pydoc warnings Switch to correct Sphinx syntax. MechCoder Author: Xiangrui Meng m...@databricks.com Closes #8169 from mengxr/mllib-pydoc-fix. (cherry picked from commit 65fec798ce52ca6b8b0fe14b78a16712778ad04c) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/883c7d35 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/883c7d35 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/883c7d35 Branch: refs/heads/branch-1.5 Commit: 883c7d35f978a7d8651aaf8e93bd0c9ba09a441d Parents: 2b13532 Author: Xiangrui Meng m...@databricks.com Authored: Thu Aug 13 10:16:40 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 10:16:53 2015 -0700 -- python/pyspark/mllib/regression.py | 14 ++ python/pyspark/mllib/util.py | 1 + 2 files changed, 11 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/883c7d35/python/pyspark/mllib/regression.py -- diff --git a/python/pyspark/mllib/regression.py b/python/pyspark/mllib/regression.py index 5b7afc1..41946e3 100644 --- a/python/pyspark/mllib/regression.py +++ b/python/pyspark/mllib/regression.py @@ -207,8 +207,10 @@ class LinearRegressionWithSGD(object): Train a linear regression model using Stochastic Gradient Descent (SGD). This solves the least squares regression formulation -f(weights) = 1/n ||A weights-y||^2^ -(which is the mean squared error). + +f(weights) = 1/(2n) ||A weights - y||^2, + +which is the mean squared error. Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. @@ -334,7 +336,9 @@ class LassoWithSGD(object): Stochastic Gradient Descent. 
This solves the l1-regularized least squares regression formulation -f(weights) = 1/2n ||A weights-y||^2^ + regParam ||weights||_1 + +f(weights) = 1/(2n) ||A weights - y||^2 + regParam ||weights||_1. + Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. @@ -451,7 +455,9 @@ class RidgeRegressionWithSGD(object): Stochastic Gradient Descent. This solves the l2-regularized least squares regression formulation -f(weights) = 1/2n ||A weights-y||^2^ + regParam/2 ||weights||^2^ + +f(weights) = 1/(2n) ||A weights - y||^2 + regParam/2 ||weights||^2. + Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. http://git-wip-us.apache.org/repos/asf/spark/blob/883c7d35/python/pyspark/mllib/util.py -- diff --git a/python/pyspark/mllib/util.py b/python/pyspark/mllib/util.py index 916de2d..10a1e4b 100644 --- a/python/pyspark/mllib/util.py +++ b/python/pyspark/mllib/util.py @@ -300,6 +300,7 @@ class LinearDataGenerator(object): :param: seed Random Seed :param: eps Used to scale the noise. If eps is set high, the amount of gaussian noise added is more. + Returns a list of LabeledPoints of length nPoints weights = [float(weight) for weight in weights] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
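The corrected docstrings state three objectives: plain least squares f(w) = 1/(2n) ||Aw - y||^2, the lasso adds regParam ||w||_1, and ridge adds regParam/2 ||w||^2. A small numpy sketch of the three objectives as written in the fixed docs (formulas only; Spark's SGD solvers minimize these, which this sketch does not do):

```python
import numpy as np

def lsq(w, A, y):
    """f(w) = 1/(2n) ||A w - y||^2  -- LinearRegressionWithSGD objective."""
    n = len(y)
    return np.sum((A @ w - y) ** 2) / (2 * n)

def lasso(w, A, y, reg):
    """lsq + regParam ||w||_1  -- LassoWithSGD objective."""
    return lsq(w, A, y) + reg * np.sum(np.abs(w))

def ridge(w, A, y, reg):
    """lsq + regParam/2 ||w||^2  -- RidgeRegressionWithSGD objective."""
    return lsq(w, A, y) + reg / 2 * np.sum(w ** 2)

A = np.eye(2)
y = np.array([1.0, 2.0])
w = np.array([1.0, 1.0])
# residual A w - y = (0, -1), so lsq = 1 / (2 * 2) = 0.25
assert lsq(w, A, y) == 0.25
assert lasso(w, A, y, reg=0.25) == 0.25 + 0.25 * 2    # 0.75
assert ridge(w, A, y, reg=0.25) == 0.25 + 0.125 * 2   # 0.5
```

Note the original docstrings dropped the factor of 2 (`1/2n` rendered ambiguously and `^2^` was broken Sphinx markup); the patch writes `1/(2n)` explicitly so pydoc matches the implementation.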
spark git commit: [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel
Repository: spark Updated Branches: refs/heads/master 7a539ef3b - 4b70798c9 [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on. Author: Yanbo Liang yblia...@gmail.com Closes #8164 from yanboliang/mlp-name. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/4b70798c Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/4b70798c Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/4b70798c Branch: refs/heads/master Commit: 4b70798c96b0a784b85fda461426ec60f609be12 Parents: 7a539ef Author: Yanbo Liang yblia...@gmail.com Authored: Thu Aug 13 09:31:14 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 09:31:14 2015 -0700 -- .../MultilayerPerceptronClassifier.scala| 16 1 file changed, 8 insertions(+), 8 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/4b70798c/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala index 8cd2103..c154561 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala @@ -131,7 +131,7 @@ private object LabelConverter { */ @Experimental class MultilayerPerceptronClassifier(override val uid: String) - extends Predictor[Vector, MultilayerPerceptronClassifier, MultilayerPerceptronClassifierModel] + extends Predictor[Vector, MultilayerPerceptronClassifier, MultilayerPerceptronClassificationModel] with 
MultilayerPerceptronParams { def this() = this(Identifiable.randomUID(mlpc)) @@ -146,7 +146,7 @@ class MultilayerPerceptronClassifier(override val uid: String) * @param dataset Training dataset * @return Fitted model */ - override protected def train(dataset: DataFrame): MultilayerPerceptronClassifierModel = { + override protected def train(dataset: DataFrame): MultilayerPerceptronClassificationModel = { val myLayers = $(layers) val labels = myLayers.last val lpData = extractLabeledPoints(dataset) @@ -156,13 +156,13 @@ class MultilayerPerceptronClassifier(override val uid: String) FeedForwardTrainer.LBFGSOptimizer.setConvergenceTol($(tol)).setNumIterations($(maxIter)) FeedForwardTrainer.setStackSize($(blockSize)) val mlpModel = FeedForwardTrainer.train(data) -new MultilayerPerceptronClassifierModel(uid, myLayers, mlpModel.weights()) +new MultilayerPerceptronClassificationModel(uid, myLayers, mlpModel.weights()) } } /** * :: Experimental :: - * Classifier model based on the Multilayer Perceptron. + * Classification model based on the Multilayer Perceptron. * Each layer has sigmoid activation function, output layer has softmax. 
* @param uid uid * @param layers array of layer sizes including input and output layers @@ -170,11 +170,11 @@ class MultilayerPerceptronClassifier(override val uid: String) * @return prediction model */ @Experimental -class MultilayerPerceptronClassifierModel private[ml] ( +class MultilayerPerceptronClassificationModel private[ml] ( override val uid: String, layers: Array[Int], weights: Vector) - extends PredictionModel[Vector, MultilayerPerceptronClassifierModel] + extends PredictionModel[Vector, MultilayerPerceptronClassificationModel] with Serializable { private val mlpModel = FeedForwardTopology.multiLayerPerceptron(layers, true).getInstance(weights) @@ -187,7 +187,7 @@ class MultilayerPerceptronClassifierModel private[ml] ( LabelConverter.decodeLabel(mlpModel.predict(features)) } - override def copy(extra: ParamMap): MultilayerPerceptronClassifierModel = { -copyValues(new MultilayerPerceptronClassifierModel(uid, layers, weights), extra) + override def copy(extra: ParamMap): MultilayerPerceptronClassificationModel = { +copyValues(new MultilayerPerceptronClassificationModel(uid, layers, weights), extra) } }
spark git commit: [MINOR] [DOC] fix mllib pydoc warnings
Repository: spark Updated Branches: refs/heads/master 4b70798c9 - 65fec798c [MINOR] [DOC] fix mllib pydoc warnings Switch to correct Sphinx syntax. MechCoder Author: Xiangrui Meng m...@databricks.com Closes #8169 from mengxr/mllib-pydoc-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/65fec798 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/65fec798 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/65fec798 Branch: refs/heads/master Commit: 65fec798ce52ca6b8b0fe14b78a16712778ad04c Parents: 4b70798 Author: Xiangrui Meng m...@databricks.com Authored: Thu Aug 13 10:16:40 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 10:16:40 2015 -0700 -- python/pyspark/mllib/regression.py | 14 ++ python/pyspark/mllib/util.py | 1 + 2 files changed, 11 insertions(+), 4 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/65fec798/python/pyspark/mllib/regression.py -- diff --git a/python/pyspark/mllib/regression.py b/python/pyspark/mllib/regression.py index 5b7afc1..41946e3 100644 --- a/python/pyspark/mllib/regression.py +++ b/python/pyspark/mllib/regression.py @@ -207,8 +207,10 @@ class LinearRegressionWithSGD(object): Train a linear regression model using Stochastic Gradient Descent (SGD). This solves the least squares regression formulation -f(weights) = 1/n ||A weights-y||^2^ -(which is the mean squared error). + +f(weights) = 1/(2n) ||A weights - y||^2, + +which is the mean squared error. Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. @@ -334,7 +336,9 @@ class LassoWithSGD(object): Stochastic Gradient Descent. This solves the l1-regularized least squares regression formulation -f(weights) = 1/2n ||A weights-y||^2^ + regParam ||weights||_1 + +f(weights) = 1/(2n) ||A weights - y||^2 + regParam ||weights||_1. 
+ Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. @@ -451,7 +455,9 @@ class RidgeRegressionWithSGD(object): Stochastic Gradient Descent. This solves the l2-regularized least squares regression formulation -f(weights) = 1/2n ||A weights-y||^2^ + regParam/2 ||weights||^2^ + +f(weights) = 1/(2n) ||A weights - y||^2 + regParam/2 ||weights||^2. + Here the data matrix has n rows, and the input RDD holds the set of rows of A, each with its corresponding right hand side label y. See also the documentation for the precise formulation. http://git-wip-us.apache.org/repos/asf/spark/blob/65fec798/python/pyspark/mllib/util.py -- diff --git a/python/pyspark/mllib/util.py b/python/pyspark/mllib/util.py index 916de2d..10a1e4b 100644 --- a/python/pyspark/mllib/util.py +++ b/python/pyspark/mllib/util.py @@ -300,6 +300,7 @@ class LinearDataGenerator(object): :param: seed Random Seed :param: eps Used to scale the noise. If eps is set high, the amount of gaussian noise added is more. + Returns a list of LabeledPoints of length nPoints weights = [float(weight) for weight in weights] - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString
Repository: spark Updated Branches: refs/heads/master c2520f501 - 6c5858bc6 [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better. ~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~ I also removed `invert`. jkbradley holdenk Author: Xiangrui Meng m...@databricks.com Closes #8152 from mengxr/SPARK-9922. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6c5858bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6c5858bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6c5858bc Branch: refs/heads/master Commit: 6c5858bc65c8a8602422b46bfa9cf0a1fb296b88 Parents: c2520f5 Author: Xiangrui Meng m...@databricks.com Authored: Thu Aug 13 16:52:17 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 16:52:17 2015 -0700 -- .../apache/spark/ml/feature/StringIndexer.scala | 34 + .../spark/ml/feature/StringIndexerSuite.scala | 50 ++-- 2 files changed, 48 insertions(+), 36 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6c5858bc/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index 9e4b0f0..9f6e7b6 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala @@ -24,7 +24,7 @@ import org.apache.spark.ml.attribute.{Attribute, NominalAttribute} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ import org.apache.spark.ml.Transformer -import org.apache.spark.ml.util.{Identifiable, MetadataUtils} +import 
org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.{DoubleType, NumericType, StringType, StructType} @@ -59,6 +59,8 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha * If the input column is numeric, we cast it to string and index the string values. * The indices are in [0, numLabels), ordered by label frequencies. * So the most frequent label gets index 0. + * + * @see [[IndexToString]] for the inverse transformation */ @Experimental class StringIndexer(override val uid: String) extends Estimator[StringIndexerModel] @@ -170,34 +172,24 @@ class StringIndexerModel private[ml] ( val copied = new StringIndexerModel(uid, labels) copyValues(copied, extra).setParent(parent) } - - /** - * Return a model to perform the inverse transformation. - * Note: By default we keep the original columns during this transformation, so the inverse - * should only be used on new columns such as predicted labels. - */ - def invert(inputCol: String, outputCol: String): StringIndexerInverse = { -new StringIndexerInverse() - .setInputCol(inputCol) - .setOutputCol(outputCol) - .setLabels(labels) - } } /** * :: Experimental :: - * Transform a provided column back to the original input types using either the metadata - * on the input column, or if provided using the labels supplied by the user. - * Note: By default we keep the original columns during this transformation, - * so the inverse should only be used on new columns such as predicted labels. + * A [[Transformer]] that maps a column of string indices back to a new column of corresponding + * string values using either the ML attributes of the input column, or if provided using the labels + * supplied by the user. + * All original columns are kept during transformation. 
+ * + * @see [[StringIndexer]] for converting strings into indices */ @Experimental -class StringIndexerInverse private[ml] ( +class IndexToString private[ml] ( override val uid: String) extends Transformer with HasInputCol with HasOutputCol { def this() = -this(Identifiable.randomUID("strIdxInv")) +this(Identifiable.randomUID("idxToStr")) /** @group setParam */ def setInputCol(value: String): this.type = set(inputCol, value) @@ -257,7 +249,7 @@ class StringIndexerInverse private[ml] ( } val indexer = udf { index: Double => val idx = index.toInt - if (0 <= idx && idx < values.size) { + if (0 <= idx && idx < values.length) { values(idx) } else { throw new SparkException(s"Unseen index: $index ??") @@ -268,7 +260,7 @@ class StringIndexerInverse private
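The udf patched above maps a double-valued index back to its label only when it falls in [0, labels.length), and fails loudly on an unseen index. A plain-Python sketch of that bounds-checked mapping (illustrative, not the Spark implementation):

```python
def index_to_string(index, labels):
    # Mirror of the udf in IndexToString: cast the double index to int,
    # return the label if it is in range, otherwise fail on the unseen index.
    idx = int(index)
    if 0 <= idx < len(labels):
        return labels[idx]
    raise ValueError(f"Unseen index: {index}")

labels = ["a", "b", "c"]
decoded = [index_to_string(i, labels) for i in (0.0, 2.0, 1.0)]  # ['a', 'c', 'b']
```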
spark git commit: [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString
Repository: spark Updated Branches: refs/heads/branch-1.5 2c7f8da58 - 2b6b1d12f [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better. ~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~ I also removed `invert`. jkbradley holdenk Author: Xiangrui Meng m...@databricks.com Closes #8152 from mengxr/SPARK-9922. (cherry picked from commit 6c5858bc65c8a8602422b46bfa9cf0a1fb296b88) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2b6b1d12 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2b6b1d12 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2b6b1d12 Branch: refs/heads/branch-1.5 Commit: 2b6b1d12fb6bd0bd86988babc4c807856011f246 Parents: 2c7f8da Author: Xiangrui Meng m...@databricks.com Authored: Thu Aug 13 16:52:17 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Aug 13 16:54:06 2015 -0700 -- .../apache/spark/ml/feature/StringIndexer.scala | 34 ++ .../spark/ml/feature/StringIndexerSuite.scala | 47 ++-- 2 files changed, 47 insertions(+), 34 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2b6b1d12/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala index 569c834..b87e154 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala @@ -24,7 +24,7 @@ import org.apache.spark.ml.attribute.{Attribute, NominalAttribute} import org.apache.spark.ml.param._ import org.apache.spark.ml.param.shared._ 
import org.apache.spark.ml.Transformer -import org.apache.spark.ml.util.{Identifiable, MetadataUtils} +import org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.DataFrame import org.apache.spark.sql.functions._ import org.apache.spark.sql.types.{DoubleType, NumericType, StringType, StructType} @@ -58,6 +58,8 @@ private[feature] trait StringIndexerBase extends Params with HasInputCol with Ha * If the input column is numeric, we cast it to string and index the string values. * The indices are in [0, numLabels), ordered by label frequencies. * So the most frequent label gets index 0. + * + * @see [[IndexToString]] for the inverse transformation */ @Experimental class StringIndexer(override val uid: String) extends Estimator[StringIndexerModel] @@ -152,34 +154,24 @@ class StringIndexerModel private[ml] ( val copied = new StringIndexerModel(uid, labels) copyValues(copied, extra).setParent(parent) } - - /** - * Return a model to perform the inverse transformation. - * Note: By default we keep the original columns during this transformation, so the inverse - * should only be used on new columns such as predicted labels. - */ - def invert(inputCol: String, outputCol: String): StringIndexerInverse = { -new StringIndexerInverse() - .setInputCol(inputCol) - .setOutputCol(outputCol) - .setLabels(labels) - } } /** * :: Experimental :: - * Transform a provided column back to the original input types using either the metadata - * on the input column, or if provided using the labels supplied by the user. - * Note: By default we keep the original columns during this transformation, - * so the inverse should only be used on new columns such as predicted labels. + * A [[Transformer]] that maps a column of string indices back to a new column of corresponding + * string values using either the ML attributes of the input column, or if provided using the labels + * supplied by the user. + * All original columns are kept during transformation. 
+ * + * @see [[StringIndexer]] for converting strings into indices */ @Experimental -class StringIndexerInverse private[ml] ( +class IndexToString private[ml] ( override val uid: String) extends Transformer with HasInputCol with HasOutputCol { def this() = -this(Identifiable.randomUID("strIdxInv")) +this(Identifiable.randomUID("idxToStr")) /** @group setParam */ def setInputCol(value: String): this.type = set(inputCol, value) @@ -239,7 +231,7 @@ class StringIndexerInverse private[ml] ( } val indexer = udf { index: Double => val idx = index.toInt - if (0 <= idx && idx < values.size) { + if (0 <= idx && idx < values.length) { values(idx
spark git commit: [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params
Repository: spark Updated Branches: refs/heads/master caa14d9dc - 6e409bc13 [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params As per the TODO move weightCol to Shared Params. Author: Holden Karau hol...@pigscanfly.ca Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6e409bc1 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6e409bc1 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6e409bc1 Branch: refs/heads/master Commit: 6e409bc1357f49de2efdfc4226d074b943fb1153 Parents: caa14d9 Author: Holden Karau hol...@pigscanfly.ca Authored: Wed Aug 12 16:54:45 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 16:54:45 2015 -0700 -- .../spark/ml/param/shared/SharedParamsCodeGen.scala | 4 +++- .../apache/spark/ml/param/shared/sharedParams.scala | 15 +++ .../spark/ml/regression/IsotonicRegression.scala| 16 ++-- 3 files changed, 20 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6e409bc1/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala index 9e12f18..8c16c61 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala @@ -70,7 +70,9 @@ private[shared] object SharedParamsCodeGen { For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty., isValid = ParamValidators.inRange(0, 1)), ParamDesc[Double](tol, the convergence tolerance for iterative algorithms), - ParamDesc[Double](stepSize, Step size to be used for each iteration of optimization.)) + ParamDesc[Double](stepSize, Step size to be used for each iteration of optimization.), + ParamDesc[String](weightCol, weight column name. If this is not set or empty, we treat + +all instance weights as 1.0.)) val code = genSharedParams(params) val file = src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala http://git-wip-us.apache.org/repos/asf/spark/blob/6e409bc1/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala index a17d4ea..c267689 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala @@ -342,4 +342,19 @@ private[ml] trait HasStepSize extends Params { /** @group getParam */ final def getStepSize: Double = $(stepSize) } + +/** + * Trait for shared param weightCol. + */ +private[ml] trait HasWeightCol extends Params { + + /** + * Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.. + * @group param + */ + final val weightCol: Param[String] = new Param[String](this, weightCol, weight column name. If this is not set or empty, we treat all instance weights as 1.0.) 
+ + /** @group getParam */ + final def getWeightCol: String = $(weightCol) +} // scalastyle:on http://git-wip-us.apache.org/repos/asf/spark/blob/6e409bc1/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala index f570590..0f33bae 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala @@ -21,7 +21,7 @@ import org.apache.spark.Logging import org.apache.spark.annotation.Experimental import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasPredictionCol} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol} import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors} import org.apache.spark.mllib.regression.{IsotonicRegression = MLlibIsotonicRegression, IsotonicRegressionModel = MLlibIsotonicRegressionModel
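The new shared weightCol param fixes a convention: when the column is not set or empty, every instance weight is treated as 1.0. A minimal Python sketch of that convention, using hypothetical row dicts rather than Spark DataFrames:

```python
def resolve_weights(rows, weight_col=None):
    # If weightCol is not set or empty, treat all instance weights as 1.0
    # (the behavior documented by the new HasWeightCol shared param).
    if not weight_col:
        return [1.0] * len(rows)
    return [float(row[weight_col]) for row in rows]

rows = [{"label": 1.0, "w": 2.0}, {"label": 0.0, "w": 0.5}]
unweighted = resolve_weights(rows)         # [1.0, 1.0]
weighted = resolve_weights(rows, "w")      # [2.0, 0.5]
```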
spark git commit: [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params
Repository: spark Updated Branches: refs/heads/branch-1.5 6aca0cf34 - 2f8793b5f [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params As per the TODO move weightCol to Shared Params. Author: Holden Karau hol...@pigscanfly.ca Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams. (cherry picked from commit 6e409bc1357f49de2efdfc4226d074b943fb1153) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2f8793b5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2f8793b5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2f8793b5 Branch: refs/heads/branch-1.5 Commit: 2f8793b5f47ec7c17b27715bc9b1026266061cea Parents: 6aca0cf Author: Holden Karau hol...@pigscanfly.ca Authored: Wed Aug 12 16:54:45 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 16:54:52 2015 -0700 -- .../spark/ml/param/shared/SharedParamsCodeGen.scala | 4 +++- .../apache/spark/ml/param/shared/sharedParams.scala | 15 +++ .../spark/ml/regression/IsotonicRegression.scala| 16 ++-- 3 files changed, 20 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/2f8793b5/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala index 5cb7235..3899df6 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala @@ -66,7 +66,9 @@ private[shared] object SharedParamsCodeGen { For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 penalty., isValid = ParamValidators.inRange(0, 1)), ParamDesc[Double](tol, the convergence tolerance for iterative algorithms), - ParamDesc[Double](stepSize, Step size to be used for each iteration of optimization.)) + ParamDesc[Double](stepSize, Step size to be used for each iteration of optimization.), + ParamDesc[String](weightCol, weight column name. If this is not set or empty, we treat + +all instance weights as 1.0.)) val code = genSharedParams(params) val file = src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala http://git-wip-us.apache.org/repos/asf/spark/blob/2f8793b5/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala index d4c89e6..e8e58aa 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala @@ -327,4 +327,19 @@ private[ml] trait HasStepSize extends Params { /** @group getParam */ final def getStepSize: Double = $(stepSize) } + +/** + * Trait for shared param weightCol. + */ +private[ml] trait HasWeightCol extends Params { + + /** + * Param for weight column name. If this is not set or empty, we treat all instance weights as 1.0.. + * @group param + */ + final val weightCol: Param[String] = new Param[String](this, weightCol, weight column name. If this is not set or empty, we treat all instance weights as 1.0.) 
+ + /** @group getParam */ + final def getWeightCol: String = $(weightCol) +} // scalastyle:on http://git-wip-us.apache.org/repos/asf/spark/blob/2f8793b5/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala index f570590..0f33bae 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala @@ -21,7 +21,7 @@ import org.apache.spark.Logging import org.apache.spark.annotation.Experimental import org.apache.spark.ml.{Estimator, Model} import org.apache.spark.ml.param._ -import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasPredictionCol} +import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol, HasPredictionCol, HasWeightCol} import org.apache.spark.ml.util.{Identifiable, SchemaUtils} import org.apache.spark.mllib.linalg.{Vector, VectorUDT, Vectors} import
spark git commit: [SPARK-9913] [MLLIB] LDAUtils should be private
Repository: spark Updated Branches: refs/heads/branch-1.5 08f767a1e - 6aca0cf34 [SPARK-9913] [MLLIB] LDAUtils should be private feynmanliang Author: Xiangrui Meng m...@databricks.com Closes #8142 from mengxr/SPARK-9913. (cherry picked from commit caa14d9dc9e2eb1102052b22445b63b0e004e3c7) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6aca0cf3 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6aca0cf3 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6aca0cf3 Branch: refs/heads/branch-1.5 Commit: 6aca0cf348ca0731ef72155f5a5d7739b796bb3b Parents: 08f767a Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 16:53:47 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 16:53:56 2015 -0700 -- .../main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6aca0cf3/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala index f7e5ce1..a9ba7b6 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala @@ -22,7 +22,7 @@ import breeze.numerics._ /** * Utility methods for LDA. */ -object LDAUtils { +private[clustering] object LDAUtils { /** * Log Sum Exp with overflow protection using the identity: * For any a: \log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9913] [MLLIB] LDAUtils should be private
Repository: spark Updated Branches: refs/heads/master 7035d880a - caa14d9dc [SPARK-9913] [MLLIB] LDAUtils should be private feynmanliang Author: Xiangrui Meng m...@databricks.com Closes #8142 from mengxr/SPARK-9913. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/caa14d9d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/caa14d9d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/caa14d9d Branch: refs/heads/master Commit: caa14d9dc9e2eb1102052b22445b63b0e004e3c7 Parents: 7035d88 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 16:53:47 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 16:53:47 2015 -0700 -- .../main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala| 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/caa14d9d/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala index f7e5ce1..a9ba7b6 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala @@ -22,7 +22,7 @@ import breeze.numerics._ /** * Utility methods for LDA. */ -object LDAUtils { +private[clustering] object LDAUtils { /** * Log Sum Exp with overflow protection using the identity: * For any a: \log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
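The LDAUtils comment above states the overflow-protection identity log Σ exp(x_n) = a + log Σ exp(x_n − a); choosing a = max(x) keeps every exponent at or below zero. An illustrative NumPy version (not the Breeze-based Spark code):

```python
import numpy as np

def log_sum_exp(x):
    # Overflow-protected log-sum-exp: shift by a = max(x) so every
    # exponent is <= 0, then add the shift back outside the log.
    x = np.asarray(x, dtype=float)
    a = x.max()
    return float(a + np.log(np.exp(x - a).sum()))

naive = float(np.log(np.exp([1.0, 2.0, 3.0]).sum()))
stable = log_sum_exp([1.0, 2.0, 3.0])   # matches the naive result
huge = log_sum_exp([1000.0, 1000.0])    # naive np.exp(1000.0) would overflow
```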
spark git commit: [SPARK-9915] [ML] stopWords should use StringArrayParam
Repository: spark Updated Branches: refs/heads/master e6aef5576 - fc1c7fd66 [SPARK-9915] [ML] stopWords should use StringArrayParam hhbyyh Author: Xiangrui Meng m...@databricks.com Closes #8141 from mengxr/SPARK-9915. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/fc1c7fd6 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/fc1c7fd6 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/fc1c7fd6 Branch: refs/heads/master Commit: fc1c7fd66e64ccea53b31cd2fbb98bc6d307329c Parents: e6aef55 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 17:06:12 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 17:06:12 2015 -0700 -- .../scala/org/apache/spark/ml/feature/StopWordsRemover.scala | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/fc1c7fd6/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala index 3cc4142..5d77ea0 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala @@ -19,12 +19,12 @@ package org.apache.spark.ml.feature import org.apache.spark.annotation.Experimental import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.{BooleanParam, ParamMap, StringArrayParam} import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} -import org.apache.spark.ml.param.{ParamMap, BooleanParam, Param} import org.apache.spark.ml.util.Identifiable import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.types.{StringType, StructField, ArrayType, StructType} import org.apache.spark.sql.functions.{col, udf} +import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType} 
/** * stop words list @@ -100,7 +100,7 @@ class StopWordsRemover(override val uid: String) * the stop words set to be filtered out * @group param */ - val stopWords: Param[Array[String]] = new Param(this, "stopWords", "stop words") + val stopWords: StringArrayParam = new StringArrayParam(this, "stopWords", "stop words") /** @group setParam */ def setStopWords(value: Array[String]): this.type = set(stopWords, value) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
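StopWordsRemover filters a token array against a stop-words set, with a case-sensitivity switch. A rough pure-Python sketch of that behavior (an approximation, not the Spark API):

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    # Keep every token not in the stop-word set; when matching is
    # case-insensitive, compare lower-cased forms (a simplification of
    # the transformer's caseSensitive=false path).
    if case_sensitive:
        stops = set(stop_words)
        return [t for t in tokens if t not in stops]
    stops = {s.lower() for s in stop_words}
    return [t for t in tokens if t.lower() not in stops]

filtered = remove_stop_words(["The", "red", "balloon"], ["the", "a"])  # ['red', 'balloon']
```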
spark git commit: [SPARK-8967] [DOC] add Since annotation
Repository: spark Updated Branches: refs/heads/branch-1.5 bdf8dc15d - 6a7582ea2 [SPARK-8967] [DOC] add Since annotation Add `Since` as a Scala annotation. The benefit is that we can use it without having explicit JavaDoc. This is useful for inherited methods. The limitation is that is doesn't show up in the generated Java API documentation. This might be fixed by modifying genjavadoc. I think we could leave it as a TODO. This is how the generated Scala doc looks: `since` JavaDoc tag: ![screen shot 2015-08-11 at 10 00 37 pm](https://cloud.githubusercontent.com/assets/829644/9230761/fa72865c-40d8-11e5-807e-0f3c815c5acd.png) `Since` annotation: ![screen shot 2015-08-11 at 10 00 28 pm](https://cloud.githubusercontent.com/assets/829644/9230764/0041d7f4-40d9-11e5-8124-c3f3e5d5b31f.png) rxin Author: Xiangrui Meng m...@databricks.com Closes #8131 from mengxr/SPARK-8967. (cherry picked from commit 6f60298b1d7aa97268a42eca1e3b4851a7e88cb5) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6a7582ea Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6a7582ea Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6a7582ea Branch: refs/heads/branch-1.5 Commit: 6a7582ea2d232982c3480e7d4ee357ea45d0b303 Parents: bdf8dc1 Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 14:28:23 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 14:28:34 2015 -0700 -- .../org/apache/spark/annotation/Since.scala | 28 1 file changed, 28 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/6a7582ea/core/src/main/scala/org/apache/spark/annotation/Since.scala -- diff --git a/core/src/main/scala/org/apache/spark/annotation/Since.scala b/core/src/main/scala/org/apache/spark/annotation/Since.scala new file mode 100644 index 000..fa59393 --- /dev/null +++ b/core/src/main/scala/org/apache/spark/annotation/Since.scala @@ -0,0 +1,28 @@ 
+/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.annotation + +import scala.annotation.StaticAnnotation + +/** + * A Scala annotation that specifies the Spark version when a definition was added. + * Different from the `@since` tag in JavaDoc, this annotation does not require explicit JavaDoc and + * hence works for overridden methods that inherit API documentation directly from parents. + * The limitation is that it does not show up in the generated Java API documentation. + */ +private[spark] class Since(version: String) extends StaticAnnotation - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
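Scala's @Since is a StaticAnnotation, so it attaches a version to a definition without requiring explicit JavaDoc. Python has no direct equivalent, but a hypothetical decorator conveys the idea of tagging API entry points with the version that introduced them:

```python
def since(version):
    # Hypothetical analogue of Spark's @Since annotation: attach the
    # introducing version to a function as plain metadata.
    def wrap(func):
        func.__since__ = version
        return func
    return wrap

@since("1.5.0")
def transform(df):
    """Example API entry point (hypothetical)."""
    return df
```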
spark git commit: [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType
Repository: spark Updated Branches: refs/heads/master 6e409bc13 - e6aef5576 [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType hhbyyh Author: Xiangrui Meng m...@databricks.com Closes #8140 from mengxr/SPARK-9912. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e6aef557 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e6aef557 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e6aef557 Branch: refs/heads/master Commit: e6aef55766d0e2a48e0f9cb6eda0e31a71b962f3 Parents: 6e409bc Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 17:04:31 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 17:04:31 2015 -0700 -- .../org/apache/spark/mllib/linalg/SingularValueDecomposition.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e6aef557/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala index b416d50..cff5dbe 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala @@ -31,5 +31,5 @@ case class SingularValueDecomposition[UType, VType](U: UType, s: Vector, V: VTyp * Represents QR factors. */ @Experimental -case class QRDecomposition[UType, VType](Q: UType, R: VType) +case class QRDecomposition[QType, RType](Q: QType, R: RType) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
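The rename only fixes the type-parameter names: a QR factorization yields Q (orthonormal columns) and R (upper triangular), so QType/RType describe the factors where UType/VType (copied from the SVD case class) did not. A NumPy sketch of the factorization the case class carries:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
Q, R = np.linalg.qr(A)  # Q: 3x2 with orthonormal columns, R: 2x2 upper triangular

reconstructed_ok = bool(np.allclose(Q @ R, A))
orthonormal_ok = bool(np.allclose(Q.T @ Q, np.eye(2)))
upper_triangular_ok = bool(np.allclose(R, np.triu(R)))
```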
spark git commit: [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType
Repository: spark Updated Branches: refs/heads/branch-1.5 2f8793b5f - 31b7fdc06 [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType hhbyyh Author: Xiangrui Meng m...@databricks.com Closes #8140 from mengxr/SPARK-9912. (cherry picked from commit e6aef55766d0e2a48e0f9cb6eda0e31a71b962f3) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/31b7fdc0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/31b7fdc0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/31b7fdc0 Branch: refs/heads/branch-1.5 Commit: 31b7fdc06fc21fa38ac4530f9c70dd27b3b71578 Parents: 2f8793b Author: Xiangrui Meng m...@databricks.com Authored: Wed Aug 12 17:04:31 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 17:04:37 2015 -0700 -- .../org/apache/spark/mllib/linalg/SingularValueDecomposition.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/31b7fdc0/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala index b416d50..cff5dbe 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala @@ -31,5 +31,5 @@ case class SingularValueDecomposition[UType, VType](U: UType, s: Vector, V: VTyp * Represents QR factors. */ @Experimental -case class QRDecomposition[UType, VType](Q: UType, R: VType) +case class QRDecomposition[QType, RType](Q: QType, R: RType) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9915] [ML] stopWords should use StringArrayParam
Repository: spark
Updated Branches: refs/heads/branch-1.5 31b7fdc06 -> ed73f5439

[SPARK-9915] [ML] stopWords should use StringArrayParam

hhbyyh

Author: Xiangrui Meng <m...@databricks.com>

Closes #8141 from mengxr/SPARK-9915.

(cherry picked from commit fc1c7fd66e64ccea53b31cd2fbb98bc6d307329c)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ed73f543
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ed73f543
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ed73f543

Branch: refs/heads/branch-1.5
Commit: ed73f5439bbe3a09adf9a770c34b5d87b35499c8
Parents: 31b7fdc
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 17:06:12 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 17:06:19 2015 -0700

--
 .../scala/org/apache/spark/ml/feature/StopWordsRemover.scala | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ed73f543/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
index 3cc4142..5d77ea0 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
@@ -19,12 +19,12 @@ package org.apache.spark.ml.feature

 import org.apache.spark.annotation.Experimental
 import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.param.{BooleanParam, ParamMap, StringArrayParam}
 import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
-import org.apache.spark.ml.param.{ParamMap, BooleanParam, Param}
 import org.apache.spark.ml.util.Identifiable
 import org.apache.spark.sql.DataFrame
-import org.apache.spark.sql.types.{StringType, StructField, ArrayType, StructType}
 import org.apache.spark.sql.functions.{col, udf}
+import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

 /**
  * stop words list
@@ -100,7 +100,7 @@ class StopWordsRemover(override val uid: String)
    * the stop words set to be filtered out
    * @group param
    */
-  val stopWords: Param[Array[String]] = new Param(this, "stopWords", "stop words")
+  val stopWords: StringArrayParam = new StringArrayParam(this, "stopWords", "stop words")

   /** @group setParam */
   def setStopWords(value: Array[String]): this.type = set(stopWords, value)
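The `stopWords` parameter changed here is just a configurable string array that the transformer filters tokens against. The core operation can be sketched in a few lines of plain Python (the tiny default set and the `case_sensitive` flag are illustrative assumptions, not MLlib's actual defaults):

```python
# Illustrative only -- a tiny stand-in for the transformer's default stop-word list.
DEFAULT_STOP_WORDS = {"i", "a", "the", "and", "about"}

def remove_stop_words(tokens, stop_words=None, case_sensitive=False):
    """Drop tokens that appear in the stop-word set, mirroring the role of
    the stopWords string-array parameter in StopWordsRemover."""
    words = set(stop_words) if stop_words is not None else DEFAULT_STOP_WORDS
    if not case_sensitive:
        lowered = {w.lower() for w in words}
        return [t for t in tokens if t.lower() not in lowered]
    return [t for t in tokens if t not in words]
```

Exposing the word list as an array-typed parameter (rather than a generic `Param`) is what lets Java callers pass a `String[]` directly, which is the point of the patch.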
spark git commit: [SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes
Repository: spark
Updated Branches: refs/heads/branch-1.5 a06860c2f -> af470a757

[SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes

There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary.

feynmanliang

Author: Xiangrui Meng <m...@databricks.com>

Closes #8136 from mengxr/SPARK-9903.

(cherry picked from commit d7053bea985679c514b3add029631ea23e1730ce)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/af470a75
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/af470a75
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/af470a75

Branch: refs/heads/branch-1.5
Commit: af470a757c7aed81d626634590a0fb395f0241f5
Parents: a06860c
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 20:44:40 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 20:44:49 2015 -0700

--
 .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 37 +++-
 1 file changed, 21 insertions(+), 16 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/af470a75/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index ad6715b5..dc4ae1d 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -282,25 +282,30 @@ object PrefixSpan extends Logging {
       largePrefixes = newLargePrefixes
     }

-    // Switch to local processing.
-    val bcSmallPrefixes = sc.broadcast(smallPrefixes)
-    val distributedFreqPattern = postfixes.flatMap { postfix =>
-      bcSmallPrefixes.value.values.map { prefix =>
-        (prefix.id, postfix.project(prefix).compressed)
-      }.filter(_._2.nonEmpty)
-    }.groupByKey().flatMap { case (id, projPostfixes) =>
-      val prefix = bcSmallPrefixes.value(id)
-      val localPrefixSpan = new LocalPrefixSpan(minCount, maxPatternLength - prefix.length)
-      // TODO: We collect projected postfixes into memory. We should also compare the performance
-      // TODO: of keeping them on shuffle files.
-      localPrefixSpan.run(projPostfixes.toArray).map { case (pattern, count) =>
-        (prefix.items ++ pattern, count)
+    var freqPatterns = sc.parallelize(localFreqPatterns, 1)
+
+    val numSmallPrefixes = smallPrefixes.size
+    logInfo(s"number of small prefixes for local processing: $numSmallPrefixes")
+    if (numSmallPrefixes > 0) {
+      // Switch to local processing.
+      val bcSmallPrefixes = sc.broadcast(smallPrefixes)
+      val distributedFreqPattern = postfixes.flatMap { postfix =>
+        bcSmallPrefixes.value.values.map { prefix =>
+          (prefix.id, postfix.project(prefix).compressed)
+        }.filter(_._2.nonEmpty)
+      }.groupByKey().flatMap { case (id, projPostfixes) =>
+        val prefix = bcSmallPrefixes.value(id)
+        val localPrefixSpan = new LocalPrefixSpan(minCount, maxPatternLength - prefix.length)
+        // TODO: We collect projected postfixes into memory. We should also compare the performance
+        // TODO: of keeping them on shuffle files.
+        localPrefixSpan.run(projPostfixes.toArray).map { case (pattern, count) =>
+          (prefix.items ++ pattern, count)
+        }
       }
+      // Union local frequent patterns and distributed ones.
+      freqPatterns = freqPatterns ++ distributedFreqPattern
     }
-    // Union local frequent patterns and distributed ones.
-    val freqPatterns = (sc.parallelize(localFreqPatterns, 1) ++ distributedFreqPattern)
-      .persist(StorageLevel.MEMORY_AND_DISK)
     freqPatterns
   }
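For readers unfamiliar with the algorithm being patched: PrefixSpan mines frequent sequential patterns by growing prefixes and recursing on the "projected" postfixes of each prefix. The following is a deliberately tiny single-machine sketch over sequences of single items (nothing like MLlib's distributed implementation, which splits work into large prefixes handled distributively and small prefixes handled locally -- the step this patch skips when no small prefixes remain):

```python
from collections import Counter

def prefix_span(sequences, min_support, max_length):
    """Tiny PrefixSpan-style miner: returns {pattern_tuple: support} for all
    patterns up to max_length occurring in >= min_support sequences."""
    results = {}

    def grow(prefix, projected):
        if len(prefix) == max_length:
            # Analogous to the patched code path: if every surviving prefix has
            # already grown to the maximum pattern length, there is no further
            # (local) work to do.
            return
        counts = Counter()
        for postfix in projected:
            counts.update(set(postfix))  # count each item once per sequence
        for item, support in counts.items():
            if support >= min_support:
                pattern = prefix + (item,)
                results[pattern] = support
                # project each postfix past the first occurrence of `item`
                grow(pattern, [p[p.index(item) + 1:] for p in projected if item in p])

    grow((), sequences)
    return results
```

The guard `if (numSmallPrefixes > 0)` added by the patch plays the same role as the early `return` above: when the candidate set for further (local) processing is empty, the whole projection/group-by stage is skipped instead of being run on nothing.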
spark git commit: [SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler
Repository: spark
Updated Branches: refs/heads/branch-1.5 8229437c3 -> 16f4bf4ca

[SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler

hhbyyh

Author: Xiangrui Meng <m...@databricks.com>

Closes #8145 from mengxr/SPARK-9917.

(cherry picked from commit 5fc058a1fc5d83ad53feec936475484aef3800b3)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/16f4bf4c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/16f4bf4c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/16f4bf4c

Branch: refs/heads/branch-1.5
Commit: 16f4bf4caa9c6a1403252485470466266d6b1b65
Parents: 8229437
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 21:33:38 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 21:33:46 2015 -0700

--
 .../scala/org/apache/spark/ml/feature/MinMaxScaler.scala | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/16f4bf4c/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
index b30adf3..9a473dd 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
@@ -41,6 +41,9 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
   val min: DoubleParam = new DoubleParam(this, "min", "lower bound of the output feature range")

+  /** @group getParam */
+  def getMin: Double = $(min)
+
   /**
    * upper bound after transformation, shared by all features
    * Default: 1.0
@@ -49,6 +52,9 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
   val max: DoubleParam = new DoubleParam(this, "max", "upper bound of the output feature range")

+  /** @group getParam */
+  def getMax: Double = $(max)
+
   /** Validates and transforms the input schema. */
   protected def validateAndTransformSchema(schema: StructType): StructType = {
     val inputType = schema($(inputCol)).dataType
@@ -115,6 +121,9 @@ class MinMaxScaler(override val uid: String)
  * :: Experimental ::
  * Model fitted by [[MinMaxScaler]].
  *
+ * @param originalMin min value for each original column during fitting
+ * @param originalMax max value for each original column during fitting
+ *
  * TODO: The transformer does not yet set the metadata in the output column (SPARK-8529).
  */
 @Experimental
@@ -136,7 +145,6 @@ class MinMaxScalerModel private[ml] (
   /** @group setParam */
   def setMax(value: Double): this.type = set(max, value)

-
   override def transform(dataset: DataFrame): DataFrame = {
     val originalRange = (originalMax.toBreeze - originalMin.toBreeze).toArray
     val minArray = originalMin.toArray
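The `min`/`max` parameters documented above bound the output range, while `originalMin`/`originalMax` are the per-column statistics captured during fitting. The rescaling the fitted model applies can be sketched for a single column in plain Python (the midpoint behavior for a constant column is an assumption for illustration):

```python
def min_max_scale(column, new_min=0.0, new_max=1.0):
    """Rescale a numeric column to [new_min, new_max] using its fitted
    original min/max (the role of originalMin/originalMax in the model)."""
    original_min, original_max = min(column), max(column)
    original_range = original_max - original_min
    if original_range == 0:
        # Constant column: map everything to the midpoint of the target range
        # (illustrative handling of the degenerate case).
        return [0.5 * (new_min + new_max)] * len(column)
    scale = (new_max - new_min) / original_range
    return [(x - original_min) * scale + new_min for x in column]
```

This mirrors the `transform` body in the diff, where `originalRange = originalMax - originalMin` per column and each value is shifted and scaled into the `[getMin, getMax]` interval.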
spark git commit: [SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler
Repository: spark
Updated Branches: refs/heads/master a8ab2634c -> 5fc058a1f

[SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler

hhbyyh

Author: Xiangrui Meng <m...@databricks.com>

Closes #8145 from mengxr/SPARK-9917.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/5fc058a1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/5fc058a1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/5fc058a1

Branch: refs/heads/master
Commit: 5fc058a1fc5d83ad53feec936475484aef3800b3
Parents: a8ab263
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 21:33:38 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 21:33:38 2015 -0700

--
 .../scala/org/apache/spark/ml/feature/MinMaxScaler.scala | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/5fc058a1/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
index b30adf3..9a473dd 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
@@ -41,6 +41,9 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
   val min: DoubleParam = new DoubleParam(this, "min", "lower bound of the output feature range")

+  /** @group getParam */
+  def getMin: Double = $(min)
+
   /**
    * upper bound after transformation, shared by all features
    * Default: 1.0
@@ -49,6 +52,9 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
   val max: DoubleParam = new DoubleParam(this, "max", "upper bound of the output feature range")

+  /** @group getParam */
+  def getMax: Double = $(max)
+
   /** Validates and transforms the input schema. */
   protected def validateAndTransformSchema(schema: StructType): StructType = {
     val inputType = schema($(inputCol)).dataType
@@ -115,6 +121,9 @@ class MinMaxScaler(override val uid: String)
  * :: Experimental ::
  * Model fitted by [[MinMaxScaler]].
  *
+ * @param originalMin min value for each original column during fitting
+ * @param originalMax max value for each original column during fitting
+ *
  * TODO: The transformer does not yet set the metadata in the output column (SPARK-8529).
  */
 @Experimental
@@ -136,7 +145,6 @@ class MinMaxScalerModel private[ml] (
   /** @group setParam */
   def setMax(value: Double): this.type = set(max, value)

-
   override def transform(dataset: DataFrame): DataFrame = {
     val originalRange = (originalMax.toBreeze - originalMin.toBreeze).toArray
     val minArray = originalMin.toArray
spark git commit: [SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation
Repository: spark
Updated Branches: refs/heads/master 5fc058a1f -> df5438921

[SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation

Author: shikai.tang <tar.sk...@gmail.com>

Closes #7429 from mosessky/master.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/df543892
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/df543892
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/df543892

Branch: refs/heads/master
Commit: df543892122342b97e5137b266959ba97589b3ef
Parents: 5fc058a
Author: shikai.tang <tar.sk...@gmail.com>
Authored: Wed Aug 12 21:53:15 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 21:53:15 2015 -0700

--
 .../BinaryClassificationMetrics.scala           | 32 +---
 .../mllib/evaluation/MulticlassMetrics.scala    |  9 ++
 .../mllib/evaluation/MultilabelMetrics.scala    |  4 +++
 .../spark/mllib/evaluation/RankingMetrics.scala |  4 +++
 .../mllib/evaluation/RegressionMetrics.scala    |  6
 5 files changed, 50 insertions(+), 5 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/df543892/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
index c1d1a22..486741e 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
@@ -41,6 +41,7 @@ import org.apache.spark.sql.DataFrame
  *    of bins may not exactly equal numBins. The last bin in each partition may
  *    be smaller as a result, meaning there may be an extra sample at
  *    partition boundaries.
+ * @since 1.3.0
  */
 @Experimental
 class BinaryClassificationMetrics(
@@ -51,6 +52,7 @@ class BinaryClassificationMetrics(

   /**
    * Defaults `numBins` to 0.
+   * @since 1.0.0
    */
   def this(scoreAndLabels: RDD[(Double, Double)]) = this(scoreAndLabels, 0)

@@ -61,12 +63,18 @@ class BinaryClassificationMetrics(
   private[mllib] def this(scoreAndLabels: DataFrame) =
     this(scoreAndLabels.map(r => (r.getDouble(0), r.getDouble(1))))

-  /** Unpersist intermediate RDDs used in the computation. */
+  /**
+   * Unpersist intermediate RDDs used in the computation.
+   * @since 1.0.0
+   */
   def unpersist() {
     cumulativeCounts.unpersist()
   }

-  /** Returns thresholds in descending order. */
+  /**
+   * Returns thresholds in descending order.
+   * @since 1.0.0
+   */
   def thresholds(): RDD[Double] = cumulativeCounts.map(_._1)

   /**
@@ -74,6 +82,7 @@ class BinaryClassificationMetrics(
    * which is an RDD of (false positive rate, true positive rate)
    * with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
    * @see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
+   * @since 1.0.0
    */
   def roc(): RDD[(Double, Double)] = {
     val rocCurve = createCurve(FalsePositiveRate, Recall)
@@ -85,6 +94,7 @@ class BinaryClassificationMetrics(

   /**
    * Computes the area under the receiver operating characteristic (ROC) curve.
+   * @since 1.0.0
    */
   def areaUnderROC(): Double = AreaUnderCurve.of(roc())

@@ -92,6 +102,7 @@ class BinaryClassificationMetrics(
    * Returns the precision-recall curve, which is an RDD of (recall, precision),
    * NOT (precision, recall), with (0.0, 1.0) prepended to it.
    * @see http://en.wikipedia.org/wiki/Precision_and_recall
+   * @since 1.0.0
    */
   def pr(): RDD[(Double, Double)] = {
     val prCurve = createCurve(Recall, Precision)
@@ -102,6 +113,7 @@ class BinaryClassificationMetrics(

   /**
    * Computes the area under the precision-recall curve.
+   * @since 1.0.0
    */
   def areaUnderPR(): Double = AreaUnderCurve.of(pr())

@@ -110,16 +122,26 @@ class BinaryClassificationMetrics(
    * @param beta the beta factor in F-Measure computation.
    * @return an RDD of (threshold, F-Measure) pairs.
    * @see http://en.wikipedia.org/wiki/F1_score
+   * @since 1.0.0
    */
   def fMeasureByThreshold(beta: Double): RDD[(Double, Double)] = createCurve(FMeasure(beta))

-  /** Returns the (threshold, F-Measure) curve with beta = 1.0. */
+  /**
+   * Returns the (threshold, F-Measure) curve with beta = 1.0.
+   * @since 1.0.0
+   */
   def fMeasureByThreshold(): RDD[(Double, Double)] = fMeasureByThreshold(1.0)

-  /** Returns the (threshold, precision) curve. */
+  /**
+   * Returns the (threshold, precision) curve.
+   * @since 1.0.0
+   */
   def precisionByThreshold(): RDD[(Double
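The methods annotated above compute threshold-indexed curves from (score, label) pairs. A minimal single-machine sketch of `fMeasureByThreshold` (not MLlib's binned, RDD-based implementation -- in particular it does not deduplicate equal scores, and the helper name is hypothetical):

```python
def f_measure_by_threshold(score_and_labels, beta=1.0):
    """Return (threshold, F-beta) pairs in descending threshold order from
    (score, label) pairs with labels in {0.0, 1.0}.

    F-beta = (1 + beta^2) * P * R / (beta^2 * P + R), where at each
    threshold P and R are computed over examples scored at or above it."""
    beta2 = beta * beta
    total_pos = sum(label for _, label in score_and_labels)
    curve = []
    tp = fp = 0
    for score, label in sorted(score_and_labels, reverse=True):
        if label == 1.0:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        denom = beta2 * precision + recall
        f = (1 + beta2) * precision * recall / denom if denom else 0.0
        curve.append((score, f))
    return curve
```

`precisionByThreshold` and `recallByThreshold` follow the same sweep, emitting `precision` or `recall` instead of `f` at each threshold.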
spark git commit: [SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation
Repository: spark
Updated Branches: refs/heads/branch-1.5 8f055e595 -> 690284037

[SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation

Author: shikai.tang <tar.sk...@gmail.com>

Closes #7429 from mosessky/master.

(cherry picked from commit df543892122342b97e5137b266959ba97589b3ef)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/69028403
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/69028403
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/69028403

Branch: refs/heads/branch-1.5
Commit: 690284037ecd880d48d5e835b150a2f31feb7c73
Parents: 8f055e5
Author: shikai.tang <tar.sk...@gmail.com>
Authored: Wed Aug 12 21:53:15 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 21:53:24 2015 -0700

--
 .../BinaryClassificationMetrics.scala           | 32 +---
 .../mllib/evaluation/MulticlassMetrics.scala    |  9 ++
 .../mllib/evaluation/MultilabelMetrics.scala    |  4 +++
 .../spark/mllib/evaluation/RankingMetrics.scala |  4 +++
 .../mllib/evaluation/RegressionMetrics.scala    |  6
 5 files changed, 50 insertions(+), 5 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/69028403/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
index c1d1a22..486741e 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
@@ -41,6 +41,7 @@ import org.apache.spark.sql.DataFrame
  *    of bins may not exactly equal numBins. The last bin in each partition may
  *    be smaller as a result, meaning there may be an extra sample at
  *    partition boundaries.
+ * @since 1.3.0
  */
 @Experimental
 class BinaryClassificationMetrics(
@@ -51,6 +52,7 @@ class BinaryClassificationMetrics(

   /**
    * Defaults `numBins` to 0.
+   * @since 1.0.0
    */
   def this(scoreAndLabels: RDD[(Double, Double)]) = this(scoreAndLabels, 0)

@@ -61,12 +63,18 @@ class BinaryClassificationMetrics(
   private[mllib] def this(scoreAndLabels: DataFrame) =
     this(scoreAndLabels.map(r => (r.getDouble(0), r.getDouble(1))))

-  /** Unpersist intermediate RDDs used in the computation. */
+  /**
+   * Unpersist intermediate RDDs used in the computation.
+   * @since 1.0.0
+   */
   def unpersist() {
     cumulativeCounts.unpersist()
   }

-  /** Returns thresholds in descending order. */
+  /**
+   * Returns thresholds in descending order.
+   * @since 1.0.0
+   */
   def thresholds(): RDD[Double] = cumulativeCounts.map(_._1)

   /**
@@ -74,6 +82,7 @@ class BinaryClassificationMetrics(
    * which is an RDD of (false positive rate, true positive rate)
    * with (0.0, 0.0) prepended and (1.0, 1.0) appended to it.
    * @see http://en.wikipedia.org/wiki/Receiver_operating_characteristic
+   * @since 1.0.0
    */
   def roc(): RDD[(Double, Double)] = {
     val rocCurve = createCurve(FalsePositiveRate, Recall)
@@ -85,6 +94,7 @@ class BinaryClassificationMetrics(

   /**
    * Computes the area under the receiver operating characteristic (ROC) curve.
+   * @since 1.0.0
    */
   def areaUnderROC(): Double = AreaUnderCurve.of(roc())

@@ -92,6 +102,7 @@ class BinaryClassificationMetrics(
    * Returns the precision-recall curve, which is an RDD of (recall, precision),
    * NOT (precision, recall), with (0.0, 1.0) prepended to it.
    * @see http://en.wikipedia.org/wiki/Precision_and_recall
+   * @since 1.0.0
    */
   def pr(): RDD[(Double, Double)] = {
     val prCurve = createCurve(Recall, Precision)
@@ -102,6 +113,7 @@ class BinaryClassificationMetrics(

   /**
    * Computes the area under the precision-recall curve.
+   * @since 1.0.0
    */
   def areaUnderPR(): Double = AreaUnderCurve.of(pr())

@@ -110,16 +122,26 @@ class BinaryClassificationMetrics(
    * @param beta the beta factor in F-Measure computation.
    * @return an RDD of (threshold, F-Measure) pairs.
    * @see http://en.wikipedia.org/wiki/F1_score
+   * @since 1.0.0
    */
   def fMeasureByThreshold(beta: Double): RDD[(Double, Double)] = createCurve(FMeasure(beta))

-  /** Returns the (threshold, F-Measure) curve with beta = 1.0. */
+  /**
+   * Returns the (threshold, F-Measure) curve with beta = 1.0.
+   * @since 1.0.0
+   */
   def fMeasureByThreshold(): RDD[(Double, Double)] = fMeasureByThreshold(1.0)

-  /** Returns the (threshold, precision) curve
spark git commit: [SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes
Repository: spark
Updated Branches: refs/heads/master d2d5e7fe2 -> d7053bea9

[SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes

There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary.

feynmanliang

Author: Xiangrui Meng <m...@databricks.com>

Closes #8136 from mengxr/SPARK-9903.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7053bea
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7053bea
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7053bea

Branch: refs/heads/master
Commit: d7053bea985679c514b3add029631ea23e1730ce
Parents: d2d5e7f
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 20:44:40 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 20:44:40 2015 -0700

--
 .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 37 +++-
 1 file changed, 21 insertions(+), 16 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d7053bea/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index ad6715b5..dc4ae1d 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -282,25 +282,30 @@ object PrefixSpan extends Logging {
       largePrefixes = newLargePrefixes
     }

-    // Switch to local processing.
-    val bcSmallPrefixes = sc.broadcast(smallPrefixes)
-    val distributedFreqPattern = postfixes.flatMap { postfix =>
-      bcSmallPrefixes.value.values.map { prefix =>
-        (prefix.id, postfix.project(prefix).compressed)
-      }.filter(_._2.nonEmpty)
-    }.groupByKey().flatMap { case (id, projPostfixes) =>
-      val prefix = bcSmallPrefixes.value(id)
-      val localPrefixSpan = new LocalPrefixSpan(minCount, maxPatternLength - prefix.length)
-      // TODO: We collect projected postfixes into memory. We should also compare the performance
-      // TODO: of keeping them on shuffle files.
-      localPrefixSpan.run(projPostfixes.toArray).map { case (pattern, count) =>
-        (prefix.items ++ pattern, count)
+    var freqPatterns = sc.parallelize(localFreqPatterns, 1)
+
+    val numSmallPrefixes = smallPrefixes.size
+    logInfo(s"number of small prefixes for local processing: $numSmallPrefixes")
+    if (numSmallPrefixes > 0) {
+      // Switch to local processing.
+      val bcSmallPrefixes = sc.broadcast(smallPrefixes)
+      val distributedFreqPattern = postfixes.flatMap { postfix =>
+        bcSmallPrefixes.value.values.map { prefix =>
+          (prefix.id, postfix.project(prefix).compressed)
+        }.filter(_._2.nonEmpty)
+      }.groupByKey().flatMap { case (id, projPostfixes) =>
+        val prefix = bcSmallPrefixes.value(id)
+        val localPrefixSpan = new LocalPrefixSpan(minCount, maxPatternLength - prefix.length)
+        // TODO: We collect projected postfixes into memory. We should also compare the performance
+        // TODO: of keeping them on shuffle files.
+        localPrefixSpan.run(projPostfixes.toArray).map { case (pattern, count) =>
+          (prefix.items ++ pattern, count)
+        }
       }
+      // Union local frequent patterns and distributed ones.
+      freqPatterns = freqPatterns ++ distributedFreqPattern
     }
-    // Union local frequent patterns and distributed ones.
-    val freqPatterns = (sc.parallelize(localFreqPatterns, 1) ++ distributedFreqPattern)
-      .persist(StorageLevel.MEMORY_AND_DISK)
     freqPatterns
   }
spark git commit: [SPARK-9914] [ML] define setters explicitly for Java and use setParam group in RFormula
Repository: spark
Updated Branches: refs/heads/master df5438921 -> d7eb371eb

[SPARK-9914] [ML] define setters explicitly for Java and use setParam group in RFormula

The problem with defining setters in the base class is that it doesn't return the correct type in Java.

ericl

Author: Xiangrui Meng <m...@databricks.com>

Closes #8143 from mengxr/SPARK-9914 and squashes the following commits:

d36c887 [Xiangrui Meng] remove setters from model
a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d7eb371e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d7eb371e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d7eb371e

Branch: refs/heads/master
Commit: d7eb371eb6369a34e58a09179efe058c4101de9e
Parents: df54389
Author: Xiangrui Meng <m...@databricks.com>
Authored: Wed Aug 12 22:30:33 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 22:30:33 2015 -0700

--
 .../scala/org/apache/spark/ml/feature/RFormula.scala | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/d7eb371e/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
index d5360c9..a752dac 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
@@ -33,11 +33,6 @@ import org.apache.spark.sql.types._
  * Base trait for [[RFormula]] and [[RFormulaModel]].
  */
 private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {

-  /** @group getParam */
-  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
-
-  /** @group getParam */
-  def setLabelCol(value: String): this.type = set(labelCol, value)

   protected def hasLabelCol(schema: StructType): Boolean = {
     schema.map(_.name).contains($(labelCol))
@@ -71,6 +66,12 @@ class RFormula(override val uid: String) extends Estimator[RFormulaModel] with R
   /** @group getParam */
   def getFormula: String = $(formula)

+  /** @group setParam */
+  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
+
+  /** @group setParam */
+  def setLabelCol(value: String): this.type = set(labelCol, value)
+
   /** Whether the formula specifies fitting an intercept. */
   private[ml] def hasIntercept: Boolean = {
     require(isDefined(formula), "Formula must be defined first.")
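The motivation here is chainable setters: Scala's `this.type` guarantees each setter returns the concrete class, but when the setters live in a base trait, Java callers see the base type and lose the ability to keep chaining. A hypothetical Python sketch of the fluent-setter pattern itself (in Python, returning `self` preserves the concrete type dynamically, which is exactly what the Java view of the Scala base trait could not do statically):

```python
class PipelineStage:
    """Minimal stand-in for the shared base that holds a parameter map."""
    def __init__(self):
        self.params = {}

class RFormulaSketch(PipelineStage):
    """Hypothetical fluent builder: every setter returns self so calls chain
    and the caller keeps the concrete type throughout."""
    def set_formula(self, value):
        self.params["formula"] = value
        return self

    def set_features_col(self, value):
        self.params["featuresCol"] = value
        return self

    def set_label_col(self, value):
        self.params["labelCol"] = value
        return self
```

Defining the setters on the concrete class (as the patch does) is the statically-typed analogue of this: the chain `new RFormula().setFormula(...).setLabelCol(...)` then type-checks as `RFormula` from Java as well.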
spark git commit: [SPARK-7583] [MLLIB] User guide update for RegexTokenizer
Repository: spark
Updated Branches: refs/heads/branch-1.5 bc4ac65d4 -> 2d86faddd

[SPARK-7583] [MLLIB] User guide update for RegexTokenizer

jira: https://issues.apache.org/jira/browse/SPARK-7583

User guide update for RegexTokenizer

Author: Yuhao Yang <hhb...@gmail.com>

Closes #7828 from hhbyyh/regexTokenizerDoc.

(cherry picked from commit 66d87c1d76bea2b81993156ac1fa7dad6c312ebf)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/2d86fadd
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/2d86fadd
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/2d86fadd

Branch: refs/heads/branch-1.5
Commit: 2d86faddd87b6e61565cbdf18dadaf4aeb2b223e
Parents: bc4ac65
Author: Yuhao Yang <hhb...@gmail.com>
Authored: Wed Aug 12 09:35:32 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 12 09:35:41 2015 -0700

--
 docs/ml-features.md | 41 ++---
 1 file changed, 30 insertions(+), 11 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/2d86fadd/docs/ml-features.md

diff --git a/docs/ml-features.md b/docs/ml-features.md
index fa0ad1f..cec2cbe 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -217,21 +217,32 @@
 [Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer) class provides this functionality. The example below shows how to split sentences into sequences of words.

-Note: A more advanced tokenizer is provided via [RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer).
+[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more
+ advanced tokenization based on regular expression (regex) matching.
+ By default, the parameter "pattern" (regex, default: "\\s+") is used as delimiters to split the input text.
+ Alternatively, users can set parameter "gaps" to false indicating the regex "pattern" denotes
+ "tokens" rather than splitting gaps, and find all matching occurrences as the tokenization result.

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 {% highlight scala %}
-import org.apache.spark.ml.feature.Tokenizer
+import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer}

 val sentenceDataFrame = sqlContext.createDataFrame(Seq(
   (0, "Hi I heard about Spark"),
-  (0, "I wish Java could use case classes"),
-  (1, "Logistic regression models are neat")
+  (1, "I wish Java could use case classes"),
+  (2, "Logistic,regression,models,are,neat")
 )).toDF("label", "sentence")
 val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
-val wordsDataFrame = tokenizer.transform(sentenceDataFrame)
-wordsDataFrame.select("words", "label").take(3).foreach(println)
+val regexTokenizer = new RegexTokenizer()
+  .setInputCol("sentence")
+  .setOutputCol("words")
+  .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)
+
+val tokenized = tokenizer.transform(sentenceDataFrame)
+tokenized.select("words", "label").take(3).foreach(println)
+val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
+regexTokenized.select("words", "label").take(3).foreach(println)
 {% endhighlight %}
 </div>

@@ -240,6 +251,7 @@
 import com.google.common.collect.Lists;

 import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.ml.feature.RegexTokenizer;
 import org.apache.spark.ml.feature.Tokenizer;
 import org.apache.spark.mllib.linalg.Vector;
 import org.apache.spark.sql.DataFrame;
@@ -252,8 +264,8 @@
 JavaRDD<Row> jrdd = jsc.parallelize(Lists.newArrayList(
   RowFactory.create(0, "Hi I heard about Spark"),
-  RowFactory.create(0, "I wish Java could use case classes"),
-  RowFactory.create(1, "Logistic regression models are neat")
+  RowFactory.create(1, "I wish Java could use case classes"),
+  RowFactory.create(2, "Logistic,regression,models,are,neat")
 ));
 StructType schema = new StructType(new StructField[]{
   new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
@@ -267,22 +279,29 @@
 for (Row r : wordsDataFrame.select("words", "label").take(3)) {
   for (String word : words) System.out.print(word + " ");
   System.out.println();
 }
+
+RegexTokenizer regexTokenizer = new RegexTokenizer()
+  .setInputCol("sentence")
+  .setOutputCol("words")
+  .setPattern("\\W"); // alternatively .setPattern("\\w+").setGaps(false);
 {% endhighlight %}
 </div>

 <div data-lang="python" markdown="1">
 {% highlight
spark git commit: [SPARK-7583] [MLLIB] User guide update for RegexTokenizer
Repository: spark Updated Branches: refs/heads/master be5d19120 - 66d87c1d7 [SPARK-7583] [MLLIB] User guide update for RegexTokenizer jira: https://issues.apache.org/jira/browse/SPARK-7583 User guide update for RegexTokenizer Author: Yuhao Yang hhb...@gmail.com Closes #7828 from hhbyyh/regexTokenizerDoc. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/66d87c1d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/66d87c1d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/66d87c1d Branch: refs/heads/master Commit: 66d87c1d76bea2b81993156ac1fa7dad6c312ebf Parents: be5d191 Author: Yuhao Yang hhb...@gmail.com Authored: Wed Aug 12 09:35:32 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 09:35:32 2015 -0700 -- docs/ml-features.md | 41 ++--- 1 file changed, 30 insertions(+), 11 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/66d87c1d/docs/ml-features.md -- diff --git a/docs/ml-features.md b/docs/ml-features.md index fa0ad1f..cec2cbe 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -217,21 +217,32 @@ for feature in result.select(result).take(3): [Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer) class provides this functionality. The example below shows how to split sentences into sequences of words. -Note: A more advanced tokenizer is provided via [RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer). +[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more + advanced tokenization based on regular expression (regex) matching. + By default, the parameter pattern (regex, default: \\s+) is used as delimiters to split the input text. 
+ Alternatively, users can set parameter gaps to false indicating the regex pattern denotes + tokens rather than splitting gaps, and find all matching occurrences as the tokenization result. div class=codetabs div data-lang=scala markdown=1 {% highlight scala %} -import org.apache.spark.ml.feature.Tokenizer +import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer} val sentenceDataFrame = sqlContext.createDataFrame(Seq( (0, Hi I heard about Spark), - (0, I wish Java could use case classes), - (1, Logistic regression models are neat) + (1, I wish Java could use case classes), + (2, Logistic,regression,models,are,neat) )).toDF(label, sentence) val tokenizer = new Tokenizer().setInputCol(sentence).setOutputCol(words) -val wordsDataFrame = tokenizer.transform(sentenceDataFrame) -wordsDataFrame.select(words, label).take(3).foreach(println) +val regexTokenizer = new RegexTokenizer() + .setInputCol(sentence) + .setOutputCol(words) + .setPattern(\\W) // alternatively .setPattern(\\w+).setGaps(false) + +val tokenized = tokenizer.transform(sentenceDataFrame) +tokenized.select(words, label).take(3).foreach(println) +val regexTokenized = regexTokenizer.transform(sentenceDataFrame) +regexTokenized.select(words, label).take(3).foreach(println) {% endhighlight %} /div @@ -240,6 +251,7 @@ wordsDataFrame.select(words, label).take(3).foreach(println) import com.google.common.collect.Lists; import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.ml.feature.RegexTokenizer; import org.apache.spark.ml.feature.Tokenizer; import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.sql.DataFrame; @@ -252,8 +264,8 @@ import org.apache.spark.sql.types.StructType; JavaRDDRow jrdd = jsc.parallelize(Lists.newArrayList( RowFactory.create(0, Hi I heard about Spark), - RowFactory.create(0, I wish Java could use case classes), - RowFactory.create(1, Logistic regression models are neat) + RowFactory.create(1, I wish Java could use case classes), + RowFactory.create(2, 
Logistic,regression,models,are,neat) )); StructType schema = new StructType(new StructField[]{ new StructField(label, DataTypes.DoubleType, false, Metadata.empty()), @@ -267,22 +279,29 @@ for (Row r : wordsDataFrame.select(words, label).take(3)) { for (String word : words) System.out.print(word + ); System.out.println(); } + +RegexTokenizer regexTokenizer = new RegexTokenizer() + .setInputCol(sentence) + .setOutputCol(words) + .setPattern(\\W); // alternatively .setPattern(\\w+).setGaps(false); {% endhighlight %} /div div data-lang=python markdown=1 {% highlight python %} -from pyspark.ml.feature import Tokenizer +from pyspark.ml.feature import Tokenizer, RegexTokenizer sentenceDataFrame
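The two tokenization modes this commit documents — treating the pattern as a delimiter between tokens (the default, `gaps=true`) versus treating the pattern as the tokens themselves (`gaps=false`) — can be illustrated outside Spark with plain regular expressions. This is a minimal pure-Python sketch of the semantics only, not Spark's `RegexTokenizer` implementation; the `tokenize` helper is a name invented for the example.

```python
import re

def tokenize(text, pattern=r"\s+", gaps=True):
    # gaps=True: the pattern marks the delimiters between tokens
    # (default \s+), as the updated guide describes for RegexTokenizer.
    # gaps=False: the pattern describes the tokens themselves, and every
    # match becomes a token.
    if gaps:
        return [t for t in re.split(pattern, text) if t]
    return re.findall(pattern, text)

sentence = "Logistic,regression,models,are,neat"
print(tokenize(sentence))                     # default \s+ finds no gaps: one token
print(tokenize(sentence, r"\W", gaps=True))   # split on non-word characters
print(tokenize(sentence, r"\w+", gaps=False))  # match runs of word characters directly
```

Note how the comma-separated sentence added to the example data motivates the switch: the plain whitespace tokenizer leaves it as a single token, while either `\W` as a gap pattern or `\w+` as a token pattern recovers the five words.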
spark git commit: [SPARK-9847] [ML] Modified copyValues to distinguish between default, explicit param values
Repository: spark Updated Branches: refs/heads/master 57ec27dd7 - 70fe55886 [SPARK-9847] [ML] Modified copyValues to distinguish between default, explicit param values From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics. This issue arose in SPARK-9789, where 2 params threshold and thresholds for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params. CC: mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #8115 from jkbradley/copyvalues-fix. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/70fe5588 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/70fe5588 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/70fe5588 Branch: refs/heads/master Commit: 70fe558867ccb4bcff6ec673438b03608bb02252 Parents: 57ec27d Author: Joseph K. 
Bradley jos...@databricks.com Authored: Wed Aug 12 10:48:52 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 12 10:48:52 2015 -0700 -- .../scala/org/apache/spark/ml/param/params.scala | 19 --- .../org/apache/spark/ml/param/ParamsSuite.scala | 8 2 files changed, 24 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/70fe5588/mllib/src/main/scala/org/apache/spark/ml/param/params.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/param/params.scala b/mllib/src/main/scala/org/apache/spark/ml/param/params.scala index d68f5ff..91c0a56 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/param/params.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/param/params.scala @@ -559,13 +559,26 @@ trait Params extends Identifiable with Serializable { /** * Copies param values from this instance to another instance for params shared by them. - * @param to the target instance - * @param extra extra params to be copied + * + * This handles default Params and explicitly set Params separately. + * Default Params are copied from and to [[defaultParamMap]], and explicitly set Params are + * copied from and to [[paramMap]]. + * Warning: This implicitly assumes that this [[Params]] instance and the target instance + * share the same set of default Params. 
+ * + * @param to the target instance, which should work with the same set of default Params as this + * source instance + * @param extra extra params to be copied to the target's [[paramMap]] * @return the target instance with param values copied */ protected def copyValues[T : Params](to: T, extra: ParamMap = ParamMap.empty): T = { -val map = extractParamMap(extra) +val map = paramMap ++ extra params.foreach { param = + // copy default Params + if (defaultParamMap.contains(param) to.hasParam(param.name)) { +to.defaultParamMap.put(to.getParam(param.name), defaultParamMap(param)) + } + // copy explicitly set Params if (map.contains(param) to.hasParam(param.name)) { to.set(param.name, map(param)) } http://git-wip-us.apache.org/repos/asf/spark/blob/70fe5588/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala -- diff --git a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala index 050d417..be95638 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/param/ParamsSuite.scala @@ -200,6 +200,14 @@ class ParamsSuite extends SparkFunSuite { val inArray = ParamValidators.inArray[Int](Array(1, 2)) assert(inArray(1) inArray(2) !inArray(0)) } + + test(Params.copyValues) { +val t = new TestParams() +val t2 = t.copy(ParamMap.empty) +assert(!t2.isSet(t2.maxIter)) +val t3 = t.copy(ParamMap(t.maxIter - 20)) +assert(t3.isSet(t3.maxIter)) + } } object ParamsSuite extends SparkFunSuite { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
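The distinction the fix draws — defaults copied into the target's `defaultParamMap`, explicitly set values into its `paramMap` — can be sketched with a toy two-level parameter store. This is an illustration of the copy semantics described in the commit message, not Spark's `Params` trait; class and method names here are invented for the example.

```python
class ToyParams:
    """Toy model of two-level param storage: defaults in default_map,
    user-set values in param_map (mirroring defaultParamMap / paramMap)."""
    def __init__(self):
        self.default_map = {}
        self.param_map = {}

    def copy_values(self, to, extra=None):
        # Defaults go to the target's *default* map, so they never look
        # explicitly set on the target...
        for k, v in self.default_map.items():
            to.default_map[k] = v
        # ...while explicitly set params (plus extras) go to param_map.
        for k, v in dict(self.param_map, **(extra or {})).items():
            to.param_map[k] = v
        return to

    def is_set(self, name):
        return name in self.param_map

src = ToyParams()
src.default_map["threshold"] = 0.5        # default, never set by the user
src.param_map["thresholds"] = [0.2, 0.8]  # explicitly set by the user

dst = src.copy_values(ToyParams())
print(dst.is_set("threshold"))   # False: the default did not leak into param_map
print(dst.is_set("thresholds"))  # True
```

With the pre-fix behavior (everything copied into `param_map`), `threshold` would appear explicitly set on the copy, conflicting with the explicitly set `thresholds` — exactly the inconsistency cited from SPARK-9789.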
spark git commit: Closes #1290 Closes #4934
Repository: spark Updated Branches: refs/heads/master f16bc68df - 423cdfd83 Closes #1290 Closes #4934 Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/423cdfd8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/423cdfd8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/423cdfd8 Branch: refs/heads/master Commit: 423cdfd83d7fd02a4f8cf3e714db913fd3f9ca09 Parents: f16bc68 Author: Xiangrui Meng m...@databricks.com Authored: Tue Aug 11 14:08:09 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 14:08:09 2015 -0700 -- --
spark git commit: [SPARK-8925] [MLLIB] Add @since tags to mllib.util
Repository: spark Updated Branches: refs/heads/branch-1.5 2273e7432 - ef961ed48 [SPARK-8925] [MLLIB] Add @since tags to mllib.util Went thru the history of changes the file MLUtils.scala and picked up the version that the change went in. Author: Sudhakar Thota sudhakarth...@yahoo.com Author: Sudhakar Thota sudhakarth...@sudhakars-mbp-2.usca.ibm.com Closes #7436 from sthota2014/SPARK-8925_thotas. (cherry picked from commit 017b5de07ef6cff249e984a2ab781c520249ac76) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ef961ed4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ef961ed4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ef961ed4 Branch: refs/heads/branch-1.5 Commit: ef961ed48a4f45447f0e0ad256b040c7ab2d78d9 Parents: 2273e74 Author: Sudhakar Thota sudhakarth...@yahoo.com Authored: Tue Aug 11 14:31:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 14:32:01 2015 -0700 -- .../org/apache/spark/mllib/util/MLUtils.scala | 22 +++- 1 file changed, 21 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/ef961ed4/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala index 7c5cfa7..26eb84a 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala @@ -64,6 +64,7 @@ object MLUtils { *feature dimensions. * @param minPartitions min number of partitions * @return labeled data stored as an RDD[LabeledPoint] + * @since 1.0.0 */ def loadLibSVMFile( sc: SparkContext, @@ -113,7 +114,10 @@ object MLUtils { } // Convenient methods for `loadLibSVMFile`. 
- + + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -126,6 +130,7 @@ object MLUtils { /** * Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of * partitions. + * @since 1.0.0 */ def loadLibSVMFile( sc: SparkContext, @@ -133,6 +138,9 @@ object MLUtils { numFeatures: Int): RDD[LabeledPoint] = loadLibSVMFile(sc, path, numFeatures, sc.defaultMinPartitions) + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -141,6 +149,9 @@ object MLUtils { numFeatures: Int): RDD[LabeledPoint] = loadLibSVMFile(sc, path, numFeatures) + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -151,6 +162,7 @@ object MLUtils { /** * Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with number of * features determined automatically and the default number of partitions. + * @since 1.0.0 */ def loadLibSVMFile(sc: SparkContext, path: String): RDD[LabeledPoint] = loadLibSVMFile(sc, path, -1) @@ -181,12 +193,14 @@ object MLUtils { * @param path file or directory path in any Hadoop-supported file system URI * @param minPartitions min number of partitions * @return vectors stored as an RDD[Vector] + * @since 1.1.0 */ def loadVectors(sc: SparkContext, path: String, minPartitions: Int): RDD[Vector] = sc.textFile(path, minPartitions).map(Vectors.parse) /** * Loads vectors saved using `RDD[Vector].saveAsTextFile` with the default number of partitions. 
+ * @since 1.1.0 */ def loadVectors(sc: SparkContext, path: String): RDD[Vector] = sc.textFile(path, sc.defaultMinPartitions).map(Vectors.parse) @@ -197,6 +211,7 @@ object MLUtils { * @param path file or directory path in any Hadoop-supported file system URI * @param minPartitions min number of partitions * @return labeled points stored as an RDD[LabeledPoint] + * @since 1.1.0 */ def loadLabeledPoints(sc: SparkContext, path: String, minPartitions: Int): RDD[LabeledPoint] = sc.textFile(path, minPartitions).map(LabeledPoint.parse) @@ -204,6 +219,7 @@ object MLUtils { /** * Loads labeled points saved using `RDD[LabeledPoint].saveAsTextFile` with the default number of * partitions. + * @since 1.1.0 */ def loadLabeledPoints(sc: SparkContext, dir: String): RDD[LabeledPoint] = loadLabeledPoints(sc, dir, sc.defaultMinPartitions) @@ -220,6 +236,7
spark git commit: [SPARK-8925] [MLLIB] Add @since tags to mllib.util
Repository: spark Updated Branches: refs/heads/master be3e27164 - 017b5de07 [SPARK-8925] [MLLIB] Add @since tags to mllib.util Went thru the history of changes the file MLUtils.scala and picked up the version that the change went in. Author: Sudhakar Thota sudhakarth...@yahoo.com Author: Sudhakar Thota sudhakarth...@sudhakars-mbp-2.usca.ibm.com Closes #7436 from sthota2014/SPARK-8925_thotas. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/017b5de0 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/017b5de0 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/017b5de0 Branch: refs/heads/master Commit: 017b5de07ef6cff249e984a2ab781c520249ac76 Parents: be3e271 Author: Sudhakar Thota sudhakarth...@yahoo.com Authored: Tue Aug 11 14:31:51 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 14:31:51 2015 -0700 -- .../org/apache/spark/mllib/util/MLUtils.scala | 22 +++- 1 file changed, 21 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/017b5de0/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala index 7c5cfa7..26eb84a 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala @@ -64,6 +64,7 @@ object MLUtils { *feature dimensions. * @param minPartitions min number of partitions * @return labeled data stored as an RDD[LabeledPoint] + * @since 1.0.0 */ def loadLibSVMFile( sc: SparkContext, @@ -113,7 +114,10 @@ object MLUtils { } // Convenient methods for `loadLibSVMFile`. 
- + + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -126,6 +130,7 @@ object MLUtils { /** * Loads labeled data in the LIBSVM format into an RDD[LabeledPoint], with the default number of * partitions. + * @since 1.0.0 */ def loadLibSVMFile( sc: SparkContext, @@ -133,6 +138,9 @@ object MLUtils { numFeatures: Int): RDD[LabeledPoint] = loadLibSVMFile(sc, path, numFeatures, sc.defaultMinPartitions) + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -141,6 +149,9 @@ object MLUtils { numFeatures: Int): RDD[LabeledPoint] = loadLibSVMFile(sc, path, numFeatures) + /** + * @since 1.0.0 + */ @deprecated(use method without multiclass argument, which no longer has effect, 1.1.0) def loadLibSVMFile( sc: SparkContext, @@ -151,6 +162,7 @@ object MLUtils { /** * Loads binary labeled data in the LIBSVM format into an RDD[LabeledPoint], with number of * features determined automatically and the default number of partitions. + * @since 1.0.0 */ def loadLibSVMFile(sc: SparkContext, path: String): RDD[LabeledPoint] = loadLibSVMFile(sc, path, -1) @@ -181,12 +193,14 @@ object MLUtils { * @param path file or directory path in any Hadoop-supported file system URI * @param minPartitions min number of partitions * @return vectors stored as an RDD[Vector] + * @since 1.1.0 */ def loadVectors(sc: SparkContext, path: String, minPartitions: Int): RDD[Vector] = sc.textFile(path, minPartitions).map(Vectors.parse) /** * Loads vectors saved using `RDD[Vector].saveAsTextFile` with the default number of partitions. 
+ * @since 1.1.0 */ def loadVectors(sc: SparkContext, path: String): RDD[Vector] = sc.textFile(path, sc.defaultMinPartitions).map(Vectors.parse) @@ -197,6 +211,7 @@ object MLUtils { * @param path file or directory path in any Hadoop-supported file system URI * @param minPartitions min number of partitions * @return labeled points stored as an RDD[LabeledPoint] + * @since 1.1.0 */ def loadLabeledPoints(sc: SparkContext, path: String, minPartitions: Int): RDD[LabeledPoint] = sc.textFile(path, minPartitions).map(LabeledPoint.parse) @@ -204,6 +219,7 @@ object MLUtils { /** * Loads labeled points saved using `RDD[LabeledPoint].saveAsTextFile` with the default number of * partitions. + * @since 1.1.0 */ def loadLabeledPoints(sc: SparkContext, dir: String): RDD[LabeledPoint] = loadLabeledPoints(sc, dir, sc.defaultMinPartitions) @@ -220,6 +236,7 @@ object MLUtils { * * @deprecated Should use [[org.apache.spark.rdd.RDD#saveAsTextFile]] for saving
spark git commit: [SPARK-8345] [ML] Add an SQL node as a feature transformer
Repository: spark Updated Branches: refs/heads/master bce72797f - 8cad854ef [SPARK-8345] [ML] Add an SQL node as a feature transformer Implements the transforms which are defined by SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__' where '__THIS__' represents the underlying table of the input dataset. Author: Yanbo Liang yblia...@gmail.com Closes #7465 from yanboliang/spark-8345 and squashes the following commits: b403fcb [Yanbo Liang] address comments 0d4bb15 [Yanbo Liang] a better transformSchema() implementation 51eb9e7 [Yanbo Liang] Add an SQL node as a feature transformer Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8cad854e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8cad854e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8cad854e Branch: refs/heads/master Commit: 8cad854ef6a2066de5adffcca6b79a205ccfd5f3 Parents: bce7279 Author: Yanbo Liang yblia...@gmail.com Authored: Tue Aug 11 11:01:59 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 11:01:59 2015 -0700 -- .../spark/ml/feature/SQLTransformer.scala | 72 .../spark/ml/feature/SQLTransformerSuite.scala | 44 2 files changed, 116 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/8cad854e/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala new file mode 100644 index 000..95e4305 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/SQLTransformer.scala @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. 
+ * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.SparkContext +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.param.{ParamMap, Param} +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.{SQLContext, DataFrame, Row} +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * Implements the transforms which are defined by SQL statement. + * Currently we only support SQL syntax like 'SELECT ... FROM __THIS__' + * where '__THIS__' represents the underlying table of the input dataset. + */ +@Experimental +class SQLTransformer (override val uid: String) extends Transformer { + + def this() = this(Identifiable.randomUID(sql)) + + /** + * SQL statement parameter. The statement is provided in string form. 
+ * @group param + */ + final val statement: Param[String] = new Param[String](this, statement, SQL statement) + + /** @group setParam */ + def setStatement(value: String): this.type = set(statement, value) + + /** @group getParam */ + def getStatement: String = $(statement) + + private val tableIdentifier: String = __THIS__ + + override def transform(dataset: DataFrame): DataFrame = { +val tableName = Identifiable.randomUID(uid) +dataset.registerTempTable(tableName) +val realStatement = $(statement).replace(tableIdentifier, tableName) +val outputDF = dataset.sqlContext.sql(realStatement) +outputDF + } + + override def transformSchema(schema: StructType): StructType = { +val sc = SparkContext.getOrCreate() +val sqlContext = SQLContext.getOrCreate(sc) +val dummyRDD = sc.parallelize(Seq(Row.empty)) +val dummyDF = sqlContext.createDataFrame(dummyRDD, schema) +dummyDF.registerTempTable(tableIdentifier) +val outputSchema = sqlContext.sql($(statement)).schema +outputSchema + } + + override def copy(extra: ParamMap): SQLTransformer = defaultCopy(extra) +} http://git-wip-us.apache.org/repos/asf/spark/blob/8cad854e/mllib/src/test/scala/org/apache/spark/ml/feature/SQLTransformerSuite.scala -- diff --git a/mllib/src/test/scala
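The transform pattern this commit introduces — register the input under a generated temp-table name, substitute that name for the `__THIS__` placeholder, then execute the statement — can be demonstrated against SQLite instead of Spark SQL. A hedged sketch, assuming a fixed two-column input; the `sql_transform` helper is a name invented for the example, not part of any API.

```python
import sqlite3
import uuid

TABLE_PLACEHOLDER = "__THIS__"

def sql_transform(conn, rows, schema_sql, statement):
    # Register the input under a unique temp name, substitute it for
    # __THIS__, and run the statement -- the same shape as the
    # SQLTransformer.transform() in the diff, but against SQLite.
    table = "t_" + uuid.uuid4().hex
    conn.execute(f"CREATE TEMP TABLE {table} ({schema_sql})")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    real_statement = statement.replace(TABLE_PLACEHOLDER, table)
    return conn.execute(real_statement).fetchall()

conn = sqlite3.connect(":memory:")
out = sql_transform(conn, [(1, 2.0), (3, 4.0)], "id INTEGER, v REAL",
                    "SELECT id, v * 2 AS v2 FROM __THIS__")
print(out)  # [(1, 4.0), (3, 8.0)]
```

The random table name serves the same purpose as `Identifiable.randomUID(uid)` in the Scala code: two concurrent transforms must not collide on the registered temp-table name.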
spark git commit: [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5
Repository: spark Updated Branches: refs/heads/branch-1.5 6ea33f5bf - 890c75bc2 [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5 This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R api doc. mengxr Author: Eric Liang e...@databricks.com Closes #8085 from ericl/docs. (cherry picked from commit 74a293f4537c6982345166f8883538f81d850872) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/890c75bc Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/890c75bc Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/890c75bc Branch: refs/heads/branch-1.5 Commit: 890c75bc2c2e1405c00485a98c034342122b639f Parents: 6ea33f5 Author: Eric Liang e...@databricks.com Authored: Tue Aug 11 21:26:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 21:26:12 2015 -0700 -- R/pkg/R/generics.R | 4 ++-- R/pkg/R/mllib.R| 8 docs/sparkr.md | 37 - 3 files changed, 42 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/890c75bc/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index c43b947..379a78b 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -535,8 +535,8 @@ setGeneric(showDF, function(x,...) { standardGeneric(showDF) }) #' @export setGeneric(summarize, function(x,...) { standardGeneric(summarize) }) -##' rdname summary -##' @export +#' @rdname summary +#' @export setGeneric(summary, function(x, ...) { standardGeneric(summary) }) # @rdname tojson http://git-wip-us.apache.org/repos/asf/spark/blob/890c75bc/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index b524d1f..cea3d76 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -56,10 +56,10 @@ setMethod(glm, signature(formula = formula, family = ANY, data = DataFram #' #' Makes predictions from a model produced by glm(), similarly to R's predict(). 
#' -#' @param model A fitted MLlib model +#' @param object A fitted MLlib model #' @param newData DataFrame for testing #' @return DataFrame containing predicted values -#' @rdname glm +#' @rdname predict #' @export #' @examples #'\dontrun{ @@ -76,10 +76,10 @@ setMethod(predict, signature(object = PipelineModel), #' #' Returns the summary of a model produced by glm(), similarly to R's summary(). #' -#' @param model A fitted MLlib model +#' @param x A fitted MLlib model #' @return a list with a 'coefficient' component, which is the matrix of coefficients. See #' summary.glm for more information. -#' @rdname glm +#' @rdname summary #' @export #' @examples #'\dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/890c75bc/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 4385a4e..7139d16 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -11,7 +11,8 @@ title: SparkR (R on Spark) SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, -[dplyr](https://github.com/hadley/dplyr)) but on large datasets. +[dplyr](https://github.com/hadley/dplyr)) but on large datasets. SparkR also supports distributed +machine learning using MLlib. # SparkR DataFrames @@ -230,3 +231,37 @@ head(teenagers) {% endhighlight %} /div + +# Machine Learning + +SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows the use of building a gaussian GLM model using SparkR. 
+ +div data-lang=r markdown=1 +{% highlight r %} +# Create the DataFrame +df - createDataFrame(sqlContext, iris) + +# Fit a linear model over the dataset. +model - glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = gaussian) + +# Model coefficients are returned in a similar format to R's native glm(). +summary(model) +##$coefficients +##Estimate +##(Intercept)2.2513930 +##Sepal_Width0.8035609 +##Species_versicolor 1.4587432 +##Species_virginica 1.9468169 + +# Make predictions based on the model. +predictions - predict(model, newData = df
spark git commit: [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5
Repository: spark Updated Branches: refs/heads/master 3ef0f3292 - 74a293f45 [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5 This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R api doc. mengxr Author: Eric Liang e...@databricks.com Closes #8085 from ericl/docs. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/74a293f4 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/74a293f4 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/74a293f4 Branch: refs/heads/master Commit: 74a293f4537c6982345166f8883538f81d850872 Parents: 3ef0f32 Author: Eric Liang e...@databricks.com Authored: Tue Aug 11 21:26:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 11 21:26:03 2015 -0700 -- R/pkg/R/generics.R | 4 ++-- R/pkg/R/mllib.R| 8 docs/sparkr.md | 37 - 3 files changed, 42 insertions(+), 7 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/74a293f4/R/pkg/R/generics.R -- diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index c43b947..379a78b 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -535,8 +535,8 @@ setGeneric(showDF, function(x,...) { standardGeneric(showDF) }) #' @export setGeneric(summarize, function(x,...) { standardGeneric(summarize) }) -##' rdname summary -##' @export +#' @rdname summary +#' @export setGeneric(summary, function(x, ...) { standardGeneric(summary) }) # @rdname tojson http://git-wip-us.apache.org/repos/asf/spark/blob/74a293f4/R/pkg/R/mllib.R -- diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R index b524d1f..cea3d76 100644 --- a/R/pkg/R/mllib.R +++ b/R/pkg/R/mllib.R @@ -56,10 +56,10 @@ setMethod(glm, signature(formula = formula, family = ANY, data = DataFram #' #' Makes predictions from a model produced by glm(), similarly to R's predict(). 
#' -#' @param model A fitted MLlib model +#' @param object A fitted MLlib model #' @param newData DataFrame for testing #' @return DataFrame containing predicted values -#' @rdname glm +#' @rdname predict #' @export #' @examples #'\dontrun{ @@ -76,10 +76,10 @@ setMethod(predict, signature(object = PipelineModel), #' #' Returns the summary of a model produced by glm(), similarly to R's summary(). #' -#' @param model A fitted MLlib model +#' @param x A fitted MLlib model #' @return a list with a 'coefficient' component, which is the matrix of coefficients. See #' summary.glm for more information. -#' @rdname glm +#' @rdname summary #' @export #' @examples #'\dontrun{ http://git-wip-us.apache.org/repos/asf/spark/blob/74a293f4/docs/sparkr.md -- diff --git a/docs/sparkr.md b/docs/sparkr.md index 4385a4e..7139d16 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -11,7 +11,8 @@ title: SparkR (R on Spark) SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, -[dplyr](https://github.com/hadley/dplyr)) but on large datasets. +[dplyr](https://github.com/hadley/dplyr)) but on large datasets. SparkR also supports distributed +machine learning using MLlib. # SparkR DataFrames @@ -230,3 +231,37 @@ head(teenagers) {% endhighlight %} /div + +# Machine Learning + +SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows the use of building a gaussian GLM model using SparkR. 
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+# Create the DataFrame
+df <- createDataFrame(sqlContext, iris)
+
+# Fit a linear model over the dataset.
+model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
+
+# Model coefficients are returned in a similar format to R's native glm().
+summary(model)
+##$coefficients
+##                    Estimate
+##(Intercept)        2.2513930
+##Sepal_Width        0.8035609
+##Species_versicolor 1.4587432
+##Species_virginica  1.9468169
+
+# Make predictions based on the model.
+predictions <- predict(model, newData = df)
+head(select(predictions, "Sepal_Length", "prediction"))
+##  Sepal_Length prediction
+##1          5.1   5.063856
+##2
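For the gaussian family, the fit shown in this guide reduces to linear least squares over a design matrix built from the formula terms: an intercept, the numeric predictors, and one 0/1 dummy column per non-reference level of each categorical term — which is why the summary lists coefficients for Species_versicolor and Species_virginica but none for Species_setosa. The following pure-Python sketch illustrates that dummy coding and reproduces the first prediction row using the coefficient values from the example output; the helper names are hypothetical, not part of SparkR or MLlib.

```python
# Illustrative sketch of R-formula dummy coding and prediction; function
# names here are hypothetical, not part of SparkR or MLlib.

def design_row(sepal_width, species,
               levels=("setosa", "versicolor", "virginica")):
    """Build one design-matrix row for Sepal_Length ~ Sepal_Width + Species:
    intercept, numeric predictor, then one dummy per non-reference level."""
    _ref, *rest = levels  # the first level is the reference and gets no column
    return [1.0, sepal_width] + [1.0 if species == lvl else 0.0 for lvl in rest]

# Coefficients as reported by summary(model) in the example output:
# (Intercept), Sepal_Width, Species_versicolor, Species_virginica
COEF = [2.2513930, 0.8035609, 1.4587432, 1.9468169]

def predict_one(sepal_width, species):
    """Dot product of the coefficient vector with the design row."""
    return sum(c * x for c, x in zip(COEF, design_row(sepal_width, species)))

# The first iris row (Sepal_Width 3.5, species setosa) reproduces the first
# prediction in the example output:
print(round(predict_one(3.5, "setosa"), 6))  # 5.063856
```

The reference-level encoding is what makes the intercept absorb the baseline species; a different level ordering would shift the reported coefficients without changing the predictions.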
spark git commit: [SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten
Repository: spark
Updated Branches: refs/heads/branch-1.5 e24b97650 -> 78f168e97

[SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten

In short:

1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map corresponds to the very last element observed in the partition, when the elements are strings.

2- Merging two partitions had a bug:

**Existing behavior with size 3**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()

**Correct Behavior:**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)

cc mengxr rxin JoshRosen

Author: Burak Yavuz <brk...@gmail.com>

Closes #7945 from brkyvz/freq-fix and squashes the following commits:

07fa001 [Burak Yavuz] address 2
1dc61a8 [Burak Yavuz] address 1
506753e [Burak Yavuz] fixed and added reg test
47bfd50 [Burak Yavuz] pushing

(cherry picked from commit 98e69467d4fda2c26a951409b5b7c6f1e9345ce4)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/78f168e9
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/78f168e9
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/78f168e9
Branch: refs/heads/branch-1.5
Commit: 78f168e97238316e33ce0d3763ba655603928c32
Parents: e24b976
Author: Burak Yavuz <brk...@gmail.com>
Authored: Thu Aug 6 10:29:40 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Thu Aug 6 10:29:47 2015 -0700

--
 .../sql/execution/stat/FrequentItems.scala     | 26 +++-
 .../apache/spark/sql/DataFrameStatSuite.scala  | 24 +++---
 2 files changed, 36 insertions(+), 14 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/78f168e9/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
--
diff --git
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
index 9329148..db46302 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
@@ -20,17 +20,15 @@ package org.apache.spark.sql.execution.stat

 import scala.collection.mutable.{Map => MutableMap}

 import org.apache.spark.Logging
-import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.types._
-import org.apache.spark.sql.{Column, DataFrame}
+import org.apache.spark.sql.{Row, Column, DataFrame}

 private[sql] object FrequentItems extends Logging {

   /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
   private class FreqItemCounter(size: Int) extends Serializable {
     val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
-
     /**
      * Add a new example to the counts if it exists, otherwise deduct the count
      * from existing items.
@@ -42,9 +40,15 @@ private[sql] object FrequentItems extends Logging {
       if (baseMap.size < size) {
         baseMap += key -> count
       } else {
-        // TODO: Make this more efficient... A flatMap?
-        baseMap.retain((k, v) => v > count)
-        baseMap.transform((k, v) => v - count)
+        val minCount = baseMap.values.min
+        val remainder = count - minCount
+        if (remainder >= 0) {
+          baseMap += key -> count // something will get kicked out, so we can add this
+          baseMap.retain((k, v) => v > minCount)
+          baseMap.transform((k, v) => v - minCount)
+        } else {
+          baseMap.transform((k, v) => v - count)
+        }
       }
     }
     this
@@ -90,12 +94,12 @@ private[sql] object FrequentItems extends Logging {
       (name, originalSchema.fields(index).dataType)
     }.toArray

-    val freqItems = df.select(cols.map(Column(_)) : _*).queryExecution.toRdd.aggregate(countMaps)(
+    val freqItems = df.select(cols.map(Column(_)) : _*).rdd.aggregate(countMaps)(
      seqOp = (counts, row) => {
        var i = 0
        while (i < numCols) {
          val thisMap = counts(i)
-         val key = row.get(i, colInfo(i)._2)
+         val key = row.get(i)
          thisMap.add(key, 1L)
          i += 1
        }
@@ -110,13 +114,13 @@ private[sql] object FrequentItems extends Logging {
       baseCounts
     }
    )
-    val justItems = freqItems.map(m => m.baseMap.keys.toArray).map(new GenericArrayData(_))
-    val resultRow = InternalRow(justItems : _*)
+    val justItems = freqItems.map(m => m.baseMap.keys.toArray
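The corrected merge semantics of this commit can be seen in a small, self-contained Python sketch of the counter — a Misra–Gries-style summary that mirrors the fixed Scala logic. This is an illustration only, not the Spark implementation:

```python
class FreqItemCounter:
    """Misra-Gries-style counter capped at `size` distinct keys
    (illustrative sketch mirroring the fixed FreqItemCounter logic)."""

    def __init__(self, size):
        self.size = size
        self.base_map = {}

    def add(self, key, count=1):
        if key in self.base_map:
            self.base_map[key] += count
        elif len(self.base_map) < self.size:
            self.base_map[key] = count
        else:
            min_count = min(self.base_map.values())
            remainder = count - min_count
            if remainder >= 0:
                # Something will get kicked out, so we can add this key,
                # then drop everything at or below min_count and rescale.
                self.base_map[key] = count
                self.base_map = {k: v - min_count
                                 for k, v in self.base_map.items()
                                 if v > min_count}
            else:
                # Key too small to displace anything: just deduct its count.
                self.base_map = {k: v - count for k, v in self.base_map.items()}
        return self

    def merge(self, other):
        for key, count in other.base_map.items():
            self.add(key, count)
        return self

# Reproduce the "Correct Behavior" example from the commit message:
a = FreqItemCounter(3)
for k, v in [(1, 3), (2, 3), (3, 4)]:
    a.add(k, v)
b = FreqItemCounter(3).add(4, 25)
print(a.merge(b).base_map)  # {3: 1, 4: 22}
```

Note how the key with count 25 displaces the two minimum-count keys instead of wiping the whole map, which was the buggy behavior described above.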
spark git commit: [SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten
Repository: spark
Updated Branches: refs/heads/master 076ec0568 -> 98e69467d

[SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten

In short:

1- FrequentItems should not use the InternalRow representation, because the keys in the map get messed up. For example, every key in the Map corresponds to the very last element observed in the partition, when the elements are strings.

2- Merging two partitions had a bug:

**Existing behavior with size 3**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()

**Correct Behavior:**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)

cc mengxr rxin JoshRosen

Author: Burak Yavuz <brk...@gmail.com>

Closes #7945 from brkyvz/freq-fix and squashes the following commits:

07fa001 [Burak Yavuz] address 2
1dc61a8 [Burak Yavuz] address 1
506753e [Burak Yavuz] fixed and added reg test
47bfd50 [Burak Yavuz] pushing

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/98e69467
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/98e69467
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/98e69467
Branch: refs/heads/master
Commit: 98e69467d4fda2c26a951409b5b7c6f1e9345ce4
Parents: 076ec05
Author: Burak Yavuz <brk...@gmail.com>
Authored: Thu Aug 6 10:29:40 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Thu Aug 6 10:29:40 2015 -0700

--
 .../sql/execution/stat/FrequentItems.scala     | 26 +++-
 .../apache/spark/sql/DataFrameStatSuite.scala  | 24 +++---
 2 files changed, 36 insertions(+), 14 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/98e69467/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
index 9329148..db46302 100644
---
a/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/FrequentItems.scala
@@ -20,17 +20,15 @@ package org.apache.spark.sql.execution.stat

 import scala.collection.mutable.{Map => MutableMap}

 import org.apache.spark.Logging
-import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
 import org.apache.spark.sql.types._
-import org.apache.spark.sql.{Column, DataFrame}
+import org.apache.spark.sql.{Row, Column, DataFrame}

 private[sql] object FrequentItems extends Logging {

   /** A helper class wrapping `MutableMap[Any, Long]` for simplicity. */
   private class FreqItemCounter(size: Int) extends Serializable {
     val baseMap: MutableMap[Any, Long] = MutableMap.empty[Any, Long]
-
     /**
      * Add a new example to the counts if it exists, otherwise deduct the count
      * from existing items.
@@ -42,9 +40,15 @@ private[sql] object FrequentItems extends Logging {
       if (baseMap.size < size) {
         baseMap += key -> count
       } else {
-        // TODO: Make this more efficient... A flatMap?
-        baseMap.retain((k, v) => v > count)
-        baseMap.transform((k, v) => v - count)
+        val minCount = baseMap.values.min
+        val remainder = count - minCount
+        if (remainder >= 0) {
+          baseMap += key -> count // something will get kicked out, so we can add this
+          baseMap.retain((k, v) => v > minCount)
+          baseMap.transform((k, v) => v - minCount)
+        } else {
+          baseMap.transform((k, v) => v - count)
+        }
       }
     }
     this
@@ -90,12 +94,12 @@ private[sql] object FrequentItems extends Logging {
       (name, originalSchema.fields(index).dataType)
     }.toArray

-    val freqItems = df.select(cols.map(Column(_)) : _*).queryExecution.toRdd.aggregate(countMaps)(
+    val freqItems = df.select(cols.map(Column(_)) : _*).rdd.aggregate(countMaps)(
      seqOp = (counts, row) => {
        var i = 0
        while (i < numCols) {
          val thisMap = counts(i)
-         val key = row.get(i, colInfo(i)._2)
+         val key = row.get(i)
          thisMap.add(key, 1L)
          i += 1
        }
@@ -110,13 +114,13 @@ private[sql] object FrequentItems extends Logging {
       baseCounts
     }
    )
-    val justItems = freqItems.map(m => m.baseMap.keys.toArray).map(new GenericArrayData(_))
-    val resultRow = InternalRow(justItems : _*)
+    val justItems = freqItems.map(m => m.baseMap.keys.toArray)
+    val resultRow = Row(justItems : _*)
     // append frequent Items to the column name for easy debugging
     val outputCols
spark git commit: [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark.
Repository: spark Updated Branches: refs/heads/branch-1.5 350006497 - eedb996dd [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark. mengxr This adds the `BlockMatrix` to PySpark. I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished. Author: Mike Dusenberry mwdus...@us.ibm.com Closes #7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits: 27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes. ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix. b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation. c014002 [Mike Dusenberry] Using properties for better documentation. 3bda6ab [Mike Dusenberry] Adding documentation. 8fb3095 [Mike Dusenberry] Small cleanup. e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark. 
(cherry picked from commit 34dcf10104460816382908b2b8eeb6c925e862bf) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/eedb996d Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/eedb996d Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/eedb996d Branch: refs/heads/branch-1.5 Commit: eedb996dde5593a97bcb61b3b1515e6fdea6aa70 Parents: 3500064 Author: Mike Dusenberry mwdus...@us.ibm.com Authored: Wed Aug 5 07:40:50 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 5 07:42:25 2015 -0700 -- docs/mllib-data-types.md| 41 +++ .../spark/mllib/api/python/PythonMLLibAPI.scala | 25 ++ python/pyspark/mllib/linalg/distributed.py | 328 ++- 3 files changed, 388 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/eedb996d/docs/mllib-data-types.md -- diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md index 11033bf..f0e8d54 100644 --- a/docs/mllib-data-types.md +++ b/docs/mllib-data-types.md @@ -494,6 +494,9 @@ rowMat = mat.toRowMatrix() # Convert to a CoordinateMatrix. coordinateMat = mat.toCoordinateMatrix() + +# Convert to a BlockMatrix. +blockMat = mat.toBlockMatrix() {% endhighlight %} /div @@ -594,6 +597,9 @@ rowMat = mat.toRowMatrix() # Convert to an IndexedRowMatrix. indexedRowMat = mat.toIndexedRowMatrix() + +# Convert to a BlockMatrix. +blockMat = mat.toBlockMatrix() {% endhighlight %} /div @@ -661,4 +667,39 @@ matA.validate(); BlockMatrix ata = matA.transpose().multiply(matA); {% endhighlight %} /div + +div data-lang=python markdown=1 + +A [`BlockMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix) +can be created from an `RDD` of sub-matrix blocks, where a sub-matrix block is a +`((blockRowIndex, blockColIndex), sub-matrix)` tuple. 
+ +{% highlight python %} +from pyspark.mllib.linalg import Matrices +from pyspark.mllib.linalg.distributed import BlockMatrix + +# Create an RDD of sub-matrix blocks. +blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])), + ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))]) + +# Create a BlockMatrix from an RDD of sub-matrix blocks. +mat = BlockMatrix(blocks, 3, 2) + +# Get its size. +m = mat.numRows() # 6 +n = mat.numCols() # 2 + +# Get the blocks as an RDD of sub-matrix blocks. +blocksRDD = mat.blocks + +# Convert to a LocalMatrix. +localMat = mat.toLocalMatrix() + +# Convert to an IndexedRowMatrix. +indexedRowMat = mat.toIndexedRowMatrix() + +# Convert to a CoordinateMatrix. +coordinateMat = mat.toCoordinateMatrix() +{% endhighlight %} +/div /div http://git-wip-us.apache.org/repos/asf/spark/blob/eedb996d/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index d2b3fae..f585aac 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1129,6 +1129,21 @@ private[python] class PythonMLLibAPI extends Serializable { } /** + * Wrapper around BlockMatrix constructor. + */ + def createBlockMatrix(blocks: DataFrame, rowsPerBlock: Int, colsPerBlock: Int, +numRows: Long, numCols: Long
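What toLocalMatrix() produces for such a BlockMatrix can be illustrated with a small pure-Python sketch: each sub-matrix block is placed at the offset determined by its block indices and the per-block dimensions. Plain nested lists stand in for the matrices here; note that Spark's Matrices.dense takes its values in column-major order, so Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6]) is the block [[1, 4], [2, 5], [3, 6]]. This is an illustration of the blocking layout, not PySpark code:

```python
def assemble_local(blocks, rows_per_block, cols_per_block, num_rows, num_cols):
    """Place each ((blockRowIndex, blockColIndex), block) at its offset in a
    dense num_rows x num_cols matrix (row-major nested lists).
    Illustrative sketch of the BlockMatrix layout, not the Spark code."""
    out = [[0.0] * num_cols for _ in range(num_rows)]
    for (bi, bj), block in blocks:
        for r, row in enumerate(block):
            for c, value in enumerate(row):
                out[bi * rows_per_block + r][bj * cols_per_block + c] = value
    return out

# The same two 3x2 blocks as in the example above (written row-major here):
blocks = [((0, 0), [[1, 4], [2, 5], [3, 6]]),
          ((1, 0), [[7, 10], [8, 11], [9, 12]])]
local = assemble_local(blocks, rows_per_block=3, cols_per_block=2,
                       num_rows=6, num_cols=2)
# local == [[1, 4], [2, 5], [3, 6], [7, 10], [8, 11], [9, 12]]
```

Block (1, 0) lands three rows down, which is why mat.numRows() reports 6 while each stored block is only 3x2.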
spark git commit: [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark.
Repository: spark Updated Branches: refs/heads/master 519cf6d3f - 34dcf1010 [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark. mengxr This adds the `BlockMatrix` to PySpark. I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished. Author: Mike Dusenberry mwdus...@us.ibm.com Closes #7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits: 27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes. ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix. b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation. c014002 [Mike Dusenberry] Using properties for better documentation. 3bda6ab [Mike Dusenberry] Adding documentation. 8fb3095 [Mike Dusenberry] Small cleanup. e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/34dcf101 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/34dcf101 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/34dcf101 Branch: refs/heads/master Commit: 34dcf10104460816382908b2b8eeb6c925e862bf Parents: 519cf6d Author: Mike Dusenberry mwdus...@us.ibm.com Authored: Wed Aug 5 07:40:50 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 5 07:40:50 2015 -0700 -- docs/mllib-data-types.md| 41 +++ .../spark/mllib/api/python/PythonMLLibAPI.scala | 25 ++ python/pyspark/mllib/linalg/distributed.py | 328 ++- 3 files changed, 388 insertions(+), 6 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/34dcf101/docs/mllib-data-types.md -- diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md index 11033bf..f0e8d54 100644 --- a/docs/mllib-data-types.md +++ b/docs/mllib-data-types.md @@ -494,6 +494,9 @@ rowMat = mat.toRowMatrix() # Convert to a CoordinateMatrix. coordinateMat = mat.toCoordinateMatrix() + +# Convert to a BlockMatrix. +blockMat = mat.toBlockMatrix() {% endhighlight %} /div @@ -594,6 +597,9 @@ rowMat = mat.toRowMatrix() # Convert to an IndexedRowMatrix. indexedRowMat = mat.toIndexedRowMatrix() + +# Convert to a BlockMatrix. +blockMat = mat.toBlockMatrix() {% endhighlight %} /div @@ -661,4 +667,39 @@ matA.validate(); BlockMatrix ata = matA.transpose().multiply(matA); {% endhighlight %} /div + +div data-lang=python markdown=1 + +A [`BlockMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix) +can be created from an `RDD` of sub-matrix blocks, where a sub-matrix block is a +`((blockRowIndex, blockColIndex), sub-matrix)` tuple. + +{% highlight python %} +from pyspark.mllib.linalg import Matrices +from pyspark.mllib.linalg.distributed import BlockMatrix + +# Create an RDD of sub-matrix blocks. 
+blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])), + ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))]) + +# Create a BlockMatrix from an RDD of sub-matrix blocks. +mat = BlockMatrix(blocks, 3, 2) + +# Get its size. +m = mat.numRows() # 6 +n = mat.numCols() # 2 + +# Get the blocks as an RDD of sub-matrix blocks. +blocksRDD = mat.blocks + +# Convert to a LocalMatrix. +localMat = mat.toLocalMatrix() + +# Convert to an IndexedRowMatrix. +indexedRowMat = mat.toIndexedRowMatrix() + +# Convert to a CoordinateMatrix. +coordinateMat = mat.toCoordinateMatrix() +{% endhighlight %} +/div /div http://git-wip-us.apache.org/repos/asf/spark/blob/34dcf101/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala index d2b3fae..f585aac 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala @@ -1129,6 +1129,21 @@ private[python] class PythonMLLibAPI extends Serializable { } /** + * Wrapper around BlockMatrix constructor. + */ + def createBlockMatrix(blocks: DataFrame, rowsPerBlock: Int, colsPerBlock: Int, +numRows: Long, numCols: Long): BlockMatrix = { +// We use DataFrames for serialization of sub-matrix blocks from +// Python, so map each Row in the DataFrame
spark git commit: [SPARK-5895] [ML] Add VectorSlicer - updated
Repository: spark Updated Branches: refs/heads/master 9c878923d - a018b8571 [SPARK-5895] [ML] Add VectorSlicer - updated Add VectorSlicer transformer to spark.ml, with features specified as either indices or names. Transfers feature attributes for selected features. Updated version of [https://github.com/apache/spark/pull/5731] CC: yinxusen This updates your PR. You'll still be the primary author of this PR. CC: mengxr Author: Xusen Yin yinxu...@gmail.com Author: Joseph K. Bradley jos...@databricks.com Closes #7972 from jkbradley/yinxusen-SPARK-5895 and squashes the following commits: b16e86e [Joseph K. Bradley] fixed scala style 71c65d2 [Joseph K. Bradley] fix import order 86e9739 [Joseph K. Bradley] cleanups per code review 9d8d6f1 [Joseph K. Bradley] style fix 83bc2e9 [Joseph K. Bradley] Updated VectorSlicer 98c6939 [Xusen Yin] fix style error ecbf2d3 [Xusen Yin] change interfaces and params f6be302 [Xusen Yin] Merge branch 'master' into SPARK-5895 e4781f2 [Xusen Yin] fix commit error fd154d7 [Xusen Yin] add test suite of vector slicer 17171f8 [Xusen Yin] fix slicer 9ab9747 [Xusen Yin] add vector slicer aa5a0bf [Xusen Yin] add vector slicer Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a018b857 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a018b857 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a018b857 Branch: refs/heads/master Commit: a018b85716fd510ae95a3c66d676bbdb90f8d4e7 Parents: 9c87892 Author: Xusen Yin yinxu...@gmail.com Authored: Wed Aug 5 17:07:55 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 5 17:07:55 2015 -0700 -- .../apache/spark/ml/feature/VectorSlicer.scala | 170 +++ .../apache/spark/ml/util/MetadataUtils.scala| 17 ++ .../org/apache/spark/mllib/linalg/Vectors.scala | 24 +++ .../spark/ml/feature/VectorSlicerSuite.scala| 109 .../spark/mllib/linalg/VectorsSuite.scala | 7 + 5 files changed, 327 insertions(+) -- 
http://git-wip-us.apache.org/repos/asf/spark/blob/a018b857/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala new file mode 100644 index 000..772bebe --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.attribute.{Attribute, AttributeGroup} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.{IntArrayParam, ParamMap, StringArrayParam} +import org.apache.spark.ml.util.{Identifiable, MetadataUtils, SchemaUtils} +import org.apache.spark.mllib.linalg._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * This class takes a feature vector and outputs a new feature vector with a subarray of the + * original features. 
+ * + * The subset of features can be specified with either indices ([[setIndices()]]) + * or names ([[setNames()]]). At least one feature must be selected. Duplicate features + * are not allowed, so there can be no overlap between selected indices and names. + * + * The output vector will order features with the selected indices first (in the order given), + * followed by the selected names (in the order given). + */ +@Experimental +final class VectorSlicer(override val uid: String) + extends Transformer with HasInputCol with HasOutputCol { + + def this() = this(Identifiable.randomUID(vectorSlicer)) + + /** + * An array of indices to select features from a vector column. + * There can be no overlap with [[names]]. + * @group param
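The selection semantics described in the VectorSlicer doc above — chosen indices first, in the order given, then chosen names, in the order given, with overlap disallowed — can be sketched in a few lines of Python over a plain list of values. This is an illustrative model of the output ordering, not the Spark implementation:

```python
def vector_slice(values, feature_names, indices=(), names=()):
    """Return the sub-vector: features at `indices` first (in the given
    order), then features selected by `names` (in the given order).
    Overlap between the two selections raises, since VectorSlicer
    disallows duplicate features. Illustrative sketch only."""
    name_indices = [feature_names.index(n) for n in names]
    if set(indices) & set(name_indices):
        raise ValueError("a feature was selected by both index and name")
    return [values[i] for i in list(indices) + name_indices]

features = ["f1", "f2", "f3", "f4"]
values = [10.0, 20.0, 30.0, 40.0]
print(vector_slice(values, features, indices=[3], names=["f1", "f2"]))
# [40.0, 10.0, 20.0]
```

The index-selected feature comes first even though its position in the input vector is last, matching the ordering rule stated in the class doc.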
spark git commit: [SPARK-9657] Fix return type of getMaxPatternLength
Repository: spark
Updated Branches: refs/heads/master f9c2a2af1 -> dac090d1e

[SPARK-9657] Fix return type of getMaxPatternLength

mengxr

Author: Feynman Liang <fli...@databricks.com>

Closes #7974 from feynmanliang/SPARK-9657 and squashes the following commits:

7ca533f [Feynman Liang] Fix return type of getMaxPatternLength

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dac090d1
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dac090d1
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dac090d1
Branch: refs/heads/master
Commit: dac090d1e9be7dec6c5ebdb2a81105b87e853193
Parents: f9c2a2a
Author: Feynman Liang <fli...@databricks.com>
Authored: Wed Aug 5 15:42:18 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 5 15:42:18 2015 -0700

--
 mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/dac090d1/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index d5f0c92..ad6715b5 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -82,7 +82,7 @@ class PrefixSpan private (
   /**
    * Gets the maximal pattern length (i.e. the length of the longest sequential pattern to consider.
    */
-  def getMaxPatternLength: Double = maxPatternLength
+  def getMaxPatternLength: Int = maxPatternLength

   /**
    * Sets maximal pattern length (default: `10`).

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9657] Fix return type of getMaxPatternLength
Repository: spark
Updated Branches: refs/heads/branch-1.5 05cbf133d -> 30e9fcfb3

[SPARK-9657] Fix return type of getMaxPatternLength

mengxr

Author: Feynman Liang <fli...@databricks.com>

Closes #7974 from feynmanliang/SPARK-9657 and squashes the following commits:

7ca533f [Feynman Liang] Fix return type of getMaxPatternLength

(cherry picked from commit dac090d1e9be7dec6c5ebdb2a81105b87e853193)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/30e9fcfb
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/30e9fcfb
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/30e9fcfb
Branch: refs/heads/branch-1.5
Commit: 30e9fcfb321966c09f86eec4e70c579d6dff1cca
Parents: 05cbf13
Author: Feynman Liang <fli...@databricks.com>
Authored: Wed Aug 5 15:42:18 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Wed Aug 5 15:42:24 2015 -0700

--
 mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/30e9fcfb/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index d5f0c92..ad6715b5 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -82,7 +82,7 @@ class PrefixSpan private (
   /**
    * Gets the maximal pattern length (i.e. the length of the longest sequential pattern to consider.
    */
-  def getMaxPatternLength: Double = maxPatternLength
+  def getMaxPatternLength: Int = maxPatternLength

   /**
    * Sets maximal pattern length (default: `10`).
spark git commit: [SPARK-5895] [ML] Add VectorSlicer - updated
Repository: spark Updated Branches: refs/heads/branch-1.5 618dc63e7 - 3b617e87c [SPARK-5895] [ML] Add VectorSlicer - updated Add VectorSlicer transformer to spark.ml, with features specified as either indices or names. Transfers feature attributes for selected features. Updated version of [https://github.com/apache/spark/pull/5731] CC: yinxusen This updates your PR. You'll still be the primary author of this PR. CC: mengxr Author: Xusen Yin yinxu...@gmail.com Author: Joseph K. Bradley jos...@databricks.com Closes #7972 from jkbradley/yinxusen-SPARK-5895 and squashes the following commits: b16e86e [Joseph K. Bradley] fixed scala style 71c65d2 [Joseph K. Bradley] fix import order 86e9739 [Joseph K. Bradley] cleanups per code review 9d8d6f1 [Joseph K. Bradley] style fix 83bc2e9 [Joseph K. Bradley] Updated VectorSlicer 98c6939 [Xusen Yin] fix style error ecbf2d3 [Xusen Yin] change interfaces and params f6be302 [Xusen Yin] Merge branch 'master' into SPARK-5895 e4781f2 [Xusen Yin] fix commit error fd154d7 [Xusen Yin] add test suite of vector slicer 17171f8 [Xusen Yin] fix slicer 9ab9747 [Xusen Yin] add vector slicer aa5a0bf [Xusen Yin] add vector slicer (cherry picked from commit a018b85716fd510ae95a3c66d676bbdb90f8d4e7) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3b617e87 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3b617e87 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3b617e87 Branch: refs/heads/branch-1.5 Commit: 3b617e87cc8524a86a9d5c4a9971520b91119736 Parents: 618dc63 Author: Xusen Yin yinxu...@gmail.com Authored: Wed Aug 5 17:07:55 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Wed Aug 5 17:08:04 2015 -0700 -- .../apache/spark/ml/feature/VectorSlicer.scala | 170 +++ .../apache/spark/ml/util/MetadataUtils.scala| 17 ++ .../org/apache/spark/mllib/linalg/Vectors.scala | 24 +++ 
.../spark/ml/feature/VectorSlicerSuite.scala| 109 .../spark/mllib/linalg/VectorsSuite.scala | 7 + 5 files changed, 327 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/3b617e87/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala new file mode 100644 index 000..772bebe --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSlicer.scala @@ -0,0 +1,170 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.attribute.{Attribute, AttributeGroup} +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.{IntArrayParam, ParamMap, StringArrayParam} +import org.apache.spark.ml.util.{Identifiable, MetadataUtils, SchemaUtils} +import org.apache.spark.mllib.linalg._ +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types.StructType + +/** + * :: Experimental :: + * This class takes a feature vector and outputs a new feature vector with a subarray of the + * original features. + * + * The subset of features can be specified with either indices ([[setIndices()]]) + * or names ([[setNames()]]). At least one feature must be selected. Duplicate features + * are not allowed, so there can be no overlap between selected indices and names. + * + * The output vector will order features with the selected indices first (in the order given), + * followed by the selected names (in the order given). + */ +@Experimental +final class VectorSlicer(override val uid: String) + extends Transformer with HasInputCol with HasOutputCol { + + def this() = this(Identifiable.randomUID(vectorSlicer
spark git commit: [SPARK-9540] [MLLIB] optimize PrefixSpan implementation
Repository: spark Updated Branches: refs/heads/branch-1.5 6e72d24e2 -> bca196754 [SPARK-9540] [MLLIB] optimize PrefixSpan implementation This is a major refactoring of the PrefixSpan implementation. It contains the following changes: 1. Expand the prefix with one item at a time. The existing implementation generates all subsets for each itemset, which might have scalability issues when the itemset is large. 2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` because negative numbers are used to indicate partial prefix items, e.g., `_2` is represented by `-2`. 3. Remember the start indices of all partial projections in the projected postfix to help the next projection. 4. Reuse the original sequence array for projected postfixes. 5. Use `Prefix` IDs in aggregation rather than their content. 6. Use `ArrayBuilder` for building primitive arrays. 7. Expose `maxLocalProjDBSize`. 8. Tests are not changed except for using `0` instead of `-1` as the delimiter. `Postfix`'s API doc should be a good place to start. 
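The internal format in item 2 is easy to illustrate outside of Spark. Below is a minimal plain-Python sketch (not the commit's Scala code; `encode` and `decode` are names we made up) of flattening a sequence of itemsets into the delimiter-separated integer array:

```python
DELIMITER = 0

def encode(sequence):
    """Flatten a sequence of itemsets into the internal format: a 0
    before each itemset and a trailing 0, items sorted within an
    itemset. E.g. <(12)(31)> -> [0, 1, 2, 0, 1, 3, 0]."""
    out = [DELIMITER]
    for itemset in sequence:
        out.extend(sorted(itemset))
        out.append(DELIMITER)
    return out

def decode(encoded):
    """Recover the sequence of itemsets from the flat representation."""
    sequence, current = [], []
    for x in encoded[1:]:
        if x == DELIMITER:
            sequence.append(current)
            current = []
        else:
            # A negative value marks a partial prefix item (e.g. _2 -> -2).
            current.append(abs(x))
    return sequence

print(encode([[1, 2], [3, 1]]))  # [0, 1, 2, 0, 1, 3, 0]
```

Note that itemsets come back sorted, so `decode(encode(s))` normalizes item order within each itemset.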
Closes #7594 feynmanliang zhangjiajin Author: Xiangrui Meng m...@databricks.com Closes #7937 from mengxr/SPARK-9540 and squashes the following commits: 2d0ec31 [Xiangrui Meng] address more comments 48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test 65f90e8 [Xiangrui Meng] naming and documentation 8afc86a [Xiangrui Meng] refactor impl (cherry picked from commit a02bcf20c4fc9e2e182630d197221729e996afc2) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/bca19675 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/bca19675 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/bca19675 Branch: refs/heads/branch-1.5 Commit: bca196754ddf2ccd057d775bd5c3f7d3e5657e6f Parents: 6e72d24 Author: Xiangrui Meng m...@databricks.com Authored: Tue Aug 4 22:28:49 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 22:28:58 2015 -0700 -- .../spark/mllib/fpm/LocalPrefixSpan.scala | 132 +++-- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 587 --- .../spark/mllib/fpm/PrefixSpanSuite.scala | 271 + 3 files changed, 599 insertions(+), 391 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/bca19675/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala index ccebf95..3ea1077 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala @@ -22,85 +22,89 @@ import scala.collection.mutable import org.apache.spark.Logging /** - * Calculate all patterns of a projected database in local. + * Calculate all patterns of a projected database in local mode. 
+ * + * @param minCount minimal count for a frequent pattern + * @param maxPatternLength max pattern length for a frequent pattern */ -private[fpm] object LocalPrefixSpan extends Logging with Serializable { - import PrefixSpan._ +private[fpm] class LocalPrefixSpan( +val minCount: Long, +val maxPatternLength: Int) extends Logging with Serializable { + import PrefixSpan.Postfix + import LocalPrefixSpan.ReversedPrefix + /** - * Calculate all patterns of a projected database. - * @param minCount minimum count - * @param maxPatternLength maximum pattern length - * @param prefixes prefixes in reversed order - * @param database the projected database - * @return a set of sequential pattern pairs, - * the key of pair is sequential pattern (a list of items in reversed order), - * the value of pair is the pattern's count. + * Generates frequent patterns on the input array of postfixes. + * @param postfixes an array of postfixes + * @return an iterator of (frequent pattern, count) */ - def run( - minCount: Long, - maxPatternLength: Int, - prefixes: List[Set[Int]], - database: Iterable[List[Set[Int]]]): Iterator[(List[Set[Int]], Long)] = { -if (prefixes.length == maxPatternLength || database.isEmpty) { - return Iterator.empty -} -val freqItemSetsAndCounts = getFreqItemAndCounts(minCount, database) -val freqItems = freqItemSetsAndCounts.keys.flatten.toSet -val filteredDatabase = database.map { suffix = - suffix -.map(item = freqItems.intersect(item)) -.filter(_.nonEmpty) -} -freqItemSetsAndCounts.iterator.flatMap { case (item, count) = - val newPrefixes = item :: prefixes - val
spark git commit: [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
conversion logic. 4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace. 93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request. f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request. 6a3ecb7 [Mike Dusenberry] Updating pattern matching. 08f287b [Mike Dusenberry] Slight reformatting of the documentation. a245dc0 [Mike Dusenberry] Updating Python doctests for compatibility between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputted as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints. 4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines. 7e3ca16 [Mike Dusenberry] Fixing long lines. f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices. ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful. dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests. 0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization. 3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. 
This way, we can call a conversion method on, for example, one distributed matrix type, which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on the DistributedMatrices class. 4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix. 23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs. b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters. bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods. d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg API. 
Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/571d5b53 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/571d5b53 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/571d5b53 Branch: refs/heads/master Commit: 571d5b5363ff4dbbce1f7019ab8e86cbc3cba4d5 Parents: 1833d9c Author: Mike Dusenberry mwdus...@us.ibm.com Authored: Tue Aug 4 16:30:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 16:30:03 2015 -0700 -- dev/sparktestsupport/modules.py | 1 + docs/mllib-data-types.md| 106 +++- .../spark/mllib/api/python/PythonMLLibAPI.scala | 53 +- python/docs/pyspark.mllib.rst | 8 + python/pyspark/mllib/common.py | 2 + python/pyspark/mllib/linalg/distributed.py | 537 +++ 6 files changed, 704 insertions(+), 3 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/571d5b53/dev/sparktestsupport
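The "thin wrapper around the Java object" design described in this commit message can be sketched as follows. This is an illustrative Python skeleton, not the actual PySpark classes: the names `DistributedMatrix`, `num_rows`, and `FakeJavaRowMatrix` are ours, and the real wrappers reach the JVM object through Py4J rather than a plain Python stand-in.

```python
class DistributedMatrix:
    """Thin wrapper: holds a reference to a backing (Java) object taken
    as a constructor argument and delegates all queries to it."""
    def __init__(self, java_model):
        self._java_model = java_model

    def num_rows(self):
        return self._java_model.numRows()

    def num_cols(self):
        return self._java_model.numCols()

class FakeJavaRowMatrix:
    """Stand-in for the JVM-side RowMatrix reached through Py4J."""
    def numRows(self):
        return 3

    def numCols(self):
        return 2

m = DistributedMatrix(FakeJavaRowMatrix())
print(m.num_rows(), m.num_cols())  # 3 2
```

Because the Python object is only a handle, conversions between matrix types reduce to calling a JVM method and wrapping whatever Java object comes back, which is the design choice the commit message motivates.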
spark git commit: [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
conversion logic. 4d7af86 [Mike Dusenberry] Adding type checks to the constructors. Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace. 93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request. f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request. 6a3ecb7 [Mike Dusenberry] Updating pattern matching. 08f287b [Mike Dusenberry] Slight reformatting of the documentation. a245dc0 [Mike Dusenberry] Updating Python doctests for compatibility between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputted as one (ex: '4'). The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output. This is fine since the values are all small, and thus can be easily represented as ints. 4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines. 7e3ca16 [Mike Dusenberry] Fixing long lines. f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices. ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful. dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices. Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests. 0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization. 3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier. The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction. 
This way, we can call a conversion method on, for example, one distributed matrix type, which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object. This is analogous to the behavior of PySpark RDDs and DataFrames. We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on the DistributedMatrices class. 4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API. Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix. 23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs. b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix. Updating DistributedMatrices factory methods to accept numRows and numCols with default values. Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters. bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods. d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices. Added a factory method for creating a RowMatrix from an RDD of Vectors. Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method. Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg API. 
(cherry picked from commit 571d5b5363ff4dbbce1f7019ab8e86cbc3cba4d5) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/f4e125ac Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/f4e125ac Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/f4e125ac Branch: refs/heads/branch-1.5 Commit: f4e125acf36023425722abb0fb74be63a425aa7b Parents: fe4a4f4 Author: Mike Dusenberry mwdus...@us.ibm.com Authored: Tue Aug 4 16:30:03 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 16:30:11 2015 -0700 -- dev/sparktestsupport/modules.py | 1 + docs/mllib-data-types.md| 106 +++- .../spark/mllib/api/python/PythonMLLibAPI.scala | 53 +- python/docs/pyspark.mllib.rst | 8 + python/pyspark/mllib/common.py | 2 + python/pyspark/mllib/linalg/distributed.py | 537 +++ 6 files changed, 704 insertions(+), 3 deletions
spark git commit: [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol
Repository: spark Updated Branches: refs/heads/branch-1.5 f4e125acf - cff0fe291 [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol. I don't think setScoreCol was actually used anywhere (based on search). CC: mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following commits: e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol (cherry picked from commit b77d3b9688d56d33737909375d1d0db07da5827b) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cff0fe29 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cff0fe29 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cff0fe29 Branch: refs/heads/branch-1.5 Commit: cff0fe291aa470ef5cf4e5087c7114fb6360572f Parents: f4e125a Author: Joseph K. 
Bradley jos...@databricks.com Authored: Tue Aug 4 16:52:43 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 16:52:53 2015 -0700 -- .../spark/ml/evaluation/BinaryClassificationEvaluator.scala | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/cff0fe29/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala index 4a82b77..5d5cb7e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.types.DoubleType /** * :: Experimental :: - * Evaluator for binary classification, which expects two input columns: score and label. + * Evaluator for binary classification, which expects two input columns: rawPrediction and label. */ @Experimental class BinaryClassificationEvaluator(override val uid: String) @@ -50,6 +50,13 @@ class BinaryClassificationEvaluator(override val uid: String) def setMetricName(value: String): this.type = set(metricName, value) /** @group setParam */ + def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value) + + /** + * @group setParam + * @deprecated use [[setRawPredictionCol()]] instead + */ + @deprecated(use setRawPredictionCol instead, 1.5.0) def setScoreCol(value: String): this.type = set(rawPredictionCol, value) /** @group setParam */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
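The diff above shows a common rename-with-deprecation pattern: the new setter writes the parameter, and the old setter becomes a deprecated alias that delegates to it, so existing callers keep working while being warned. A hypothetical Python analog (not PySpark's actual evaluator; the class and method names here are ours) looks like this:

```python
import warnings

class BinaryClassificationEvaluatorSketch:
    """Illustrates the rename pattern: the deprecated old setter writes
    to the *new* parameter, so both paths end up in the same place."""
    def __init__(self):
        self._params = {}

    def set_raw_prediction_col(self, value):
        # The new, preferred setter.
        self._params["rawPredictionCol"] = value
        return self

    def set_score_col(self, value):
        # Deprecated alias: warn, then delegate to the new setter.
        warnings.warn("use set_raw_prediction_col instead",
                      DeprecationWarning, stacklevel=2)
        return self.set_raw_prediction_col(value)

ev = BinaryClassificationEvaluatorSketch().set_score_col("score")
print(ev._params)  # {'rawPredictionCol': 'score'}
```

Keeping only one underlying parameter (as the Scala code does with `rawPredictionCol`) means there is no state to migrate when the old name is eventually removed.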
spark git commit: [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol
Repository: spark Updated Branches: refs/heads/master 571d5b536 - b77d3b968 [SPARK-9586] [ML] Update BinaryClassificationEvaluator to use setRawPredictionCol Update BinaryClassificationEvaluator to use setRawPredictionCol, rather than setScoreCol. Deprecated setScoreCol. I don't think setScoreCol was actually used anywhere (based on search). CC: mengxr Author: Joseph K. Bradley jos...@databricks.com Closes #7921 from jkbradley/binary-eval-rawpred and squashes the following commits: e5d7dfa [Joseph K. Bradley] Update BinaryClassificationEvaluator to use setRawPredictionCol Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/b77d3b96 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/b77d3b96 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/b77d3b96 Branch: refs/heads/master Commit: b77d3b9688d56d33737909375d1d0db07da5827b Parents: 571d5b5 Author: Joseph K. Bradley jos...@databricks.com Authored: Tue Aug 4 16:52:43 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Tue Aug 4 16:52:43 2015 -0700 -- .../spark/ml/evaluation/BinaryClassificationEvaluator.scala | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/b77d3b96/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala index 4a82b77..5d5cb7e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.types.DoubleType /** * :: Experimental :: - * Evaluator for binary classification, which expects two input columns: score and label. 
+ * Evaluator for binary classification, which expects two input columns: rawPrediction and label. */ @Experimental class BinaryClassificationEvaluator(override val uid: String) @@ -50,6 +50,13 @@ class BinaryClassificationEvaluator(override val uid: String) def setMetricName(value: String): this.type = set(metricName, value) /** @group setParam */ + def setRawPredictionCol(value: String): this.type = set(rawPredictionCol, value) + + /** + * @group setParam + * @deprecated use [[setRawPredictionCol()]] instead + */ + @deprecated(use setRawPredictionCol instead, 1.5.0) def setScoreCol(value: String): this.type = set(rawPredictionCol, value) /** @group setParam */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
spark git commit: [SPARK-9544] [MLLIB] add Python API for RFormula
Repository: spark Updated Branches: refs/heads/branch-1.5 444058d91 - dc0c8c982 [SPARK-9544] [MLLIB] add Python API for RFormula Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder Author: Xiangrui Meng m...@databricks.com Closes #7879 from mengxr/SPARK-9544 and squashes the following commits: 3d5ff03 [Xiangrui Meng] add an doctest for . and - 5e969a5 [Xiangrui Meng] fix pydoc 1cd41f8 [Xiangrui Meng] organize imports 3c18b10 [Xiangrui Meng] add Python API for RFormula (cherry picked from commit e4765a46833baff1dd7465c4cf50e947de7e8f21) Signed-off-by: Xiangrui Meng m...@databricks.com Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/dc0c8c98 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/dc0c8c98 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/dc0c8c98 Branch: refs/heads/branch-1.5 Commit: dc0c8c982825c3c58b7c6c4570c03ba97dba608b Parents: 444058d Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 3 13:59:35 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 3 13:59:45 2015 -0700 -- .../org/apache/spark/ml/feature/RFormula.scala | 21 ++--- python/pyspark/ml/feature.py| 85 +++- 2 files changed, 91 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/dc0c8c98/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index d172691..d5360c9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -19,16 +19,14 @@ package org.apache.spark.ml.feature import scala.collection.mutable import scala.collection.mutable.ArrayBuffer -import scala.util.parsing.combinator.RegexParsers import 
org.apache.spark.annotation.Experimental -import org.apache.spark.ml.{Estimator, Model, Transformer, Pipeline, PipelineModel, PipelineStage} +import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} import org.apache.spark.ml.param.{Param, ParamMap} import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} import org.apache.spark.ml.util.Identifiable import org.apache.spark.mllib.linalg.VectorUDT import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ /** @@ -63,31 +61,26 @@ class RFormula(override val uid: String) extends Estimator[RFormulaModel] with R */ val formula: Param[String] = new Param(this, formula, R model formula) - private var parsedFormula: Option[ParsedRFormula] = None - /** * Sets the formula to use for this transformer. Must be called before use. * @group setParam * @param value an R formula in string form (e.g. y ~ x + z) */ - def setFormula(value: String): this.type = { -parsedFormula = Some(RFormulaParser.parse(value)) -set(formula, value) -this - } + def setFormula(value: String): this.type = set(formula, value) /** @group getParam */ def getFormula: String = $(formula) /** Whether the formula specifies fitting an intercept. */ private[ml] def hasIntercept: Boolean = { -require(parsedFormula.isDefined, Must call setFormula() first.) -parsedFormula.get.hasIntercept +require(isDefined(formula), Formula must be defined first.) +RFormulaParser.parse($(formula)).hasIntercept } override def fit(dataset: DataFrame): RFormulaModel = { -require(parsedFormula.isDefined, Must call setFormula() first.) -val resolvedFormula = parsedFormula.get.resolve(dataset.schema) +require(isDefined(formula), Formula must be defined first.) +val parsedFormula = RFormulaParser.parse($(formula)) +val resolvedFormula = parsedFormula.resolve(dataset.schema) // StringType terms and terms representing interactions need to be encoded before assembly. 
// TODO(ekl) add support for feature interactions val encoderStages = ArrayBuffer[PipelineStage]() http://git-wip-us.apache.org/repos/asf/spark/blob/dc0c8c98/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 015e7a9..3f04c41 100644 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -24,7 +24,7 @@ from pyspark.mllib.common import inherit_doc __all__ = ['Binarizer', 'HashingTF', 'IDF', 'IDFModel', 'NGram', 'Normalizer', 'OneHotEncoder
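The RFormula refactoring in this diff replaces cached parse state (the `parsedFormula: Option[...]` field) with parsing on demand, so `setFormula` only stores the string and nothing can fall out of sync with the parameter. A rough Python sketch of the same idea, with a deliberately toy parser standing in for `RFormulaParser.parse` (names and parsing rules here are ours, not Spark's):

```python
class RFormulaSketch:
    """setFormula only stores the string; parsing happens on demand,
    mirroring the 'parse at use time' refactoring."""
    def __init__(self):
        self._formula = None

    def set_formula(self, value):
        self._formula = value  # store only; no eager parse, no cache
        return self

    def _parse(self):
        # Stand-in for RFormulaParser.parse: 'y ~ x + z' -> label + terms.
        if self._formula is None:
            raise ValueError("Formula must be defined first.")
        label, rhs = self._formula.split("~")
        return label.strip(), [t.strip() for t in rhs.split("+")]

    def has_intercept(self):
        # Toy rule: a literal '0' term suppresses the intercept.
        # (The real parser handles much more than this.)
        return "0" not in self._parse()[1]

r = RFormulaSketch().set_formula("y ~ x + z")
print(r._parse())  # ('y', ['x', 'z'])
```

The trade-off is re-parsing on each use, which is cheap for a short formula string and removes a whole class of "set but not parsed" bugs.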
spark git commit: [SPARK-9544] [MLLIB] add Python API for RFormula
Repository: spark Updated Branches: refs/heads/master 8ca287ebb - e4765a468 [SPARK-9544] [MLLIB] add Python API for RFormula Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder Author: Xiangrui Meng m...@databricks.com Closes #7879 from mengxr/SPARK-9544 and squashes the following commits: 3d5ff03 [Xiangrui Meng] add an doctest for . and - 5e969a5 [Xiangrui Meng] fix pydoc 1cd41f8 [Xiangrui Meng] organize imports 3c18b10 [Xiangrui Meng] add Python API for RFormula Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e4765a46 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e4765a46 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e4765a46 Branch: refs/heads/master Commit: e4765a46833baff1dd7465c4cf50e947de7e8f21 Parents: 8ca287e Author: Xiangrui Meng m...@databricks.com Authored: Mon Aug 3 13:59:35 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Mon Aug 3 13:59:35 2015 -0700 -- .../org/apache/spark/ml/feature/RFormula.scala | 21 ++--- python/pyspark/ml/feature.py| 85 +++- 2 files changed, 91 insertions(+), 15 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/e4765a46/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala index d172691..d5360c9 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala @@ -19,16 +19,14 @@ package org.apache.spark.ml.feature import scala.collection.mutable import scala.collection.mutable.ArrayBuffer -import scala.util.parsing.combinator.RegexParsers import org.apache.spark.annotation.Experimental -import org.apache.spark.ml.{Estimator, Model, Transformer, Pipeline, PipelineModel, 
PipelineStage} +import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer} import org.apache.spark.ml.param.{Param, ParamMap} import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol} import org.apache.spark.ml.util.Identifiable import org.apache.spark.mllib.linalg.VectorUDT import org.apache.spark.sql.DataFrame -import org.apache.spark.sql.functions._ import org.apache.spark.sql.types._ /** @@ -63,31 +61,26 @@ class RFormula(override val uid: String) extends Estimator[RFormulaModel] with R */ val formula: Param[String] = new Param(this, formula, R model formula) - private var parsedFormula: Option[ParsedRFormula] = None - /** * Sets the formula to use for this transformer. Must be called before use. * @group setParam * @param value an R formula in string form (e.g. y ~ x + z) */ - def setFormula(value: String): this.type = { -parsedFormula = Some(RFormulaParser.parse(value)) -set(formula, value) -this - } + def setFormula(value: String): this.type = set(formula, value) /** @group getParam */ def getFormula: String = $(formula) /** Whether the formula specifies fitting an intercept. */ private[ml] def hasIntercept: Boolean = { -require(parsedFormula.isDefined, Must call setFormula() first.) -parsedFormula.get.hasIntercept +require(isDefined(formula), Formula must be defined first.) +RFormulaParser.parse($(formula)).hasIntercept } override def fit(dataset: DataFrame): RFormulaModel = { -require(parsedFormula.isDefined, Must call setFormula() first.) -val resolvedFormula = parsedFormula.get.resolve(dataset.schema) +require(isDefined(formula), Formula must be defined first.) +val parsedFormula = RFormulaParser.parse($(formula)) +val resolvedFormula = parsedFormula.resolve(dataset.schema) // StringType terms and terms representing interactions need to be encoded before assembly. 
// TODO(ekl) add support for feature interactions val encoderStages = ArrayBuffer[PipelineStage]() http://git-wip-us.apache.org/repos/asf/spark/blob/e4765a46/python/pyspark/ml/feature.py -- diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py index 015e7a9..3f04c41 100644 --- a/python/pyspark/ml/feature.py +++ b/python/pyspark/ml/feature.py @@ -24,7 +24,7 @@ from pyspark.mllib.common import inherit_doc __all__ = ['Binarizer', 'HashingTF', 'IDF', 'IDFModel', 'NGram', 'Normalizer', 'OneHotEncoder', 'PolynomialExpansion', 'RegexTokenizer', 'StandardScaler', 'StandardScalerModel', 'StringIndexer', 'StringIndexerModel
spark git commit: [SPARK-9000] [MLLIB] Support generic item types in PrefixSpan
Repository: spark Updated Branches: refs/heads/master 57084e0c7 - 28d944e86 [SPARK-9000] [MLLIB] Support generic item types in PrefixSpan mengxr Please review after #7818 merges and master is rebased. Continues work by rikima Closes #7400 Author: Feynman Liang fli...@databricks.com Author: masaki rikitoku rikima3...@gmail.com Closes #7837 from feynmanliang/SPARK-7400-genericItems and squashes the following commits: 8b2c756 [Feynman Liang] Remove orig 92443c8 [Feynman Liang] Style fixes 42c6349 [Feynman Liang] Style fix 14e67fc [Feynman Liang] Generic prefixSpan itemtypes b3b21e0 [Feynman Liang] Initial support for generic itemtype in public api b86e0d5 [masaki rikitoku] modify to support generic item type Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/28d944e8 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/28d944e8 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/28d944e8 Branch: refs/heads/master Commit: 28d944e86d066eb4c651dd803f0b022605ed644e Parents: 57084e0 Author: Feynman Liang fli...@databricks.com Authored: Sat Aug 1 23:11:25 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Sat Aug 1 23:11:25 2015 -0700 -- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 40 ++- .../spark/mllib/fpm/PrefixSpanSuite.scala | 104 +-- 2 files changed, 132 insertions(+), 12 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/28d944e8/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala index 22b4ddb..c1761c3 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala @@ -18,6 +18,7 @@ package org.apache.spark.mllib.fpm import scala.collection.mutable.ArrayBuilder +import scala.reflect.ClassTag import 
org.apache.spark.Logging import org.apache.spark.annotation.Experimental @@ -90,15 +91,44 @@ class PrefixSpan private ( } /** - * Find the complete set of sequential patterns in the input sequences. + * Find the complete set of sequential patterns in the input sequences of itemsets. + * @param data ordered sequences of itemsets. + * @return (sequential itemset pattern, count) tuples + */ + def run[Item: ClassTag](data: RDD[Array[Array[Item]]]): RDD[(Array[Array[Item]], Long)] = { +val itemToInt = data.aggregate(Set[Item]())( + seqOp = { (uniqItems, item) = uniqItems ++ item.flatten.toSet }, + combOp = { _ ++ _ } +).zipWithIndex.toMap +val intToItem = Map() ++ (itemToInt.map { case (k, v) = (v, k) }) + +val dataInternalRepr = data.map { seq = + seq.map(itemset = itemset.map(itemToInt)).reduce((a, b) = a ++ (DELIMITER +: b)) +} +val results = run(dataInternalRepr) + +def toPublicRepr(pattern: Iterable[Int]): List[Array[Item]] = { + pattern.span(_ != DELIMITER) match { +case (x, xs) if xs.size 1 = x.map(intToItem).toArray :: toPublicRepr(xs.tail) +case (x, xs) = List(x.map(intToItem).toArray) + } +} +results.map { case (seq: Array[Int], count: Long) = + (toPublicRepr(seq).toArray, count) +} + } + + /** + * Find the complete set of sequential patterns in the input sequences. This method utilizes + * the internal representation of itemsets as Array[Int] where each itemset is represented by + * a contiguous sequence of non-negative integers and delimiters represented by [[DELIMITER]]. * @param data ordered sequences of itemsets. Items are represented by non-negative integers. - * Each itemset has one or more items and is delimited by [[DELIMITER]]. + * Each itemset has one or more items and is delimited by [[DELIMITER]]. * @return a set of sequential pattern pairs, * the key of pair is pattern (a list of elements), * the value of pair is the pattern's count. 
*/ - // TODO: generalize to arbitrary item-types and use mapping to Ints for internal algorithm - def run(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = { + private[fpm] def run(data: RDD[Array[Int]]): RDD[(Array[Int], Long)] = { val sc = data.sparkContext if (data.getStorageLevel == StorageLevel.NONE) { @@ -260,7 +290,7 @@ class PrefixSpan private ( private[fpm] object PrefixSpan { private[fpm] val DELIMITER = -1 - /** Splits a sequence of itemsets delimited by [[DELIMITER]]. */ + /** Splits an array of itemsets delimited by [[DELIMITER]]. */ private[fpm] def splitSequence(sequence: List[Int]): List[Set[Int]] = { sequence.span
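The commit above replaces the Int-only public `run` with a generic one: distinct items are mapped to Ints, each sequence of itemsets is flattened into a delimiter-separated Int array for the internal algorithm, and results are mapped back. A minimal single-machine Python stand-in for that encode/decode round trip (plain lists instead of RDDs; `encode`/`decode` are hypothetical names, not Spark API):

```python
DELIMITER = -1

def encode(sequences):
    """Map arbitrary hashable items to ints and flatten each sequence of
    itemsets into one delimiter-separated int list (mirrors the diff's
    itemToInt/intToItem maps; sorted() makes the mapping deterministic)."""
    items = sorted({it for seq in sequences for itemset in seq for it in itemset})
    item_to_int = {it: i for i, it in enumerate(items)}
    int_to_item = {i: it for it, i in item_to_int.items()}
    encoded = []
    for seq in sequences:
        flat = []
        for k, itemset in enumerate(seq):
            if k > 0:
                flat.append(DELIMITER)  # itemset boundary
            flat.extend(item_to_int[it] for it in itemset)
        encoded.append(flat)
    return encoded, int_to_item

def decode(flat, int_to_item):
    """Split a delimiter-separated int list back into itemsets of the
    original item type (the toPublicRepr direction)."""
    itemsets, current = [], []
    for x in flat:
        if x == DELIMITER:
            itemsets.append([int_to_item[i] for i in current])
            current = []
        else:
            current.append(x)
    itemsets.append([int_to_item[i] for i in current])
    return itemsets
```

For example, `[["a","b"],["c"]]` encodes to `[0, 1, -1, 2]` and decodes back unchanged.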
spark git commit: [SPARK-9527] [MLLIB] add PrefixSpanModel and make PrefixSpan Java friendly
Repository: spark Updated Branches: refs/heads/master 8eafa2aeb - 66924ffa6 [SPARK-9527] [MLLIB] add PrefixSpanModel and make PrefixSpan Java friendly 1. Use `PrefixSpanModel` to wrap the frequent sequences. 2. Define `FreqSequence` to wrap each frequent sequence, which contains a Java-friendly method `javaSequence` 3. Overload `run` for Java users. 4. Added a unit test in Java to check Java compatibility. zhangjiajin feynmanliang Author: Xiangrui Meng m...@databricks.com Closes #7869 from mengxr/SPARK-9527 and squashes the following commits: 4345594 [Xiangrui Meng] add PrefixSpanModel and make PrefixSpan Java friendly Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/66924ffa Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/66924ffa Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/66924ffa Branch: refs/heads/master Commit: 66924ffa6bdb8e0df1b90b789cb7ad443377e729 Parents: 8eafa2a Author: Xiangrui Meng m...@databricks.com Authored: Sun Aug 2 11:50:17 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Sun Aug 2 11:50:17 2015 -0700 -- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 52 +-- .../spark/mllib/fpm/JavaPrefixSpanSuite.java| 67 .../spark/mllib/fpm/PrefixSpanSuite.scala | 8 +-- 3 files changed, 118 insertions(+), 9 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/66924ffa/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala index c1761c3..9eaf733 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala @@ -17,11 +17,16 @@ package org.apache.spark.mllib.fpm +import java.{lang = jl, util = ju} + +import scala.collection.JavaConverters._ import scala.collection.mutable.ArrayBuilder import 
scala.reflect.ClassTag import org.apache.spark.Logging import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.java.JavaSparkContext.fakeClassTag import org.apache.spark.rdd.RDD import org.apache.spark.storage.StorageLevel @@ -93,9 +98,9 @@ class PrefixSpan private ( /** * Find the complete set of sequential patterns in the input sequences of itemsets. * @param data ordered sequences of itemsets. - * @return (sequential itemset pattern, count) tuples + * @return a [[PrefixSpanModel]] that contains the frequent sequences */ - def run[Item: ClassTag](data: RDD[Array[Array[Item]]]): RDD[(Array[Array[Item]], Long)] = { + def run[Item: ClassTag](data: RDD[Array[Array[Item]]]): PrefixSpanModel[Item] = { val itemToInt = data.aggregate(Set[Item]())( seqOp = { (uniqItems, item) = uniqItems ++ item.flatten.toSet }, combOp = { _ ++ _ } @@ -113,9 +118,25 @@ class PrefixSpan private ( case (x, xs) = List(x.map(intToItem).toArray) } } -results.map { case (seq: Array[Int], count: Long) = - (toPublicRepr(seq).toArray, count) +val freqSequences = results.map { case (seq: Array[Int], count: Long) = + new FreqSequence[Item](toPublicRepr(seq).toArray, count) } +new PrefixSpanModel[Item](freqSequences) + } + + /** + * A Java-friendly version of [[run()]] that reads sequences from a [[JavaRDD]] and returns + * frequent sequences in a [[PrefixSpanModel]]. 
+ * @param data ordered sequences of itemsets stored as Java Iterable of Iterables + * @tparam Item item type + * @tparam Itemset itemset type, which is an Iterable of Items + * @tparam Sequence sequence type, which is an Iterable of Itemsets + * @return a [[PrefixSpanModel]] that contains the frequent sequences + */ + def run[Item, Itemset <: jl.Iterable[Item], Sequence <: jl.Iterable[Itemset]]( + data: JavaRDD[Sequence]): PrefixSpanModel[Item] = { +implicit val tag = fakeClassTag[Item] +run(data.rdd.map(_.asScala.map(_.asScala.toArray).toArray)) } /** @@ -287,7 +308,7 @@ class PrefixSpan private ( } -private[fpm] object PrefixSpan { +object PrefixSpan { private[fpm] val DELIMITER = -1 /** Splits an array of itemsets delimited by [[DELIMITER]]. */ @@ -313,4 +334,25 @@ private[fpm] object PrefixSpan { // TODO: improve complexity by using partial prefixes, considering one item at a time itemSet.subsets.filter(_ != Set.empty[Int]) } + + /** + * Represents a frequent sequence. + * @param sequence a sequence of itemsets stored as an Array of Arrays + * @param freq frequency + * @tparam Item item type + */ + class
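The SPARK-9527 commit wraps raw `(sequence, count)` tuples in `FreqSequence` objects held by a `PrefixSpanModel`, so Java callers get a typed result rather than Scala tuples. A minimal Python sketch of that wrapper shape (field names follow the Scala classes; the Java-friendly `javaSequence` accessor is omitted):

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

Item = TypeVar("Item")

@dataclass
class FreqSequence(Generic[Item]):
    # One frequent sequence: its itemsets plus how often it occurs.
    sequence: List[List[Item]]
    freq: int

class PrefixSpanModel(Generic[Item]):
    # Thin model wrapper over the frequent sequences, mirroring the Scala
    # PrefixSpanModel returned by run().
    def __init__(self, freq_sequences: List[FreqSequence[Item]]):
        self.freq_sequences = freq_sequences
```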
spark git commit: [SPARK-8999] [MLLIB] PrefixSpan non-temporal sequences
Repository: spark Updated Branches: refs/heads/master 65038973a - d2a9b66f6 [SPARK-8999] [MLLIB] PrefixSpan non-temporal sequences mengxr Extends PrefixSpan to non-temporal itemsets. Continues work by zhangjiajin * Internal API uses List[Set[Int]] which is likely not efficient; will need to refactor during QA Closes #7646 Author: zhangjiajin zhangjia...@huawei.com Author: Feynman Liang fli...@databricks.com Author: zhang jiajin zhangjia...@huawei.com Closes #7818 from feynmanliang/SPARK-8999-nonTemporal and squashes the following commits: 4ded81d [Feynman Liang] Replace all filters to filter nonempty 350e67e [Feynman Liang] Code review feedback 03156ca [Feynman Liang] Fix tests, drop delimiters at boundaries of sequences d1fe0ed [Feynman Liang] Remove comments 86ca4e5 [Feynman Liang] Fix style 7c7bf39 [Feynman Liang] Fixed itemSet sequences 6073b10 [Feynman Liang] Basic itemset functionality, failing test 1a7fb48 [Feynman Liang] Add delimiter to results 5db00aa [Feynman Liang] Working for items, not itemsets 6787716 [Feynman Liang] Working on temporal sequences f1114b9 [Feynman Liang] Add -1 delimiter 00fe756 [Feynman Liang] Reset base files for rebase f486dcd [zhangjiajin] change maxLocalProjDBSize and fix a bug (remove -3 from frequent items). 60a0b76 [zhangjiajin] fixed a scala style error. 740c203 [zhangjiajin] fixed a scala style error. 5785cb8 [zhangjiajin] support non-temporal sequence a5d649d [zhangjiajin] restore original version 09dc409 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into multiItems_2 ae8c02d [zhangjiajin] Fixed some Scala style errors. 216ab0c [zhangjiajin] Support non-temporal sequence in PrefixSpan b572f54 [zhangjiajin] initialize file before rebase. f06772f [zhangjiajin] fix a scala style error. a7e50d4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan. 
c1d13d0 [zhang jiajin] Delete PrefixspanSuite.scala d9d8137 [zhang jiajin] Delete Prefixspan.scala c6ceb63 [zhangjiajin] Add new algorithm PrefixSpan and test file. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d2a9b66f Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d2a9b66f Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d2a9b66f Branch: refs/heads/master Commit: d2a9b66f6c0de89d6d16370af1c77c7f51b11d3e Parents: 6503897 Author: zhangjiajin zhangjia...@huawei.com Authored: Sat Aug 1 01:56:27 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Sat Aug 1 01:56:27 2015 -0700 -- .../spark/mllib/fpm/LocalPrefixSpan.scala | 46 ++-- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 111 ++--- .../spark/mllib/fpm/PrefixSpanSuite.scala | 237 --- 3 files changed, 302 insertions(+), 92 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d2a9b66f/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala index 0ea7920..ccebf95 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala @@ -25,7 +25,7 @@ import org.apache.spark.Logging * Calculate all patterns of a projected database in local. */ private[fpm] object LocalPrefixSpan extends Logging with Serializable { - + import PrefixSpan._ /** * Calculate all patterns of a projected database. 
* @param minCount minimum count @@ -39,12 +39,19 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable { def run( minCount: Long, maxPatternLength: Int, - prefixes: List[Int], - database: Iterable[Array[Int]]): Iterator[(List[Int], Long)] = { -if (prefixes.length == maxPatternLength || database.isEmpty) return Iterator.empty -val frequentItemAndCounts = getFreqItemAndCounts(minCount, database) -val filteredDatabase = database.map(x = x.filter(frequentItemAndCounts.contains)) -frequentItemAndCounts.iterator.flatMap { case (item, count) = + prefixes: List[Set[Int]], + database: Iterable[List[Set[Int]]]): Iterator[(List[Set[Int]], Long)] = { +if (prefixes.length == maxPatternLength || database.isEmpty) { + return Iterator.empty +} +val freqItemSetsAndCounts = getFreqItemAndCounts(minCount, database) +val freqItems = freqItemSetsAndCounts.keys.flatten.toSet +val filteredDatabase = database.map { suffix = + suffix +.map(item = freqItems.intersect(item)) +.filter(_.nonEmpty) +} +freqItemSetsAndCounts.iterator.flatMap { case (item, count) = val
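The key change in the `LocalPrefixSpan.run` hunk above is that prefixes and database entries become sequences of item *sets*, and each projected suffix is filtered by intersecting every itemset with the frequent items, dropping itemsets that become empty. A small Python sketch of that filtering step (sets of ints stand in for the Scala `Set[Int]`; `filter_database` is a hypothetical helper name):

```python
def filter_database(database, freq_items):
    """Restrict every itemset in every suffix to the frequent items,
    dropping itemsets that become empty — mirrors the filteredDatabase
    step in the diff (intersect, then filter(_.nonEmpty))."""
    out = []
    for suffix in database:
        kept = [itemset & freq_items for itemset in suffix]
        out.append([s for s in kept if s])
    return out
```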
spark git commit: [SPARK-8169] [ML] Add StopWordsRemover as a transformer
Repository: spark Updated Branches: refs/heads/master d2a9b66f6 - 876566501 [SPARK-8169] [ML] Add StopWordsRemover as a transformer jira: https://issues.apache.org/jira/browse/SPARK-8169 stop words: http://en.wikipedia.org/wiki/Stop_words StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default. Currently I used a minimum stop words set since on some [case](http://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html), small set of stop words is preferred. ASCII char has been tested, Yet I cannot check it in due to style check. Further thought, 1. Maybe I should use OpenHashSet. Is it recommended? 2. Currently I leave the null in input array untouched, i.e. Array(null, null) = Array(null, null). 3. If the current stop words set looks too limited, any suggestion for replacement? We can have something similar to the one in [SKlearn](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py). 
Author: Yuhao Yang hhb...@gmail.com Closes #6742 from hhbyyh/stopwords and squashes the following commits: fa959d8 [Yuhao Yang] separating udf f190217 [Yuhao Yang] replace default list and other small fix 04403ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into stopwords b3aa957 [Yuhao Yang] add stopWordsRemover Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/87656650 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/87656650 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/87656650 Branch: refs/heads/master Commit: 8765665015ef47a23e00f7d01d4d280c31bb236d Parents: d2a9b66 Author: Yuhao Yang hhb...@gmail.com Authored: Sat Aug 1 02:31:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Sat Aug 1 02:31:28 2015 -0700 -- .../spark/ml/feature/StopWordsRemover.scala | 155 +++ .../ml/feature/StopWordsRemoverSuite.scala | 80 ++ 2 files changed, 235 insertions(+) -- http://git-wip-us.apache.org/repos/asf/spark/blob/87656650/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala new file mode 100644 index 000..3cc4142 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.annotation.Experimental +import org.apache.spark.ml.Transformer +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.param.{ParamMap, BooleanParam, Param} +import org.apache.spark.ml.util.Identifiable +import org.apache.spark.sql.DataFrame +import org.apache.spark.sql.types.{StringType, StructField, ArrayType, StructType} +import org.apache.spark.sql.functions.{col, udf} + +/** + * stop words list + */ +private object StopWords { + + /** + * Use the same default stopwords list as scikit-learn. + * The original list can be found from Glasgow Information Retrieval Group + * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]] + */ + val EnglishStopWords = Array( a, about, above, across, after, afterwards, again, +against, all, almost, alone, along, already, also, although, always, +am, among, amongst, amoungst, amount, an, and, another, +any, anyhow, anyone, anything, anyway, anywhere, are, +around, as, at, back, be, became, because, become, +becomes, becoming, been, before, beforehand, behind, being, +below, beside, besides, between, beyond, bill, both, +bottom, but, by, call, can, cannot, cant, co, con, +could, couldnt, cry, de, describe, detail, do, done, +down, due, during, each
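The transformer described above takes a string-array column, filters out a configurable stop-word set, and (per the PR discussion) leaves nulls in the input array untouched. A minimal Python stand-in for that per-row behavior, with a tiny subset of the English list and a case-sensitivity flag analogous to the transformer's `BooleanParam` (function and constant names here are illustrative, not Spark API):

```python
# Tiny illustrative subset of the default English stop-word list.
ENGLISH_STOP_WORDS = {"a", "about", "above", "the", "is", "and"}

def remove_stop_words(tokens, stop_words=ENGLISH_STOP_WORDS, case_sensitive=False):
    """Filter stop words out of one token array; None entries pass
    through unchanged, matching the PR's stated null handling."""
    if case_sensitive:
        return [t for t in tokens if t not in stop_words]
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t is None or t.lower() not in lowered]
```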
spark git commit: [SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code uses deprecated print statement
Repository: spark Updated Branches: refs/heads/master 815c8245f - 873ab0f96 [SPARK-9490] [DOCS] [MLLIB] MLlib evaluation metrics guide example python code uses deprecated print statement Use print(x) not print x for Python 3 in eval examples CC sethah mengxr -- just wanted to close this out before 1.5 Author: Sean Owen so...@cloudera.com Closes #7822 from srowen/SPARK-9490 and squashes the following commits: 01abeba [Sean Owen] Change print x to print(x) in the rest of the docs too bd7f7fb [Sean Owen] Use print(x) not print x for Python 3 in eval examples Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/873ab0f9 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/873ab0f9 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/873ab0f9 Branch: refs/heads/master Commit: 873ab0f9692d8ea6220abdb8d9200041068372a8 Parents: 815c824 Author: Sean Owen so...@cloudera.com Authored: Fri Jul 31 13:45:28 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Fri Jul 31 13:45:28 2015 -0700 -- docs/ml-guide.md| 2 +- docs/mllib-evaluation-metrics.md| 66 docs/mllib-feature-extraction.md| 2 +- docs/mllib-statistics.md| 20 +- docs/quick-start.md | 2 +- docs/sql-programming-guide.md | 6 +-- docs/streaming-programming-guide.md | 2 +- 7 files changed, 50 insertions(+), 50 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/873ab0f9/docs/ml-guide.md -- diff --git a/docs/ml-guide.md b/docs/ml-guide.md index 8c46adf..b6ca50e 100644 --- a/docs/ml-guide.md +++ b/docs/ml-guide.md @@ -561,7 +561,7 @@ test = sc.parallelize([(4L, spark i j k), prediction = model.transform(test) selected = prediction.select(id, text, prediction) for row in selected.collect(): -print row +print(row) sc.stop() {% endhighlight %} http://git-wip-us.apache.org/repos/asf/spark/blob/873ab0f9/docs/mllib-evaluation-metrics.md -- diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md index 
4ca0bb0..7066d5c 100644 --- a/docs/mllib-evaluation-metrics.md +++ b/docs/mllib-evaluation-metrics.md @@ -302,10 +302,10 @@ predictionAndLabels = test.map(lambda lp: (float(model.predict(lp.features)), lp metrics = BinaryClassificationMetrics(predictionAndLabels) # Area under precision-recall curve -print Area under PR = %s % metrics.areaUnderPR +print(Area under PR = %s % metrics.areaUnderPR) # Area under ROC curve -print Area under ROC = %s % metrics.areaUnderROC +print(Area under ROC = %s % metrics.areaUnderROC) {% endhighlight %} @@ -606,24 +606,24 @@ metrics = MulticlassMetrics(predictionAndLabels) precision = metrics.precision() recall = metrics.recall() f1Score = metrics.fMeasure() -print Summary Stats -print Precision = %s % precision -print Recall = %s % recall -print F1 Score = %s % f1Score +print(Summary Stats) +print(Precision = %s % precision) +print(Recall = %s % recall) +print(F1 Score = %s % f1Score) # Statistics by class labels = data.map(lambda lp: lp.label).distinct().collect() for label in sorted(labels): -print Class %s precision = %s % (label, metrics.precision(label)) -print Class %s recall = %s % (label, metrics.recall(label)) -print Class %s F1 Measure = %s % (label, metrics.fMeasure(label, beta=1.0)) +print(Class %s precision = %s % (label, metrics.precision(label))) +print(Class %s recall = %s % (label, metrics.recall(label))) +print(Class %s F1 Measure = %s % (label, metrics.fMeasure(label, beta=1.0))) # Weighted stats -print Weighted recall = %s % metrics.weightedRecall -print Weighted precision = %s % metrics.weightedPrecision -print Weighted F(1) Score = %s % metrics.weightedFMeasure() -print Weighted F(0.5) Score = %s % metrics.weightedFMeasure(beta=0.5) -print Weighted false positive rate = %s % metrics.weightedFalsePositiveRate +print(Weighted recall = %s % metrics.weightedRecall) +print(Weighted precision = %s % metrics.weightedPrecision) +print(Weighted F(1) Score = %s % metrics.weightedFMeasure()) +print(Weighted F(0.5) Score = 
%s % metrics.weightedFMeasure(beta=0.5)) +print(Weighted false positive rate = %s % metrics.weightedFalsePositiveRate) {% endhighlight %} /div @@ -881,28 +881,28 @@ scoreAndLabels = sc.parallelize([ metrics = MultilabelMetrics(scoreAndLabels) # Summary stats -print Recall = %s % metrics.recall() -print Precision = %s % metrics.precision() -print F1 measure = %s % metrics.f1Measure() -print Accuracy = %s % metrics.accuracy +print(Recall = %s % metrics.recall()) +print(Precision = %s % metrics.precision()) +print(F1 measure = %s
spark git commit: [SPARK-8998] [MLLIB] Distribute PrefixSpan computation for large projected databases
Repository: spark Updated Branches: refs/heads/master c5815930b - d212a3142 [SPARK-8998] [MLLIB] Distribute PrefixSpan computation for large projected databases Continuation of work by zhangjiajin Closes #7412 Author: zhangjiajin zhangjia...@huawei.com Author: Feynman Liang fli...@databricks.com Author: zhang jiajin zhangjia...@huawei.com Closes #7783 from feynmanliang/SPARK-8998-improve-distributed and squashes the following commits: a61943d [Feynman Liang] Collect small patterns to local 4ddf479 [Feynman Liang] Parallelize freqItemCounts ad23aa9 [zhang jiajin] Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal 87fa021 [Feynman Liang] Improve extend prefix readability c2caa5c [Feynman Liang] Readability improvements and comments 1235cfc [Feynman Liang] Use Iterable[Array[_]] over Array[Array[_]] for database da0091b [Feynman Liang] Use lists for prefixes to reuse data cb2a4fc [Feynman Liang] Inline code for readability 01c9ae9 [Feynman Liang] Add getters 6e149fa [Feynman Liang] Fix splitPrefixSuffixPairs 64271b3 [zhangjiajin] Modified codes according to comments. d2250b7 [zhangjiajin] remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing. b07e20c [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes 095aa3a [zhangjiajin] Modified the code according to the review comments. baa2885 [zhangjiajin] Modified the code according to the review comments. 6560c69 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixeSpan a8fde87 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark 4dd1c8a [zhangjiajin] initialize file before rebase. 078d410 [zhangjiajin] fix a scala style error. 22b0ef4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan. ca9c4c8 [zhangjiajin] Modified the code according to the review comments. 574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization. 
ba5df34 [zhangjiajin] Fix a Scala style error. 4c60fb3 [zhangjiajin] Fix some Scala style errors. 1dd33ad [zhangjiajin] Modified the code according to the review comments. 89bc368 [zhangjiajin] Fixed a Scala style error. a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala 951fd42 [zhang jiajin] Delete Prefixspan.scala 575995f [zhangjiajin] Modified the code according to the review comments. 91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d212a314 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d212a314 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d212a314 Branch: refs/heads/master Commit: d212a314227dec26c0dbec8ed3422d0ec8f818f9 Parents: c581593 Author: zhangjiajin zhangjia...@huawei.com Authored: Thu Jul 30 08:14:09 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Jul 30 08:14:09 2015 -0700 -- .../spark/mllib/fpm/LocalPrefixSpan.scala | 6 +- .../org/apache/spark/mllib/fpm/PrefixSpan.scala | 203 ++- .../spark/mllib/fpm/PrefixSpanSuite.scala | 21 +- 3 files changed, 161 insertions(+), 69 deletions(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d212a314/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala index 7ead632..0ea7920 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/LocalPrefixSpan.scala @@ -40,7 +40,7 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable { minCount: Long, maxPatternLength: Int, prefixes: List[Int], - database: Array[Array[Int]]): Iterator[(List[Int], Long)] = { + database: Iterable[Array[Int]]): Iterator[(List[Int], Long)] = { if (prefixes.length == maxPatternLength || 
database.isEmpty) return Iterator.empty val frequentItemAndCounts = getFreqItemAndCounts(minCount, database) val filteredDatabase = database.map(x = x.filter(frequentItemAndCounts.contains)) @@ -67,7 +67,7 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable { } } - def project(database: Array[Array[Int]], prefix: Int): Array[Array[Int]] = { + def project(database: Iterable[Array[Int]], prefix: Int): Iterable[Array[Int]] = { database .map(getSuffix(prefix, _)) .filter(_.nonEmpty) @@ -81,7 +81,7 @@ private[fpm] object LocalPrefixSpan extends Logging with Serializable { */ private def getFreqItemAndCounts( minCount: Long
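Central to the distributed PrefixSpan refactor above is database projection: for a candidate prefix item, each sequence is cut down to its suffix after that item, and empty suffixes are discarded (the `project`/`getSuffix` pair in `LocalPrefixSpan`). A compact Python sketch of that operation over plain Int sequences:

```python
def get_suffix(prefix_item, sequence):
    """Return the suffix of `sequence` after the first occurrence of
    `prefix_item`, or an empty list if the item never occurs."""
    try:
        i = sequence.index(prefix_item)
    except ValueError:
        return []
    return sequence[i + 1:]

def project(database, prefix_item):
    # Project every sequence and keep only the non-empty suffixes,
    # as in LocalPrefixSpan.project.
    return [s for s in (get_suffix(prefix_item, seq) for seq in database) if s]
```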
spark git commit: [SPARK-7368] [MLLIB] Add QR decomposition for RowMatrix
Repository: spark Updated Branches: refs/heads/master 6175d6cfe - d31c618e3 [SPARK-7368] [MLLIB] Add QR decomposition for RowMatrix jira: https://issues.apache.org/jira/browse/SPARK-7368 Add QR decomposition for RowMatrix. I'm not sure what's the blueprint about the distributed Matrix from community and whether this will be a desirable feature , so I sent a prototype for discussion. I'll go on polish the code and provide ut and performance statistics if it's acceptable. The implementation refers to the [paper: https://www.cs.purdue.edu/homes/dgleich/publications/Benson%202013%20-%20direct-tsqr.pdf] Austin R. Benson, David F. Gleich, James Demmel. Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures, 2013 IEEE International Conference on Big Data, which is a stable algorithm with good scalability. Currently I tried it on a 40 * 500 rowMatrix (16 partitions) and it can bring down the computation time from 8.8 mins (using breeze.linalg.qr.reduced) to 2.6 mins on a 4 worker cluster. I think there will still be some room for performance improvement. Any trial and suggestion is welcome. 
Author: Yuhao Yang hhb...@gmail.com Closes #5909 from hhbyyh/qrDecomposition and squashes the following commits: cec797b [Yuhao Yang] remove unnecessary qr 0fb1012 [Yuhao Yang] hierarchy R computing 3fbdb61 [Yuhao Yang] update qr to indirect and add ut 0d913d3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition 39213c3 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition c0fc0c7 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into qrDecomposition 39b0b22 [Yuhao Yang] initial draft for discussion Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/d31c618e Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/d31c618e Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/d31c618e Branch: refs/heads/master Commit: d31c618e3c8838f8198556876b9dcbbbf835f7b2 Parents: 6175d6c Author: Yuhao Yang hhb...@gmail.com Authored: Thu Jul 30 07:49:10 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Jul 30 07:49:10 2015 -0700 -- .../linalg/SingularValueDecomposition.scala | 8 .../mllib/linalg/distributed/RowMatrix.scala| 46 +++- .../linalg/distributed/RowMatrixSuite.scala | 17 3 files changed, 70 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/d31c618e/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala index 9669c36..b416d50 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.scala @@ -25,3 +25,11 @@ import org.apache.spark.annotation.Experimental */ @Experimental case class SingularValueDecomposition[UType, VType](U: UType, s: Vector, V: VType) + 
+/** + * :: Experimental :: + * Represents QR factors. + */ +@Experimental +case class QRDecomposition[UType, VType](Q: UType, R: VType) + http://git-wip-us.apache.org/repos/asf/spark/blob/d31c618e/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala index 1626da9..bfc90c9 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala @@ -22,7 +22,7 @@ import java.util.Arrays import scala.collection.mutable.ListBuffer import breeze.linalg.{DenseMatrix = BDM, DenseVector = BDV, SparseVector = BSV, axpy = brzAxpy, - svd = brzSvd} + svd = brzSvd, MatrixSingularException, inv} import breeze.numerics.{sqrt = brzSqrt} import com.github.fommil.netlib.BLAS.{getInstance = blas} @@ -498,6 +498,50 @@ class RowMatrix( } /** + * Compute QR decomposition for [[RowMatrix]]. The implementation is designed to optimize the QR + * decomposition (factorization) for the [[RowMatrix]] of a tall and skinny shape. + * Reference: + * Paul G. Constantine, David F. Gleich. Tall and skinny QR factorizations in MapReduce + * architectures ([[http://dx.doi.org/10.1145/1996092.1996103]]) + * + * @param computeQ whether to computeQ + * @return QRDecomposition(Q
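The tall-and-skinny QR idea referenced in the commit (Constantine & Gleich) factors each row block independently, stacks the small R factors, and takes QR of the stack; the final R equals the R of the whole matrix. A pure-Python sketch of that reduction — modified Gram–Schmidt stands in for the per-block factorization, assuming full column rank; this is an illustration, not the RowMatrix implementation:

```python
def qr(a):
    """Reduced QR of an m x n matrix (m >= n, full column rank) via
    modified Gram-Schmidt. Returns (q, r) with a = q @ r and r upper
    triangular with non-negative diagonal."""
    m, n = len(a), len(a[0])
    q = [[0.0] * n for _ in range(m)]
    r = [[0.0] * n for _ in range(n)]
    v = [list(col) for col in zip(*a)]  # working copy of the columns
    for j in range(n):
        r[j][j] = sum(x * x for x in v[j]) ** 0.5
        qj = [x / r[j][j] for x in v[j]]
        for i in range(m):
            q[i][j] = qj[i]
        for k in range(j + 1, n):  # orthogonalize remaining columns
            r[j][k] = sum(qj[i] * v[k][i] for i in range(m))
            v[k] = [v[k][i] - r[j][k] * qj[i] for i in range(m)]
    return q, r

def tsqr_r(blocks):
    """R factor of a tall-and-skinny matrix given as row blocks:
    QR each block, stack the small R factors, QR the stack."""
    stacked = []
    for block in blocks:
        _, r = qr(block)
        stacked.extend(r)
    _, r = qr(stacked)
    return r
```

Because each block's R is only n x n, the second-stage QR is tiny regardless of how many rows (or partitions) the original matrix has — the property the commit exploits for RowMatrix.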
spark git commit: [SPARK-] [MLLIB] minor fix on tokenizer doc
Repository: spark Updated Branches: refs/heads/master d212a3142 - 9c0501c5d [SPARK-] [MLLIB] minor fix on tokenizer doc A trivial fix for the comments of RegexTokenizer. Maybe this is too small, yet I just noticed it and think it can be quite misleading. I can create a jira if necessary. Author: Yuhao Yang hhb...@gmail.com Closes #7791 from hhbyyh/docFix and squashes the following commits: cdf2542 [Yuhao Yang] minor fix on tokenizer doc Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9c0501c5 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9c0501c5 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9c0501c5 Branch: refs/heads/master Commit: 9c0501c5d04d83ca25ce433138bf64df6a14dc58 Parents: d212a31 Author: Yuhao Yang hhb...@gmail.com Authored: Thu Jul 30 08:20:52 2015 -0700 Committer: Xiangrui Meng m...@databricks.com Committed: Thu Jul 30 08:20:52 2015 -0700 -- mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/spark/blob/9c0501c5/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala -- diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala index 0b3af47..248288c 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala @@ -50,7 +50,7 @@ class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[S /** * :: Experimental :: * A regex based tokenizer that extracts tokens either by using the provided regex pattern to split - * the text (default) or repeatedly matching the regex (if `gaps` is true). + * the text (default) or repeatedly matching the regex (if `gaps` is false). * Optional parameters also allow filtering tokens using a minimal length. 
* It returns an array of strings that can be empty. */ - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
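The doc fix above corrects an inverted flag description: with `gaps` true the regex describes the *separators* (split mode), and with `gaps` false it describes the *tokens* themselves (match mode). A short Python sketch of the two modes, with a minimum-length filter analogous to the tokenizer's optional parameter (function name is illustrative):

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True, min_token_length=1):
    """gaps=True: split the text on the pattern (pattern = separators).
    gaps=False: repeatedly match the pattern (pattern = tokens).
    Tokens shorter than min_token_length are dropped."""
    tokens = re.split(pattern, text) if gaps else re.findall(pattern, text)
    return [t for t in tokens if len(t) >= min_token_length]
```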
spark git commit: [SPARK-] [MLLIB] minor fix on tokenizer doc
Repository: spark
Updated Branches: refs/heads/branch-1.4 8dfdca46d -> 020dd30e5

[SPARK-] [MLLIB] minor fix on tokenizer doc

A trivial fix for the comments of RegexTokenizer. Maybe this is too small, yet I just noticed it and think it can be quite misleading. I can create a jira if necessary.

Author: Yuhao Yang hhb...@gmail.com

Closes #7791 from hhbyyh/docFix and squashes the following commits:

cdf2542 [Yuhao Yang] minor fix on tokenizer doc

(cherry picked from commit 9c0501c5d04d83ca25ce433138bf64df6a14dc58)
Signed-off-by: Xiangrui Meng m...@databricks.com

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/020dd30e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/020dd30e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/020dd30e

Branch: refs/heads/branch-1.4
Commit: 020dd30e5173d534d1a2cd5934a66f70bc764459
Parents: 8dfdca4
Author: Yuhao Yang hhb...@gmail.com
Authored: Thu Jul 30 08:20:52 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 08:21:09 2015 -0700

--
 mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/020dd30e/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
index 5f9f57a..4b1700d 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
@@ -50,7 +50,7 @@ class Tokenizer(override val uid: String) extends UnaryTransformer[String, Seq[S
 /**
  * :: Experimental ::
  * A regex based tokenizer that extracts tokens either by using the provided regex pattern to split
- * the text (default) or repeatedly matching the regex (if `gaps` is true).
+ * the text (default) or repeatedly matching the regex (if `gaps` is false).
  * Optional parameters also allow filtering tokens using a minimal length.
  * It returns an array of strings that can be empty.
  */

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
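The two `gaps` modes the corrected comment describes can be illustrated without Spark at all. Below is a plain-Python sketch using the standard `re` module; `regex_tokenize` is a hypothetical helper, not pyspark's `RegexTokenizer`, but it mirrors the documented semantics: with `gaps` true the pattern is used as a delimiter (split), with `gaps` false the pattern is matched repeatedly (findall).

```python
import re

def regex_tokenize(text, pattern, gaps=True, min_token_length=1):
    """Mimic RegexTokenizer semantics: split on the pattern when gaps is
    True, repeatedly match the pattern when gaps is False."""
    if gaps:
        tokens = re.split(pattern, text)
    else:
        tokens = re.findall(pattern, text)
    # Optional filtering of short tokens, as in the Scala doc comment.
    return [t for t in tokens if len(t) >= min_token_length]

print(regex_tokenize("Te,st. punct", r"\s+", gaps=True))   # split on whitespace
print(regex_tokenize("Te,st. punct", r"\w+", gaps=False))  # repeatedly match word chars
```

The misleading part the commit fixes is exactly this inversion: matching (not splitting) is the `gaps = false` case.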
spark git commit: [SPARK-5561] [MLLIB] Generalized PeriodicCheckpointer for RDDs and Graphs
Repository: spark
Updated Branches: refs/heads/master d31c618e3 -> c5815930b

[SPARK-5561] [MLLIB] Generalized PeriodicCheckpointer for RDDs and Graphs

PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs. This PR generalizes it.

For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects. I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684]. It should be useful for other iterative algorithms as well.

Changes I made:
* Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
* Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
* The subclasses for Graphs and RDDs implement those abstract methods.
* I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.

To review this PR, I recommend doing 2 diffs:
(1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
(2) diff between the 2 test suites

CCing andrewor14 in case there are relevant changes to checkpointing. CCing feynmanliang in case you're interested in learning about checkpointing. CCing mengxr for final OK. Thanks all!

Author: Joseph K. Bradley jos...@databricks.com

Closes #7728 from jkbradley/gbt-checkpoint and squashes the following commits:

d41902c [Joseph K. Bradley] Oops, forgot to update an extra time in the checkpointer tests, after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.
32b23b8 [Joseph K. Bradley] fixed usage of checkpointer in lda
0b3dbc0 [Joseph K. Bradley] Changed checkpointer constructor not to take initial data.
568918c [Joseph K. Bradley] Generalized PeriodicGraphCheckpointer to PeriodicCheckpointer, with subclasses for RDDs and Graphs.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/c5815930
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/c5815930
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/c5815930

Branch: refs/heads/master
Commit: c5815930be46a89469440b7c61b59764fb67a54c
Parents: d31c618
Author: Joseph K. Bradley jos...@databricks.com
Authored: Thu Jul 30 07:56:15 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 07:56:15 2015 -0700

--
 .../spark/mllib/clustering/LDAOptimizer.scala   |   6 +-
 .../spark/mllib/impl/PeriodicCheckpointer.scala | 154 +
 .../mllib/impl/PeriodicGraphCheckpointer.scala  | 105 ++-
 .../mllib/impl/PeriodicRDDCheckpointer.scala    |  97 +++
 .../impl/PeriodicGraphCheckpointerSuite.scala   |  16 +-
 .../impl/PeriodicRDDCheckpointerSuite.scala     | 173 +++
 6 files changed, 452 insertions(+), 99 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/c5815930/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala
index 7e75e70..4b90fbd 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala
@@ -142,8 +142,8 @@ final class EMLDAOptimizer extends LDAOptimizer {
     this.k = k
     this.vocabSize = docs.take(1).head._2.size
     this.checkpointInterval = lda.getCheckpointInterval
-    this.graphCheckpointer = new
-      PeriodicGraphCheckpointer[TopicCounts, TokenCount](graph, checkpointInterval)
+    this.graphCheckpointer = new PeriodicGraphCheckpointer[TopicCounts, TokenCount](
+      checkpointInterval, graph.vertices.sparkContext)
     this.globalTopicTotals = computeGlobalTopicTotals()
     this
   }
@@ -188,7 +188,7 @@ final class EMLDAOptimizer extends LDAOptimizer {
     // Update the vertex descriptors with the new counts.
     val newGraph = GraphImpl.fromExistingRDDs(docTopicDistributions, graph.edges)
     graph = newGraph
-    graphCheckpointer.updateGraph(newGraph)
+    graphCheckpointer.update(newGraph)
     globalTopicTotals = computeGlobalTopicTotals()
     this
   }

http://git-wip-us.apache.org/repos/asf/spark/blob/c5815930/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicCheckpointer.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/impl
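The generalization described above — an abstract base class that owns the checkpoint cadence, with subclasses supplying the data-structure-specific operations — can be sketched in a few lines. This is a plain-Python sketch with hypothetical names, not the Scala `PeriodicCheckpointer`; persist/unpersist handling and the actual RDD/Graph I/O are omitted.

```python
from abc import ABC, abstractmethod

class PeriodicCheckpointer(ABC):
    """Every `interval` calls to update(), checkpoint the newest data and
    drop all but the two most recent checkpoints, mirroring the pattern
    the commit message describes for RDD lineages."""

    def __init__(self, interval):
        self.interval = interval
        self.update_count = 0
        self.checkpoint_queue = []

    @abstractmethod
    def do_checkpoint(self, data): ...

    @abstractmethod
    def remove_checkpoint(self, data): ...

    def update(self, new_data):
        self.update_count += 1
        if self.update_count % self.interval == 0:
            self.do_checkpoint(new_data)
            self.checkpoint_queue.append(new_data)
            # Older checkpoints are no longer needed once two newer
            # ones exist, so their files can be removed.
            while len(self.checkpoint_queue) > 2:
                self.remove_checkpoint(self.checkpoint_queue.pop(0))

class ListCheckpointer(PeriodicCheckpointer):
    """Toy subclass standing in for the RDD/Graph subclasses."""
    def __init__(self, interval):
        super().__init__(interval)
        self.checkpointed = []

    def do_checkpoint(self, data):
        self.checkpointed.append(data)

    def remove_checkpoint(self, data):
        self.checkpointed.remove(data)
```

With `interval = 2` and ten updates, checkpoints fire at updates 2, 4, 6, 8, 10, and only the last two survive — which is why only the abstract checkpoint/persist hooks need to change between the Graph and RDD subclasses.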
spark git commit: [SPARK-8671] [ML] Added isotonic regression to the pipeline API.
Repository: spark
Updated Branches: refs/heads/master 0dbd6963d -> 7f7a319c4

[SPARK-8671] [ML] Added isotonic regression to the pipeline API.

Author: martinzapletal zapletal-mar...@email.cz

Closes #7517 from zapletal-martin/SPARK-8671-isotonic-regression-api and squashes the following commits:

8c435c1 [martinzapletal] Review https://github.com/apache/spark/pull/7517 feedback update.
bebbb86 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
b68efc0 [martinzapletal] Added tests for param validation.
07c12bd [martinzapletal] Comments and refactoring.
834fcf7 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8671-isotonic-regression-api
b611fee [martinzapletal] SPARK-8671. Added first version of isotonic regression to pipeline API

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/7f7a319c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/7f7a319c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/7f7a319c

Branch: refs/heads/master
Commit: 7f7a319c4ce07f07a6bd68100cf0a4f1da66269e
Parents: 0dbd696
Author: martinzapletal zapletal-mar...@email.cz
Authored: Thu Jul 30 15:57:14 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 15:57:14 2015 -0700

--
 .../ml/regression/IsotonicRegression.scala      | 144 ++
 .../ml/regression/IsotonicRegressionSuite.scala | 148 +++
 2 files changed, 292 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/7f7a319c/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala

diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala
new file mode 100644
index 000..4ece8cf
--- /dev/null
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/IsotonicRegression.scala
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.param.{Param, ParamMap, BooleanParam}
+import org.apache.spark.ml.util.{SchemaUtils, Identifiable}
+import org.apache.spark.mllib.regression.{IsotonicRegression => MLlibIsotonicRegression}
+import org.apache.spark.mllib.regression.{IsotonicRegressionModel => MLlibIsotonicRegressionModel}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.types.{DoubleType, DataType}
+import org.apache.spark.sql.{Row, DataFrame}
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * Params for isotonic regression.
+ */
+private[regression] trait IsotonicRegressionParams extends PredictorParams {
+
+  /**
+   * Param for weight column name.
+   * TODO: Move weightCol to sharedParams.
+   *
+   * @group param
+   */
+  final val weightCol: Param[String] =
+    new Param[String](this, "weightCol", "weight column name")
+
+  /** @group getParam */
+  final def getWeightCol: String = $(weightCol)
+
+  /**
+   * Param for isotonic parameter.
+   * Isotonic (increasing) or antitonic (decreasing) sequence.
+   * @group param
+   */
+  final val isotonic: BooleanParam =
+    new BooleanParam(this, "isotonic", "isotonic (increasing) or antitonic (decreasing) sequence")
+
+  /** @group getParam */
+  final def getIsotonicParam: Boolean = $(isotonic)
+}
+
+/**
+ * :: Experimental ::
+ * Isotonic regression.
+ *
+ * Currently implemented using parallelized pool adjacent violators algorithm.
+ * Only univariate (single feature) algorithm supported.
+ *
+ * Uses [[org.apache.spark.mllib.regression.IsotonicRegression]].
+ */
+@Experimental
+class IsotonicRegression(override val uid: String)
+  extends Regressor[Double, IsotonicRegression, IsotonicRegressionModel]
+  with IsotonicRegressionParams {
+
+  def this() = this(Identifiable.randomUID("isoReg"))
+
+  /**
+   * Set the isotonic parameter
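The fit behind this estimator is the pool adjacent violators algorithm (PAVA); MLlib runs a parallelized variant, but the sequential idea is short enough to sketch. This is a plain-Python illustration, not the Spark implementation: scan left to right, and whenever a new point violates monotonicity, merge it with the preceding block into its weighted mean, repeating until the block sequence is non-decreasing.

```python
def pava(y, weights=None):
    """Sequential pool adjacent violators: return the non-decreasing
    sequence minimizing weighted squared error to y."""
    if weights is None:
        weights = [1.0] * len(y)
    blocks = []  # each block is [mean, total_weight, count]
    for yi, wi in zip(y, weights):
        blocks.append([yi, wi, 1])
        # Merge adjacent blocks while the last one violates monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])
    out = []
    for mean, _, count in blocks:
        out.extend([mean] * count)
    return out

print(pava([1.0, 3.0, 2.0, 4.0]))  # [1.0, 2.5, 2.5, 4.0]
```

For the antitonic case (`isotonic = false`), a standard trick is to run the same algorithm on the negated labels and negate the result; whether MLlib does exactly that internally is not shown in this diff.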
spark git commit: [SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula
Repository: spark
Updated Branches: refs/heads/master be7be6d4c -> e7905a939

[SPARK-9463] [ML] Expose model coefficients with names in SparkR RFormula

Preview:
```
summary(m)
            features coefficients
1        (Intercept)    1.6765001
2       Sepal_Length    0.3498801
3 Species.versicolor   -0.9833885
4  Species.virginica   -1.0075104
```

Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit

cc mengxr

Author: Eric Liang e...@databricks.com

Closes #7771 from ericl/summary and squashes the following commits:

ccd54c3 [Eric Liang] second pass
a5ca93b [Eric Liang] comments
2772111 [Eric Liang] clean up
70483ef [Eric Liang] fix test
7c247d4 [Eric Liang] Merge branch 'master' into summary
3c55024 [Eric Liang] working
8c539aa [Eric Liang] first pass

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e7905a93
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e7905a93
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e7905a93

Branch: refs/heads/master
Commit: e7905a9395c1a002f50bab29e16a729e14d4ed6f
Parents: be7be6d
Author: Eric Liang e...@databricks.com
Authored: Thu Jul 30 16:15:43 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 16:15:43 2015 -0700

--
 R/pkg/NAMESPACE                                 |  3 ++-
 R/pkg/R/mllib.R                                 | 26 +++
 R/pkg/inst/tests/test_mllib.R                   | 11 
 .../apache/spark/ml/feature/OneHotEncoder.scala | 12 -
 .../org/apache/spark/ml/feature/RFormula.scala  | 12 -
 .../org/apache/spark/ml/r/SparkRWrappers.scala  | 27 ++--
 .../spark/ml/regression/LinearRegression.scala  |  8 --
 .../spark/ml/feature/OneHotEncoderSuite.scala   |  8 +++---
 .../apache/spark/ml/feature/RFormulaSuite.scala | 18 +
 9 files changed, 108 insertions(+), 17 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/e7905a93/R/pkg/NAMESPACE

diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 7f7a8a2..a329e14 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -12,7 +12,8 @@ export(print.jobj)

 # MLlib integration
 exportMethods("glm",
-              "predict")
+              "predict",
+              "summary")

 # Job group lifecycle management methods
 export(setJobGroup,

http://git-wip-us.apache.org/repos/asf/spark/blob/e7905a93/R/pkg/R/mllib.R

diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index 6a8baca..efddcc1 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -71,3 +71,29 @@ setMethod("predict", signature(object = "PipelineModel"),
           function(object, newData) {
             return(dataFrame(callJMethod(object@model, "transform", newData@sdf)))
           })
+
+#' Get the summary of a model
+#'
+#' Returns the summary of a model produced by glm(), similarly to R's summary().
+#'
+#' @param model A fitted MLlib model
+#' @return a list with a 'coefficient' component, which is the matrix of coefficients. See
+#'         summary.glm for more information.
+#' @rdname glm
+#' @export
+#' @examples
+#' \dontrun{
+#' model <- glm(y ~ x, trainingData)
+#' summary(model)
+#'}
+setMethod("summary", signature(object = "PipelineModel"),
+          function(object) {
+            features <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
+                                    "getModelFeatures", object@model)
+            weights <- callJStatic("org.apache.spark.ml.api.r.SparkRWrappers",
+                                   "getModelWeights", object@model)
+            coefficients <- as.matrix(unlist(weights))
+            colnames(coefficients) <- c("Estimate")
+            rownames(coefficients) <- unlist(features)
+            return(list(coefficients = coefficients))
+          })

http://git-wip-us.apache.org/repos/asf/spark/blob/e7905a93/R/pkg/inst/tests/test_mllib.R

diff --git a/R/pkg/inst/tests/test_mllib.R b/R/pkg/inst/tests/test_mllib.R
index 3bef693..f272de7 100644
--- a/R/pkg/inst/tests/test_mllib.R
+++ b/R/pkg/inst/tests/test_mllib.R
@@ -48,3 +48,14 @@ test_that("dot minus and intercept vs native glm", {
   rVals <- predict(glm(Sepal.Width ~ . - Species + 0, data = iris), iris)
   expect_true(all(abs(rVals - vals) < 1e-6), rVals - vals)
 })
+
+test_that("summary coefficients match with native glm", {
+  training <- createDataFrame(sqlContext, iris)
+  stats <- summary(glm(Sepal_Width ~ Sepal_Length + Species, data = training))
+  coefs <- as.vector(stats$coefficients)
+  rCoefs <- as.vector(coef(glm(Sepal.Width ~ Sepal.Length + Species, data = iris)))
+  expect_true(all
spark git commit: [SPARK-9225] [MLLIB] LDASuite needs unit tests for empty documents
Repository: spark
Updated Branches: refs/heads/master 9c0501c5d -> a6e53a9c8

[SPARK-9225] [MLLIB] LDASuite needs unit tests for empty documents

Add unit tests for running LDA with empty documents. Both EMLDAOptimizer and OnlineLDAOptimizer are tested.

feynmanliang

Author: Meihua Wu meihu...@umich.edu

Closes #7620 from rotationsymmetry/SPARK-9225 and squashes the following commits:

3ed7c88 [Meihua Wu] Incorporate reviewer's further comments
f9432e8 [Meihua Wu] Incorporate reviewer's comments
8e1b9ec [Meihua Wu] Merge remote-tracking branch 'upstream/master' into SPARK-9225
ad55665 [Meihua Wu] Add unit tests for running LDA with empty documents

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a6e53a9c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a6e53a9c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a6e53a9c

Branch: refs/heads/master
Commit: a6e53a9c8b24326d1b6dca7a0e36ce6c643daa77
Parents: 9c0501c
Author: Meihua Wu meihu...@umich.edu
Authored: Thu Jul 30 08:52:01 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 08:52:01 2015 -0700

--
 .../spark/mllib/clustering/LDASuite.scala | 40 
 1 file changed, 40 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/a6e53a9c/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala

diff --git a/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala
index b91c7ce..61d2edf 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/clustering/LDASuite.scala
@@ -390,6 +390,46 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext {
     }
   }

+  test("EMLDAOptimizer with empty docs") {
+    val vocabSize = 6
+    val emptyDocsArray = Array.fill(6)(Vectors.sparse(vocabSize, Array.empty, Array.empty))
+    val emptyDocs = emptyDocsArray
+      .zipWithIndex.map { case (wordCounts, docId) =>
+        (docId.toLong, wordCounts)
+    }
+    val distributedEmptyDocs = sc.parallelize(emptyDocs, 2)
+
+    val op = new EMLDAOptimizer()
+    val lda = new LDA()
+      .setK(3)
+      .setMaxIterations(5)
+      .setSeed(12345)
+      .setOptimizer(op)
+
+    val model = lda.run(distributedEmptyDocs)
+    assert(model.vocabSize === vocabSize)
+  }
+
+  test("OnlineLDAOptimizer with empty docs") {
+    val vocabSize = 6
+    val emptyDocsArray = Array.fill(6)(Vectors.sparse(vocabSize, Array.empty, Array.empty))
+    val emptyDocs = emptyDocsArray
+      .zipWithIndex.map { case (wordCounts, docId) =>
+        (docId.toLong, wordCounts)
+    }
+    val distributedEmptyDocs = sc.parallelize(emptyDocs, 2)
+
+    val op = new OnlineLDAOptimizer()
+    val lda = new LDA()
+      .setK(3)
+      .setMaxIterations(5)
+      .setSeed(12345)
+      .setOptimizer(op)
+
+    val model = lda.run(distributedEmptyDocs)
+    assert(model.vocabSize === vocabSize)
+  }
+
 }

 private[clustering] object LDASuite {
spark git commit: [SPARK-9277] [MLLIB] SparseVector constructor must throw an error when declared number of elements less than array length
Repository: spark
Updated Branches: refs/heads/master a6e53a9c8 -> ed3cb1d21

[SPARK-9277] [MLLIB] SparseVector constructor must throw an error when declared number of elements less than array length

Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.

CC MechCoder jkbradley -- I am not sure if a change needs to also happen in the Python API? I didn't see it had any similar checks to begin with, but I don't know it well.

Author: Sean Owen so...@cloudera.com

Closes #7794 from srowen/SPARK-9277 and squashes the following commits:

e8dc31e [Sean Owen] Fix scalastyle
6ffe34a [Sean Owen] Check that SparseVector size is at least as big as the number of indices/values provided. And add tests for constructor checks.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ed3cb1d2
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ed3cb1d2
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ed3cb1d2

Branch: refs/heads/master
Commit: ed3cb1d21c73645c8f6e6ee08181f876fc192e41
Parents: a6e53a9
Author: Sean Owen so...@cloudera.com
Authored: Thu Jul 30 09:19:55 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 09:19:55 2015 -0700

--
 .../org/apache/spark/mllib/linalg/Vectors.scala      |  2 ++
 .../org/apache/spark/mllib/linalg/VectorsSuite.scala | 15 +++
 2 files changed, 17 insertions(+)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ed3cb1d2/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

diff --git a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
index 0cb28d7..23c2c16 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
@@ -637,6 +637,8 @@ class SparseVector(
   require(indices.length == values.length, "Sparse vectors require that the dimension of the" +
     s" indices match the dimension of the values. You provided ${indices.length} indices and " +
     s" ${values.length} values.")
+  require(indices.length <= size, s"You provided ${indices.length} indices and values, " +
+    s"which exceeds the specified vector size ${size}.")

   override def toString: String =
     s"($size,${indices.mkString("[", ",", "]")},${values.mkString("[", ",", "]")})"

http://git-wip-us.apache.org/repos/asf/spark/blob/ed3cb1d2/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala

diff --git a/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala
index 03be411..1c37ea5 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala
@@ -57,6 +57,21 @@ class VectorsSuite extends SparkFunSuite with Logging {
     assert(vec.values === values)
   }

+  test("sparse vector construction with mismatched indices/values array") {
+    intercept[IllegalArgumentException] {
+      Vectors.sparse(4, Array(1, 2, 3), Array(3.0, 5.0, 7.0, 9.0))
+    }
+    intercept[IllegalArgumentException] {
+      Vectors.sparse(4, Array(1, 2, 3), Array(3.0, 5.0))
+    }
+  }
+
+  test("sparse vector construction with too many indices vs size") {
+    intercept[IllegalArgumentException] {
+      Vectors.sparse(3, Array(1, 2, 3, 4), Array(3.0, 5.0, 7.0, 9.0))
+    }
+  }
+
   test("dense to array") {
     val vec = Vectors.dense(arr).asInstanceOf[DenseVector]
     assert(vec.toArray.eq(arr))
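The two constructor checks this commit adds are easy to mirror in a small sketch. `SparseVec` below is a hypothetical plain-Python class, not pyspark's `SparseVector`; it only demonstrates the validation rules: indices and values must have equal length, and their length must not exceed the declared size.

```python
class SparseVec:
    """Minimal sparse vector with the validation described in SPARK-9277."""
    def __init__(self, size, indices, values):
        if len(indices) != len(values):
            raise ValueError(
                f"Got {len(indices)} indices but {len(values)} values; "
                "they must have the same length.")
        if len(indices) > size:
            raise ValueError(
                f"Got {len(indices)} indices/values, which exceeds the "
                f"declared vector size {size}.")
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

# Mirrors the new test cases: a valid vector, then one rejected for
# declaring fewer elements than it carries.
ok = SparseVec(4, [1, 2, 3], [3.0, 5.0, 7.0])
try:
    SparseVec(3, [1, 2, 3, 4], [3.0, 5.0, 7.0, 9.0])
except ValueError as e:
    print("rejected:", e)
```

Failing fast in the constructor is the point of the change: an inconsistent sparse vector would otherwise surface much later as an index-out-of-bounds deep inside a linear algebra routine.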
spark git commit: [MINOR] [MLLIB] fix doc for RegexTokenizer
Repository: spark
Updated Branches: refs/heads/master ed3cb1d21 -> 81464f2a8

[MINOR] [MLLIB] fix doc for RegexTokenizer

This is #7791 for Python. hhbyyh

Author: Xiangrui Meng m...@databricks.com

Closes #7798 from mengxr/regex-tok-py and squashes the following commits:

baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/81464f2a
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/81464f2a
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/81464f2a

Branch: refs/heads/master
Commit: 81464f2a8243c6ae2a39bac7ebdc50d4f60af451
Parents: ed3cb1d
Author: Xiangrui Meng m...@databricks.com
Authored: Thu Jul 30 09:45:17 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 09:45:17 2015 -0700

--
 python/pyspark/ml/feature.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/81464f2a/python/pyspark/ml/feature.py

diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index 86e654d..015e7a9 100644
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -525,7 +525,7 @@ class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol):

     A regex based tokenizer that extracts tokens either by using the
     provided regex pattern (in Java dialect) to split the text
-    (default) or repeatedly matching the regex (if gaps is true).
+    (default) or repeatedly matching the regex (if gaps is false).
     Optional parameters also allow filtering tokens using a minimal
     length.
     It returns an array of strings that can be empty.
spark git commit: [MINOR] [MLLIB] fix doc for RegexTokenizer
Repository: spark
Updated Branches: refs/heads/branch-1.4 020dd30e5 -> 6e85064f4

[MINOR] [MLLIB] fix doc for RegexTokenizer

This is #7791 for Python. hhbyyh

Author: Xiangrui Meng m...@databricks.com

Closes #7798 from mengxr/regex-tok-py and squashes the following commits:

baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer

(cherry picked from commit 81464f2a8243c6ae2a39bac7ebdc50d4f60af451)
Signed-off-by: Xiangrui Meng m...@databricks.com

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6e85064f
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6e85064f
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6e85064f

Branch: refs/heads/branch-1.4
Commit: 6e85064f416bf647ea463bffa621367647862c61
Parents: 020dd30
Author: Xiangrui Meng m...@databricks.com
Authored: Thu Jul 30 09:45:17 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 09:45:41 2015 -0700

--
 python/pyspark/ml/feature.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/6e85064f/python/pyspark/ml/feature.py

diff --git a/python/pyspark/ml/feature.py b/python/pyspark/ml/feature.py
index ddb33f4..7432108 100644
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -456,7 +456,7 @@ class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol):

     A regex based tokenizer that extracts tokens either by using the
     provided regex pattern (in Java dialect) to split the text
-    (default) or repeatedly matching the regex (if gaps is true).
+    (default) or repeatedly matching the regex (if gaps is false).
     Optional parameters also allow filtering tokens using a minimal
     length.
     It returns an array of strings that can be empty.
spark git commit: [SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg
Repository: spark
Updated Branches: refs/heads/master 1afdeb7b4 -> ca71cc8c8

[SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg

This is based on MechCoder 's PR https://github.com/apache/spark/pull/7731. Hopefully it could pass tests. MechCoder I tried to make minimal changes. If this passes Jenkins, we can merge this one first and then try to move `__init__.py` to `local.py` in a separate PR.

Closes #7731

Author: Xiangrui Meng m...@databricks.com

Closes #7746 from mengxr/SPARK-9408 and squashes the following commits:

0e05a3b [Xiangrui Meng] merge master
1135551 [Xiangrui Meng] add a comment for str(...)
c48cae0 [Xiangrui Meng] update tests
173a805 [Xiangrui Meng] move linalg.py to linalg/__init__.py

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca71cc8c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca71cc8c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca71cc8c

Branch: refs/heads/master
Commit: ca71cc8c8b2d64b7756ae697c06876cd18b536dc
Parents: 1afdeb7
Author: Xiangrui Meng m...@databricks.com
Authored: Thu Jul 30 16:57:38 2015 -0700
Committer: Xiangrui Meng m...@databricks.com
Committed: Thu Jul 30 16:57:38 2015 -0700

--
 dev/sparktestsupport/modules.py         |    2 +-
 python/pyspark/mllib/linalg.py          | 1162 --
 python/pyspark/mllib/linalg/__init__.py | 1162 ++
 python/pyspark/sql/types.py             |    2 +-
 4 files changed, 1164 insertions(+), 1164 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/ca71cc8c/dev/sparktestsupport/modules.py

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 030d982..44600cb 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -323,7 +323,7 @@ pyspark_mllib = Module(
         "pyspark.mllib.evaluation",
         "pyspark.mllib.feature",
         "pyspark.mllib.fpm",
-        "pyspark.mllib.linalg",
+        "pyspark.mllib.linalg.__init__",
         "pyspark.mllib.random",
         "pyspark.mllib.recommendation",
         "pyspark.mllib.regression",

http://git-wip-us.apache.org/repos/asf/spark/blob/ca71cc8c/python/pyspark/mllib/linalg.py

diff --git a/python/pyspark/mllib/linalg.py b/python/pyspark/mllib/linalg.py
deleted file mode 100644
index 334dc8e..000
--- a/python/pyspark/mllib/linalg.py
+++ /dev/null
@@ -1,1162 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-"""
-MLlib utilities for linear algebra. For dense vectors, MLlib
-uses the NumPy C{array} type, so you can simply pass NumPy arrays
-around. For sparse vectors, users can construct a L{SparseVector}
-object from MLlib or pass SciPy C{scipy.sparse} column vectors if
-SciPy is available in their environment.
-"""
-
-import sys
-import array
-
-if sys.version >= '3':
-    basestring = str
-    xrange = range
-    import copyreg as copy_reg
-    long = int
-else:
-    from itertools import izip as zip
-    import copy_reg
-
-import numpy as np
-
-from pyspark.sql.types import UserDefinedType, StructField, StructType, ArrayType, DoubleType, \
-    IntegerType, ByteType, BooleanType
-
-
-__all__ = ['Vector', 'DenseVector', 'SparseVector', 'Vectors',
-           'Matrix', 'DenseMatrix', 'SparseMatrix', 'Matrices']
-
-
-if sys.version_info[:2] == (2, 7):
-    # speed up pickling array in Python 2.7
-    def fast_pickle_array(ar):
-        return array.array, (ar.typecode, ar.tostring())
-    copy_reg.pickle(array.array, fast_pickle_array)
-
-
-# Check whether we have SciPy. MLlib works without it too, but if we have it, some methods,
-# such as _dot and _serialize_double_vector, start to support scipy.sparse matrices.
-
-try:
-    import scipy.sparse
-    _have_scipy = True
-except:
-    # No SciPy in environment, but that's okay
-    _have_scipy = False
-
-
-def _convert_to_vector(l):
-    if isinstance(l, Vector):
-        return l
-    elif type(l