[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression_doc

2016-05-31 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13381#discussion_r65206320
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.regression.IsotonicRegression
+// $example off$
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.types.{DoubleType, StructType}
+
+/**
+ * An example demonstrating Isotonic Regression.
+ * Run with
+ * {{{
+ * bin/run-example ml.IsotonicRegressionExample
+ * }}}
+ */
+object IsotonicRegressionExample {
+
+  def main(args: Array[String]): Unit = {
+
+// Creates a SparkSession.
+val spark = SparkSession
+  .builder
+  .appName(s"${this.getClass.getSimpleName}")
+  .getOrCreate()
+// $example on$
+val ir = new IsotonicRegression().setIsotonic(true)
+  .setLabelCol("label").setFeaturesCol("features").setWeightCol("weight")
+
+val dataReader = spark.read
+dataReader.schema(new StructType().add("label", DoubleType).add("features", DoubleType))
+
+var data = dataReader.csv("data/mllib/sample_isotonic_regression_data.txt")
--- End diff --

Ditto, use libsvm format.
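A sketch of what the suggested change could look like, assuming a libsvm-format copy of the sample data (the file name below is illustrative):

```scala
// Hypothetical sketch: read the data with the "libsvm" data source instead of csv.
// It yields a DataFrame with "label" and "features" columns, so the explicit
// schema and the csv reader are no longer needed.
val data = spark.read.format("libsvm")
  .load("data/mllib/sample_isotonic_regression_libsvm_data.txt")

val model = new IsotonicRegression().fit(data)
model.transform(data).show()
```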





[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression_doc

2016-05-31 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13381#discussion_r65207453
  
--- Diff: docs/ml-classification-regression.md ---
@@ -685,6 +685,88 @@ The implementation matches the result from R's survival function
 
 
 
+## Isotonic regression
+[Isotonic regression](http://en.wikipedia.org/wiki/Isotonic_regression)
+belongs to the family of regression algorithms. Formally, isotonic regression is a problem where,
+given a finite set of real numbers `$Y = {y_1, y_2, ..., y_n}$` representing observed responses
+and `$X = {x_1, x_2, ..., x_n}$` the unknown response values to be fitted,
+we must find a function that minimises
+
+`\begin{equation}
+  f(x) = \sum_{i=1}^n w_i (y_i - x_i)^2
+\end{equation}`
+
+with respect to a complete order subject to
+`$x_1\le x_2\le ...\le x_n$`, where `$w_i$` are positive weights.
+The resulting function is called isotonic regression and it is unique.
+It can be viewed as a least squares problem under an order restriction.
+Essentially isotonic regression is a
+[monotonic function](http://en.wikipedia.org/wiki/Monotonic_function)
+best fitting the original data points.
+
+`spark.ml` supports a
+[pool adjacent violators algorithm](http://doi.org/10.1198/TECH.2010.10111)
+which uses an approach to
+[parallelizing isotonic regression](http://doi.org/10.1007/978-3-642-99789-1_10).
+The training input is an RDD of tuples of three double values that represent
+label, feature and weight, in this order. Additionally, the IsotonicRegression algorithm has one
+optional parameter called $isotonic$, defaulting to true.
+This argument specifies whether the isotonic regression is
+isotonic (monotonically increasing) or antitonic (monotonically decreasing).
+
+Training returns an IsotonicRegressionModel that can be used to predict
+labels for both known and unknown features. The result of isotonic regression
+is treated as a piecewise linear function. The rules for prediction therefore are:
+
+* If the prediction input exactly matches a training feature
+  then the associated prediction is returned. In case there are multiple predictions with the same
+  feature then one of them is returned. Which one is undefined
+  (same as java.util.Arrays.binarySearch).
+* If the prediction input is lower or higher than all training features
+  then the prediction with the lowest or highest feature is returned respectively.
+  In case there are multiple predictions with the same feature
+  then the lowest or highest is returned respectively.
+* If the prediction input falls between two training features then the prediction is treated
+  as a piecewise linear function and the interpolated value is calculated from the
+  predictions of the two closest features. In case there are multiple values
+  with the same feature then the same rules as in the previous point are used.
+
+### Examples
+
+
+
+Data are read from a file where each line has the format label,feature,
+e.g. 4710.28,500.00. The data are split into training and test sets.
+A model is created using the training set, and a mean squared error is calculated
+from the predicted labels and the real labels in the test set.
+
+Refer to the [`IsotonicRegression` Scala docs](api/scala/index.html#org.apache.spark.ml.regression.IsotonicRegression) for details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala %}
+
+
+Data are read from a file where each line has the format label,feature,
+e.g. 4710.28,500.00. The data are split into training and test sets.
+A model is created using the training set, and a mean squared error is calculated
+from the predicted labels and the real labels in the test set.
+
+Refer to the [`IsotonicRegression` Java docs](api/java/org/apache/spark/ml/regression/IsotonicRegression.html) for details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaIsotonicRegressionExample.java %}
+
+
+Data are read from a file where each line has the format label,feature
--- End diff --

This section is identical across the different languages, so it's better to move it out of the ```div``` tags and eliminate the repetition.
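As an aside, the third prediction rule in the quoted section is plain linear interpolation between the two closest training features; a minimal self-contained sketch of the rules (hypothetical helper, following the `java.util.Arrays.binarySearch` convention the doc mentions):

```scala
// Sketch of the prediction rules described in the quoted doc (hypothetical helper):
// `boundaries` are the sorted training features, `predictions` their fitted values.
def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
  val i = java.util.Arrays.binarySearch(boundaries, x)
  if (i >= 0) {
    predictions(i)                        // rule 1: exact match
  } else {
    val ins = -i - 1                      // insertion point for x
    if (ins == 0) {
      predictions.head                    // rule 2: below all training features
    } else if (ins == boundaries.length) {
      predictions.last                    // rule 2: above all training features
    } else {
      val (x0, x1) = (boundaries(ins - 1), boundaries(ins))
      val (y0, y1) = (predictions(ins - 1), predictions(ins))
      y0 + (y1 - y0) * (x - x0) / (x1 - x0) // rule 3: linear interpolation
    }
  }
}
```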





[GitHub] spark pull request: [SPARK-15608][ml][doc] add_isotonic_regression_doc

2016-05-31 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13381#discussion_r65207910
  
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaIsotonicRegressionExample.java ---
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.examples.ml;
+
+// $example on$
+
+import java.util.List;
+
+import org.apache.spark.ml.regression.IsotonicRegression;
+import org.apache.spark.ml.regression.IsotonicRegressionModel;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.types.StructType;
+
+// $example off$
+
+public class JavaIsotonicRegressionExample {
+  public static void main(String[] args) {
+  // Create a SparkSession.
+  SparkSession spark = SparkSession
+  .builder()
+  .appName("JavaIsotonicRegression")
--- End diff --

JavaIsotonicRegressionExample








[GitHub] spark pull request: [SPARK-15605] [ML] [Examples] Remove JavaDeveloperApiExa...

2016-05-31 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/13353
  
cc @mengxr @jkbradley 





[GitHub] spark pull request: [SPARK-13590] [ML] [Doc] Document spark.ml LiR, LoR and ...

2016-05-31 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/12731
  
ping @mengxr 





[GitHub] spark pull request: [SPARK-15177] [SparkR] [ML] SparkR 2.0 QA: New R APIs an...

2016-05-31 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/13023
  
@vectorijk There is a separate PR focused on updating the machine learning section of the SparkR user guide; see #13285. Thanks.





[GitHub] spark pull request: [SPARK-15587] [ML] ML 2.0 QA: Scala APIs audit for ml.fe...

2016-05-31 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13410#discussion_r65192762
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -1481,6 +1474,10 @@ class StandardScaler(JavaEstimator, HasInputCol, HasOutputCol, JavaMLReadable, J
 Standardizes features by removing the mean and scaling to unit variance using column summary
 statistics on the samples in the training set.

+The "unit std" is computed using the `corrected sample standard deviation \
+<https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation>`_,
+which is computed as the square root of the unbiased sample variance.
+
--- End diff --

Sync the docs with the Scala one.
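For reference, the corrected sample standard deviation mentioned in the quoted doc is the square root of the unbiased sample variance,

`\begin{equation}
  s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}
\end{equation}`

i.e. it divides by `$N-1$` rather than `$N$`, where `$\bar{x}$` is the sample mean.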





[GitHub] spark pull request #13675: [SPARK-15957] [ML] RFormula supports forcing to i...

2016-06-14 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/13675

[SPARK-15957] [ML] RFormula supports forcing to index label

## What changes were proposed in this pull request?
Add a param so users can force the label column to be indexed, whether it is numeric or string. For classification algorithms, we force label indexing by setting it to true (a usage sketch follows below).
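A sketch of how the new param might be used (the setter name is assumed from the PR title, not confirmed here):

```scala
// Hypothetical usage sketch of the proposed RFormula param:
val formula = new RFormula()
  .setFormula("clicked ~ country + hour")
  .setForceIndexLabel(true) // index the label even when it is already numeric
val output = formula.fit(dataset).transform(dataset)
```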

## How was this patch tested?
Unit tests.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-15957

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13675.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13675


commit 24ef9baa3b2aa110ce447d935959855fcb954a8e
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-06-14T23:58:00Z

RFormula supports forcing to index label







[GitHub] spark issue #13662: [SPARK-15945] [MLLIB] Conversion between old/new vector ...

2016-06-14 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13662
  
It looks like the merge script is not happy; I will retry later.





[GitHub] spark issue #13662: [SPARK-15945] [MLLIB] Conversion between old/new vector ...

2016-06-14 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13662
  
LGTM, merged into master and branch-2.0. Thanks!





[GitHub] spark issue #13675: [SPARK-15957] [ML] RFormula supports forcing to index la...

2016-06-16 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13675
  
cc @mengxr @jkbradley 





[GitHub] spark issue #13381: [SPARK-15608][ml][examples][doc] add examples and docume...

2016-06-16 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13381
  
Merged into master and branch-2.0. Thanks!








[GitHub] spark issue #13731: [SPARK-15946] [MLLIB] Conversion between old/new vector ...

2016-06-17 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13731
  
LGTM, merged into master and branch-2.0. Thanks!





[GitHub] spark issue #13641: [SPARK-10258][DOC][ML] Add @Since annotations to ml.feat...

2016-06-15 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13641
  
@MLnick I found you did not add ```@Since``` annotations for all param definitions; is this expected? I think we should add them.
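For example, an annotated param definition might look like this (a sketch with a made-up param, not code from the PR):

```scala
// Sketch: annotate the param definition itself, not only its getter and setter.
@Since("2.0.0")
final val threshold: DoubleParam = new DoubleParam(this, "threshold",
  "threshold used to binarize continuous features")

@Since("2.0.0")
def getThreshold: Double = $(threshold)
```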





[GitHub] spark pull request #13641: [SPARK-10258][DOC][ML] Add @Since annotations to ...

2016-06-15 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13641#discussion_r67245508
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MaxAbsScaler.scala ---
@@ -88,7 +91,7 @@ class MaxAbsScaler @Since("2.0.0") (override val uid: String)
   override def copy(extra: ParamMap): MaxAbsScaler = defaultCopy(extra)
 }
 
-@Since("1.6.0")
+@Since("2.0.0")
 object MaxAbsScaler extends DefaultParamsReadable[MaxAbsScaler] {
 
   @Since("1.6.0")
--- End diff --

Lines 174 and 177 should also be ```@Since("2.0.0")```.
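I.e., roughly (a sketch of the corrected annotations; `load` follows the `DefaultParamsReadable` signature):

```scala
@Since("2.0.0")
object MaxAbsScaler extends DefaultParamsReadable[MaxAbsScaler] {

  @Since("2.0.0")
  override def load(path: String): MaxAbsScaler = super.load(path)
}
```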





[GitHub] spark pull request #13641: [SPARK-10258][DOC][ML] Add @Since annotations to ...

2016-06-15 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13641#discussion_r67245105
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MaxAbsScaler.scala ---
@@ -88,7 +91,7 @@ class MaxAbsScaler @Since("2.0.0") (override val uid: String)
   override def copy(extra: ParamMap): MaxAbsScaler = defaultCopy(extra)
 }
 
-@Since("1.6.0")
+@Since("2.0.0")
 object MaxAbsScaler extends DefaultParamsReadable[MaxAbsScaler] {
 
   @Since("1.6.0")
--- End diff --

Not caused by this PR, but this looks like a typo, since ```MaxAbsScaler``` was added after 1.6.0.





[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

2016-01-15 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10306#issuecomment-171942083
  
@mengxr Thanks for the prompt. I will check my environment and re-run the 
test.





[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

2016-01-18 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/10806

[SPARK-8519][SPARK-11560][SPARK-11559] [ML] [MLlib] Optimize KMeans implementation

* Use BLAS Level 3 matrix-matrix multiplications to compute pairwise distances in k-means (see the sketch below).
* Remove the runs-related code completely; it will have no effect after this change.
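The idea behind the BLAS Level 3 approach, sketched below (illustrative Breeze code, not the PR's actual implementation): squared Euclidean distances decompose as ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 a_i.b_j, so the cross terms for all (i, j) pairs come from a single matrix-matrix multiplication:

```scala
import breeze.linalg.{Axis, DenseMatrix => BDM, sum}

// Pairwise squared distances between the rows of a (n x d) and b (m x d);
// the cross terms are computed by one gemm instead of n*m separate dot calls.
def pairwiseSqDist(a: BDM[Double], b: BDM[Double]): BDM[Double] = {
  val aNorms = sum(a *:* a, Axis._1) // ||a_i||^2 for each row of a
  val bNorms = sum(b *:* b, Axis._1) // ||b_j||^2 for each row of b
  val cross = a * b.t                // single gemm: all dot products a_i . b_j
  BDM.tabulate(a.rows, b.rows)((i, j) => aNorms(i) + bNorms(j) - 2.0 * cross(i, j))
}
```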

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-8519-new

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10806.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10806


commit 28fa06a449f27e8d51e8e737963d1bf982d731e3
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-01-18T14:48:06Z

Initial draft of KMeans optimization

commit e929e355281b4d13dae04650d11b6b510de72e26
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-01-18T14:51:59Z

Disable one test of PIC







[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

2016-01-18 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10306#issuecomment-172558579
  
@mengxr I found a misconfiguration in my test environment and fixed it, thanks!
Now ```gemm``` is about 20-30 times faster than ```axpy/dot``` in the updated test cases.
```Scala
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
val n = 3000
val count = 10
val random = new Random()

val a = Vectors.dense(Array.fill(n)(random.nextDouble()))
val aa = Array.fill(n)(a)
val b = Vectors.dense(Array.fill(n)(random.nextDouble()))
val bb = Array.fill(n)(b)

val a1 = new DenseMatrix(n, n, aa.flatMap(_.toArray), true)
val b1 = new DenseMatrix(n, n, bb.flatMap(_.toArray), false)
val c1 = Matrices.zeros(n, n).asInstanceOf[DenseMatrix]

var total1 = 0.0

// Trial runs
for (i <- 0 until 10) {
  gemm(2.0, a1, b1, 2.0, c1)
}

for (i <- 0 until count) {
  val start = System.nanoTime()
  gemm(2.0, a1, b1, 2.0, c1)
  total1 += (System.nanoTime() - start)/1e9
}
total1 = total1 / count
println("gemm elapsed time: = %.3f".format(total1) + " seconds.")

// Trial runs
for (m <- 0 until 10) {
  for (i <- 0 until n; j <- 0 until n) {
dot(bb(j), aa(i))
  }
}

var total2 = 0.0
for (m <- 0 until count) {
  val start = System.nanoTime()
  for (i <- 0 until n; j <- 0 until n) {
//  axpy(1.0, bb(j), aa(i))
dot(bb(j), aa(i))
  }
  total2 += (System.nanoTime() - start)/1e9
}
total2 = total2 / count
println("dot elapsed time: = %.3f".format(total2) + " seconds.")
```
The output is:
```
com.github.fommil.netlib.NativeSystemBLAS
gemm elapsed time: = 1.022 seconds.
dot elapsed time: = 29.017 seconds.
```





[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

2016-01-18 Thread yanboliang
Github user yanboliang closed the pull request at:

https://github.com/apache/spark/pull/10306





[GitHub] spark pull request: [SPARK-8519][SPARK-11560][SPARK-11559] [ML] [M...

2016-01-18 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10306#issuecomment-172562607
  
@mengxr I have a new and improved implementation for this issue at #10806; let's move the discussion there. I will close this PR now.





[GitHub] spark pull request: [SPARK-12645] [SparkR] SparkR support hash fun...

2016-01-15 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10597#issuecomment-171910873
  
@shivaram Just like @felixcheung commented, the ```hash``` function was added only in 2.0.0, so reverting it from branch-1.6 will fix the broken test.





[GitHub] spark pull request: [SPARK-12905] [ML] [PySpark] PCAModel return e...

2016-01-19 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/10830

[SPARK-12905] [ML] [PySpark] PCAModel return eigenvalues for PySpark

Make ```PCAModel``` return eigenvalues for PySpark.
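For context, the Scala side already exposes this property; a sketch of the equivalent Scala usage (`df` and column names assumed):

```scala
import org.apache.spark.ml.feature.PCA

val pcaModel = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)
// Proportion of variance explained by each principal component:
println(pcaModel.explainedVariance)
```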

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-12905

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10830.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10830


commit 0fedf916676d447bc23b36e60c323fbfae94a1e1
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-01-19T08:52:28Z

PCAModel return eigenvalues for PySpark







[GitHub] spark pull request: [SPARK-12903] [SparkR] Add covar_samp and cova...

2016-01-19 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10829#issuecomment-172784554
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-12903] [SparkR] Add covar_samp and cova...

2016-01-19 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/10829

[SPARK-12903] [SparkR] Add covar_samp and covar_pop for SparkR

Add ```covar_samp``` and ```covar_pop``` for SparkR.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-12903

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10829.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10829


commit 7c3e718f1bce57baea852b909e6f50500343dd36
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-01-19T08:08:57Z

Add covar_samp and covar_pop for SparkR







[GitHub] spark issue #13378: [SPARK-15643] [Doc] [ML] Update spark.ml and spark.mllib...

2016-06-27 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13378
  
@MLnick I have updated the new deprecations in this PR, as listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-15643?focusedCommentId=15343059&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15343059). As for the vector conversions issue, I think it fits better in your section. Thanks!





[GitHub] spark pull request #13935: [SPARK-16242] [MLlib] [PySpark] Conversion betwee...

2016-06-27 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/13935

[SPARK-16242] [MLlib] [PySpark] Conversion between old/new matrix columns in a DataFrame (Python)

## What changes were proposed in this pull request?
This PR implements Python wrappers for #13888 to convert old/new matrix columns in a DataFrame.

## How was this patch tested?
Doctests in Python.
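The Scala-side utilities being wrapped (added by #13888) look roughly like this in use (a sketch; `dataset` assumed):

```scala
import org.apache.spark.mllib.util.MLUtils

// Convert all old mllib matrix columns in `dataset` to the new ml.linalg type:
val converted = MLUtils.convertMatrixColumnsToML(dataset)
// Convert selected columns back to the old mllib type:
val reverted = MLUtils.convertMatrixColumnsFromML(converted, "features")
```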

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-16242

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13935.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13935


commit 11789339b0eab023bca61e24ac5e73f715a2d97a
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-06-28T03:09:58Z

Conversion between old/new matrix columns in a DataFrame (Python)







[GitHub] spark issue #13935: [SPARK-16242] [MLlib] [PySpark] Conversion between old/n...

2016-06-27 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13935
  
cc @hhbyyh @mengxr 





[GitHub] spark issue #13937: [SPARK-16245] [ML] model loading backward compatibility ...

2016-06-27 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/13937
  
cc @hhbyyh @mengxr 





[GitHub] spark pull request #13937: [SPARK-16245] [ML] model loading backward compati...

2016-06-27 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13937#discussion_r68697383
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala ---
@@ -206,24 +206,21 @@ object PCAModel extends MLReadable[PCAModel] {
 override def load(path: String): PCAModel = {
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
 
-  // explainedVariance field is not present in Spark <= 1.6
-  val versionRegex = "([0-9]+)\\.([0-9]+).*".r
-  val hasExplainedVariance = metadata.sparkVersion match {
-case versionRegex(major, minor) =>
-  major.toInt >= 2 || (major.toInt == 1 && minor.toInt > 6)
-case _ => false
-  }
+  val versionRegex = "([0-9]+)\\.(.+)".r
+  val versionRegex(major, _) = metadata.sparkVersion
 
   val dataPath = new Path(path, "data").toString
-  val model = if (hasExplainedVariance) {
+  val model = if (major.toInt >= 2) {
 val Row(pc: DenseMatrix, explainedVariance: DenseVector) =
   sparkSession.read.parquet(dataPath)
 .select("pc", "explainedVariance")
 .head()
 new PCAModel(metadata.uid, pc, explainedVariance)
   } else {
-    val Row(pc: DenseMatrix) = sparkSession.read.parquet(dataPath).select("pc").head()
-    new PCAModel(metadata.uid, pc, Vectors.dense(Array.empty[Double]).asInstanceOf[DenseVector])
+    // explainedVariance field is not present and we use the old matrix in Spark <= 2.0
+    val Row(pc: OldDenseMatrix) = sparkSession.read.parquet(dataPath).select("pc").head()
+    new PCAModel(metadata.uid, pc.asML,
+      Vectors.dense(Array.empty[Double]).asInstanceOf[DenseVector])
--- End diff --

Here we combine the ```explainedVariance``` field issue and the old matrix 
issue together to handle backward compatibility.
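The version check in the quoted diff relies on Scala's regex extractor binding the capture groups directly; a minimal illustration:

```scala
// The extractor pattern binds the capture groups; only the major version is used.
val versionRegex = "([0-9]+)\\.(.+)".r
val versionRegex(major, _) = "2.0.0-SNAPSHOT"
assert(major.toInt >= 2) // pre-2.0 models take the fallback branch instead
```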





[GitHub] spark pull request #13937: [SPARK-16245] [ML] model loading backward compati...

2016-06-27 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/13937

[SPARK-16245] [ML] model loading backward compatibility for ml.feature.PCA

## What changes were proposed in this pull request?
Model loading backward compatibility for ml.feature.PCA.

## How was this patch tested?
Existing unit tests and a manual test for loading old models.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-16245

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13937.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13937


commit 5246bcfa1ba510c281c456b0f61bf32f70d10174
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-06-28T04:42:41Z

model loading backward compatibility for ml.feature.PCA







[GitHub] spark pull request #13023: [SPARK-15177] [SparkR] [ML] SparkR 2.0 QA: New R ...

2016-06-24 Thread yanboliang
Github user yanboliang closed the pull request at:

https://github.com/apache/spark/pull/13023





[GitHub] spark pull request #13888: [SPARK-16187] [ML] Implement util method for ML M...

2016-06-25 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/13888#discussion_r68486013
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala ---
@@ -309,8 +309,8 @@ object MLUtils extends Logging {
   }
 
   /**
-   * Converts vector columns in an input Dataset to the [[org.apache.spark.ml.linalg.Vector]] type
-   * from the new [[org.apache.spark.mllib.linalg.Vector]] type under the `spark.ml` package.
+   * Converts vector columns in an input Dataset to the [[org.apache.spark.mllib.linalg.Vector]]
+   * type from the new [[org.apache.spark.ml.linalg.Vector]] type under the `spark.ml` package.
* @param dataset input dataset
--- End diff --

The original annotation is correct; it isn't necessary to revise it.





[GitHub] spark pull request: [SPARK-13037][ML][PySpark] PySpark ml.recommen...

2016-02-05 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11044#issuecomment-180395685
  
LGTM





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-09 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r52330009
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+    "the name of family which is a description of the error distribution to be used in the model",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", "cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the model link function",
+    ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+    require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) -> $(link)),
+      s"Generalized Linear Regression with ${$(family)} family does not support ${$(link)} " +
+        s"link function.")
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model ([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0")
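For context, a hypothetical usage sketch of the estimator being defined (`dataset` assumed; only `setFamily`/`setLink` appear in the quote, the rest is illustrative):

```scala
// Hypothetical usage of the GeneralizedLinearRegression estimator sketched above:
val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")

val model = glr.fit(dataset)
println(s"Coefficients: ${model.coefficients}, Intercept: ${model.intercept}")
```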

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-09 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r52331693
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-09 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/11136

[SPARK-12811] [ML] Estimator for Generalized Linear Models (GLMs)

Estimator for Generalized Linear Models (GLMs).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-12811

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11136.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11136


commit 5af604e180d92a75b18b30717d6516d5ee8135bd
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-02-09T15:37:27Z

Initial version of Generalized Linear Regression







[GitHub] spark pull request: [SPARK-12962] [SQL] [PySpark] PySpark support ...

2016-02-04 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10876#issuecomment-180175154
  
ping @davies 





[GitHub] spark pull request: [SPARK-12974] [ML] [PySpark] Add Python API fo...

2016-02-12 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10889#issuecomment-183240342
  
@mengxr Actually, I have already updated this PR after #10216 got merged.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r52610253
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"the name of family which is a description of the error distribution 
to be used in the model",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the 
model link function",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+
require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) 
-> $(link)),
+  s"Generalized Linear Regression with ${$(family)} family does not 
support ${$(link)} " +
+s"link function.")
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") 
override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, 
GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0&quo

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r52611868
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"the name of family which is a description of the error distribution 
to be used in the model",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the 
model link function",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+
require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) 
-> $(link)),
+  s"Generalized Linear Regression with ${$(family)} family does not 
support ${$(link)} " +
+s"link function.")
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalized_linear_model]])
+ * specified by giving a symbolic description of the linear predictor and
+ * a description of the error distribution.
+ */
+@Experimental
+@Since("2.0.0")
+class GeneralizedLinearRegression @Since("2.0.0") (@Since("2.0.0") 
override val uid: String)
+  extends Regressor[Vector, GeneralizedLinearRegression, 
GeneralizedLinearRegressionModel]
+  with GeneralizedLinearRegressionParams with Logging {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("genLinReg"))
+
+  /**
+   * Set the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * @group setParam
+   */
+  @Since("2.0.0")
+  def setFamily(value: String): this.type = set(family, value)
+
+  /**
+   * Set the name of the model link function.
+   * @group setParam
+   */
+  @Since("2.0.0&quo

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r52611593
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,472 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionParams extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"the name of family which is a description of the error distribution 
to be used in the model",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilies.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of the model link function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "the name of the 
model link function",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinks.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+
require(GeneralizedLinearRegression.supportedFamilyLinkPairs.contains($(family) 
-> $(link)),
--- End diff --

Good point! But we cannot check ```isSet(link)``` in the setter for family, because users may set family before setting link, and that would produce a mistake. We can check ```isSet(link)``` at the start of ```train``` instead.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-11 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-182905997
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-11939] [ML] [PySpark] PySpark support m...

2016-01-27 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10469#issuecomment-175472122
  
@jkbradley Thanks for your comments! I have made ```MLReadable``` and ```MLWritable``` more general and not specific to Java wrappers, and addressed all comments except for setting the ```Param.parent```. I left inline comments in the threads above.
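A minimal sketch of that more general shape, assuming only that a subclass can produce a writer object with a ```save(path)``` method. The ```JavaMLWritable``` subclass here is illustrative, and ```MLWriter``` refers to the class in the ```util.py``` diff elsewhere in this thread:
```python
class MLWritable(object):
    """Mixin for ML instances that can be saved; deliberately not tied to Java wrappers."""

    def write(self):
        """Returns a writer for this ML instance; concrete subclasses override this."""
        raise NotImplementedError("MLWritable is not implemented for type: %r" % type(self))

    def save(self, path):
        """Saves this ML instance to the given path, a shortcut of write().save(path)."""
        self.write().save(path)


class JavaMLWritable(MLWritable):
    """Illustrative Java-wrapper flavor of the mixin."""

    def write(self):
        # MLWriter wraps instance._java_obj.write(), so all Java
        # specifics stay in this subclass, not in the mixin above.
        return MLWriter(self)
```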





[GitHub] spark pull request: [SPARK-11939] [ML] [PySpark] PySpark support m...

2016-01-26 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10469#discussion_r50950853
  
--- Diff: python/pyspark/ml/wrapper.py ---
@@ -82,13 +71,16 @@ def _transfer_params_to_java(self):
 pair = self._make_java_param_pair(param, paramMap[param])
 self._java_obj.set(pair)
 
-def _transfer_params_from_java(self):
+def _transfer_params_from_java(self, withParent=False):
--- End diff --

I tried putting the setting of the ```Param.parent``` field into ```load()```, which would add a code snippet like this to ```JavaMLReader.load()```:
```python
for param in self._instance.params:
    value = self._instance._paramMap[param]
    self._instance._paramMap.pop(param)
    param.parent = self._instance.uid
    self._instance._paramMap[param] = value
```
It first pops elements from the dict and then pushes new ones, which I think is inefficient. And this code would operate on ```_paramMap```, which is the ML instance's internal variable. So I vote to do this work in ```_transfer_params_from_java```, or we can add a new function there to do the same work. I'm open to hearing your thoughts. @jkbradley
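By contrast, a rough sketch (not the merged implementation) of doing the re-parenting inside ```_transfer_params_from_java``` itself, so that ```load()``` needs no pop-and-reinsert pass; the ```_java2py``` import path is an assumption, and this is a method of the wrapper class in ```wrapper.py```:
```python
from pyspark import SparkContext
from pyspark.ml.common import _java2py  # assumed import path for _java2py

def _transfer_params_from_java(self, withParent=False):
    """Transfers params from the companion Java object, optionally re-parenting them."""
    sc = SparkContext._active_spark_context
    for param in self.params:
        if self._java_obj.hasParam(param.name):
            java_param = self._java_obj.getParam(param.name)
            value = _java2py(sc, self._java_obj.getOrDefault(java_param))
            if withParent:
                # Claim ownership while copying the value over.
                param.parent = self.uid
            self._paramMap[param] = value
```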





[GitHub] spark pull request: [SPARK-11939] [ML] [PySpark] PySpark support m...

2016-01-26 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10469#discussion_r50948067
  
--- Diff: python/pyspark/ml/util.py ---
@@ -52,3 +71,141 @@ def _randomUID(cls):
 concatenates the class name, "_", and 12 random hex chars.
 """
 return cls.__name__ + "_" + uuid.uuid4().hex[12:]
+
+
+@inherit_doc
+class MLWriter(object):
+"""
+Abstract class for utility classes that can save ML instances.
+
+.. versionadded:: 2.0.0
+"""
+
+def __init__(self, instance):
+self._jwrite = instance._java_obj.write()
+
+@since("2.0.0")
+def save(self, path):
+"""Saves the ML instances to the input path."""
+self._jwrite.save(path)
+
+@since("2.0.0")
+def overwrite(self):
+"""Overwrites if the output path already exists."""
+self._jwrite.overwrite()
+return self
+
+@since("2.0.0")
+def context(self, sqlContext):
+"""Sets the SQL context to use for saving."""
+self._jwrite.context(sqlContext._ssql_ctx)
+return self
+
+
+@inherit_doc
+class MLWritable(object):
+"""
+Mixin for ML instances that provide MLWriter through their Scala
+implementation.
+
+.. versionadded:: 2.0.0
+"""
+
+@since("2.0.0")
--- End diff --

Yes, I removed the ```Since``` annotations.





[GitHub] spark pull request: [SPARK-13047][PYSPARK][ML] Pyspark Params.hasP...

2016-01-27 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10962#discussion_r51087340
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -152,13 +152,17 @@ def isDefined(self, param):
 return self.isSet(param) or self.hasDefault(param)
 
 @since("1.4.0")
-def hasParam(self, paramName):
+def hasParam(self, param):
 """
-Tests whether this instance contains a param with a given
-(string) name.
+Tests whether this instance contains a param.
 """
-param = self._resolveParam(paramName)
-return param in self.params
+if isinstance(param, Param):
+return hasattr(self, param.name)
--- End diff --

+1 for accepting only strings. If there are no strong reasons otherwise, keeping consistent with Scala is the best choice.
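A minimal sketch of the string-only variant (details such as Python 2 ```basestring``` handling omitted):
```python
from pyspark.ml.param import Param

def hasParam(self, paramName):
    """Tests whether this instance contains a param with the given (string) name."""
    if isinstance(paramName, str):
        # Resolve the name to an attribute and require it to actually be a Param.
        p = getattr(self, paramName, None)
        return isinstance(p, Param)
    else:
        raise TypeError("hasParam(): paramName must be a string, got %r" % type(paramName))
```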





[GitHub] spark pull request: [SPARK-13032] [ML] [PySpark] PySpark support m...

2016-01-28 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10469#issuecomment-176070210
  
@jkbradley Your PR looks good and got merged, thanks!





[GitHub] spark pull request: [Minor] [ML] [PySpark] Cleanup test cases of c...

2016-01-31 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10975#issuecomment-177743797
  
@mengxr I did not find any other class with a similar test except ```KMeans```; is this deliberately designed, or are some test cases missing?





[GitHub] spark pull request: [SPARK-13035] [ML] [PySpark] PySpark ml.cluste...

2016-01-31 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10999#discussion_r51375854
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -69,6 +70,25 @@ class KMeans(JavaEstimator, HasFeaturesCol, 
HasPredictionCol, HasMaxIter, HasTol
 True
 >>> rows[2].prediction == rows[3].prediction
 True
+>>> import os, tempfile
--- End diff --

hmm... Here we combine the test and example functions. I do not have a strong preference about whether this should live here or in the tests file. @jkbradley
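If it moved to the tests file, a minimal sketch of the equivalent check might look like this (```ReusedPySparkTestCase``` is assumed as the base class used by the ```pyspark.ml``` tests):
```python
import tempfile

from pyspark.ml.clustering import KMeans
from pyspark.tests import ReusedPySparkTestCase

class KMeansPersistenceTest(ReusedPySparkTestCase):
    def test_kmeans_save_load(self):
        kmeans = KMeans(k=2, seed=1)
        path = tempfile.mkdtemp()
        kmeans.save(path + "/kmeans")
        loaded = KMeans.load(path + "/kmeans")
        # The loaded estimator should be an exact copy, uid and params included.
        self.assertEqual(loaded.uid, kmeans.uid)
        self.assertEqual(loaded.getK(), kmeans.getK())
```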





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-01 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11000#discussion_r51400507
  
--- Diff: python/pyspark/ml/regression.py ---
@@ -447,7 +447,7 @@ def _create_model(self, java_model):
 
 
 @inherit_doc
-class DecisionTreeModel(JavaModel):
+class DecisionTreeModel(JavaModel, MLWritable, MLReadable):
 """Abstraction for Decision Tree models.
--- End diff --

@Wenpei Please check whether the peer Scala implementation supports ```save/load```. Some algorithms such as ```DecisionTree``` do not support it currently. And you should add a doctest that verifies the correctness of your modification.
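For example, a hypothetical doctest along those lines, assuming ```dt``` is a ```DecisionTreeRegressor``` created earlier in the docstring and that the Scala side supports persistence:
```python
>>> import tempfile
>>> path = tempfile.mkdtemp()
>>> dtr_path = path + "/dtr"
>>> dt.save(dtr_path)
>>> dt2 = DecisionTreeRegressor.load(dtr_path)
>>> dt2.getMaxDepth() == dt.getMaxDepth()
True
```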





[GitHub] spark pull request: [SPARK-13037][ML][PySpark] PySpark ml.recommen...

2016-02-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11044#discussion_r51690764
  
--- Diff: python/pyspark/ml/recommendation.py ---
@@ -81,6 +82,23 @@ class ALS(JavaEstimator, HasCheckpointInterval, 
HasMaxIter, HasPredictionCol, Ha
 Row(user=1, item=0, prediction=2.6258413791656494)
 >>> predictions[2]
 Row(user=2, item=0, prediction=-1.5018409490585327)
+>>> import os, tempfile
+>>> path = tempfile.mkdtemp()
+>>> ALS_path = path + "/als"
+>>> als.save(ALS_path)
+>>> als2 = ALS.load(ALS_path)
+>>> als.getMaxIter()
+5
+>>> model_path = path + "/als_model"
+>>> model.save(model_path)
+>>> model2 = ALSModel.load(model_path)
+>>> model.rank == model2.rank
+True
--- End diff --

Can we also add a test for ```userFactors``` or ```itemFactors```?
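For example (a hypothetical addition, assuming ```model``` and ```model2``` as above and that ```ALSModel``` exposes ```userFactors``` as a DataFrame):
```python
>>> model.userFactors.orderBy("id").collect() == model2.userFactors.orderBy("id").collect()
True
```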





[GitHub] spark pull request: [SPARK-13037][ML][PySpark] PySpark ml.recommen...

2016-02-03 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11044#discussion_r51690527
  
--- Diff: python/pyspark/ml/recommendation.py ---
@@ -81,6 +82,23 @@ class ALS(JavaEstimator, HasCheckpointInterval, 
HasMaxIter, HasPredictionCol, Ha
 Row(user=1, item=0, prediction=2.6258413791656494)
 >>> predictions[2]
 Row(user=2, item=0, prediction=-1.5018409490585327)
+>>> import os, tempfile
+>>> path = tempfile.mkdtemp()
+>>> ALS_path = path + "/als"
--- End diff --

nit: ```ALS_path``` -> ```als_path```





[GitHub] spark pull request: [SPARK-13035] [ML] [PySpark] PySpark ml.cluste...

2016-01-31 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/10999

[SPARK-13035] [ML] [PySpark] PySpark ml.clustering support export/import

PySpark ml.clustering support export/import.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-13035

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10999.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10999


commit dffafbf67782d4ee16524ad7a6edb4090e26c786
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-01-31T09:41:15Z

PySpark ml.clustering support export/import







[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-02 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11000#issuecomment-178575141
  
@Wenpei It looks like ```_transfer_params_from_java``` does not consider params which do not have a default value, and we should handle them. Would you mind creating a JIRA to track this issue?





[GitHub] spark pull request: [SPARK-13153][PySpark] ML persistence failed w...

2016-02-02 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11043#discussion_r51674846
  
--- Diff: python/pyspark/ml/wrapper.py ---
@@ -79,8 +79,9 @@ def _transfer_params_from_java(self):
 for param in self.params:
 if self._java_obj.hasParam(param.name):
 java_param = self._java_obj.getParam(param.name)
-value = _java2py(sc, 
self._java_obj.getOrDefault(java_param))
-self._paramMap[param] = value
+if self._java_obj.hasDefault(java_param):
--- End diff --

We should check with ```isDefined``` rather than ```hasDefault``` here.
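i.e., roughly (a sketch against the surrounding ```wrapper.py``` context, where ```sc``` and ```_java2py``` are already in scope):
```python
for param in self.params:
    if self._java_obj.hasParam(param.name):
        java_param = self._java_obj.getParam(param.name)
        # isDefined covers both explicitly set values and defaults, so
        # params intentionally left without a value are skipped instead
        # of making getOrDefault raise on the Java side.
        if self._java_obj.isDefined(java_param):
            value = _java2py(sc, self._java_obj.getOrDefault(java_param))
            self._paramMap[param] = value
```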





[GitHub] spark pull request: [SPARK-13153][PySpark] ML persistence failed w...

2016-02-02 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11043#issuecomment-178983977
  
ping @mengxr @jkbradley 
Could you add @Wenpei to the whitelist? This is an obvious bug and we should fix it.





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-02 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11000#issuecomment-178966434
  
We should not make all parameters have default values, because some params have no default on purpose. I think we should modify ```_transfer_params_from_java``` so that it does not fetch params which do not have default values.





[GitHub] spark pull request: [SPARK-13047][PYSPARK][ML] Pyspark Params.hasP...

2016-01-27 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10962#discussion_r51075753
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -152,13 +152,17 @@ def isDefined(self, param):
 return self.isSet(param) or self.hasDefault(param)
 
 @since("1.4.0")
-def hasParam(self, paramName):
+def hasParam(self, param):
 """
-Tests whether this instance contains a param with a given
-(string) name.
+Tests whether this instance contains a param.
 """
-param = self._resolveParam(paramName)
-return param in self.params
+if isinstance(param, Param):
+return hasattr(self, param.name)
--- End diff --

If we support ```param``` of type ```Param```, we should not only check ```hasattr(self, param.name)``` but also check ```self.uid == param.parent```. You can directly call ```_shouldOwn``` to do this work. That is, if you provide a ```Param``` to check whether it belongs to the instance, you should check both ```uid``` and ```name```. I vote to only support ```paramName```, which keeps the semantics consistent with Scala.
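For reference, a sketch of what the ```Param```-accepting variant would have to look like, with the ownership check written out inline rather than via ```_shouldOwn```:
```python
def hasParam(self, param):
    """Tests whether this instance owns the param: both name and parent uid must match."""
    if isinstance(param, Param):
        # Ownership means the attribute exists AND the param points back at us.
        return hasattr(self, param.name) and self.uid == param.parent
    elif isinstance(param, str):
        p = getattr(self, param, None)
        return isinstance(p, Param)
    else:
        raise TypeError("hasParam(): expected a Param or a string")
```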





[GitHub] spark pull request: [SPARK-12962] [SQL] [PySpark] PySpark support ...

2016-01-28 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10876#discussion_r51226739
  
--- Diff: python/pyspark/sql/functions.py ---
@@ -263,6 +263,38 @@ def corr(col1, col2):
 return Column(sc._jvm.functions.corr(_to_java_column(col1), 
_to_java_column(col2)))
 
 
+@since(2.0)
+def covar_pop(col1, col2):
+"""Returns a new :class:`Column` for the population covariance of 
``col1``
+and ``col2``.
+
+>>> a = [x * x - 2 * x + 3.5 for x in range(20)]
+>>> b = range(20)
+>>> df = sqlContext.createDataFrame(zip(a, b), ["a", "b"])
+>>> covDf = df.agg(covar_pop("a", "b").alias('c'))
+>>> covDf.selectExpr('abs(c - 565.25) < 1e-16 as t').collect()
--- End diff --

@davies OK, I will clean them up.





[GitHub] spark pull request: [Minor] [ML] [PySpark] Cleanup test cases of c...

2016-01-28 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/10975

[Minor] [ML] [PySpark] Cleanup test cases of clustering.py



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark clustering-cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10975.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10975


commit 0e749a954c27030099809ddc75e7c33fdecf6021
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-01-29T05:29:34Z

Cleanup clustering.py







[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-24 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11000#issuecomment-188217429
  
LGTM





[GitHub] spark pull request: [Minor] [ML] [Doc] Cleanup dots at the end of ...

2016-02-24 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11344#issuecomment-188213659
  
@srowen Some ScalaDoc will end with two dots if we don't fix this; you can refer to the example [here](https://github.com/apache/spark/pull/11344/files#diff-9edc669edcf2c0c7cf1efe4a0a57da80L367).





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-24 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11000#issuecomment-188217276
  
LGTM





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-24 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11000#issuecomment-188216762
  
LGTM





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-24 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11000#issuecomment-188217094
  
LGTM





[GitHub] spark pull request: [Minor] [ML] [Doc] Cleanup dots at the end of ...

2016-02-24 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11344#issuecomment-188281271
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53911268
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala
 ---
@@ -0,0 +1,499 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import scala.util.Random
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.param.ParamsSuite
+import org.apache.spark.ml.util.MLTestingUtils
+import org.apache.spark.mllib.classification.LogisticRegressionSuite._
+import org.apache.spark.mllib.linalg.{BLAS, DenseVector, Vectors}
+import org.apache.spark.mllib.random._
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.{DataFrame, Row}
+
+class GeneralizedLinearRegressionSuite extends SparkFunSuite with 
MLlibTestSparkContext {
+
+  private val seed: Int = 42
+  @transient var datasetGaussianIdentity: DataFrame = _
+  @transient var datasetGaussianLog: DataFrame = _
+  @transient var datasetGaussianInverse: DataFrame = _
+  @transient var datasetBinomial: DataFrame = _
+  @transient var datasetPoissonLog: DataFrame = _
+  @transient var datasetPoissonIdentity: DataFrame = _
+  @transient var datasetPoissonSqrt: DataFrame = _
+  @transient var datasetGammaInverse: DataFrame = _
+  @transient var datasetGammaIdentity: DataFrame = _
+  @transient var datasetGammaLog: DataFrame = _
+
+  override def beforeAll(): Unit = {
+super.beforeAll()
+
+import GeneralizedLinearRegressionSuite._
+
+datasetGaussianIdentity = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "identity"), 2))
+
+datasetGaussianLog = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "log"), 2))
+
+datasetGaussianInverse = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "gaussian", link = "inverse"), 2))
+
+datasetBinomial = {
+  val nPoints = 1
+  val coefficients = Array(-0.57997, 0.912083, -0.371077, -0.819866, 
2.688191)
+  val xMean = Array(5.843, 3.057, 3.758, 1.199)
+  val xVariance = Array(0.6856, 0.1899, 3.116, 0.581)
+
+  val testData =
+generateMultinomialLogisticInput(coefficients, xMean, xVariance, 
true, nPoints, seed)
+
+  sqlContext.createDataFrame(sc.parallelize(testData, 4))
+}
+
+datasetPoissonLog = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 0.25, coefficients = Array(0.22, 0.06), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family = "poisson", link = "log"), 2))
+
+datasetPoissonIdentity = sqlContext.createDataFrame(
+  sc.parallelize(generateGeneralizedLinearRegressionInput(
+intercept = 2.5, coefficients = Array(2.2, 0.6), xMean = 
Array(2.9, 10.5),
+xVariance = Array(0.7, 1.2), nPoints = 1, seed, eps = 0.01,
+family

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53911596
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution 
to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and 
gamma.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution 
function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
+"which provides the relationship between the linear predictor and the 
mean of the " +
+"distribution function. Supported options: identity, log, inverse, 
logit, probit, " +
+"cloglog and sqrt.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+Link.fromName($(link))
+  } else {
+familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if ($(solver) == "irls") {
+  setDefault(maxIter -> 25)
+}
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+familyObj -> linkObj), s"Generalized Linear Regression with 
${$(family)} family " +
+s"does not support ${$(link)} link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalize

[GitHub] spark pull request: [Minor] [ML] [Doc] Clean up sharedParams

2016-02-24 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/11344

[Minor] [ML] [Doc] Clean up sharedParams

## What changes were proposed in this pull request?
Remove duplicated dot at the end of some sharedParams in ScalaDoc.
cc @mengxr @srowen 
## How was this patch tested?
Documentation-only change, no tests.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark shared-cleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11344.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11344


commit a12b1ac390fe6cb15386cdc6e052a1e28c22e992
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-02-24T10:09:50Z

Clean up sharedParams







[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53909986
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution 
to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and 
gamma.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution 
function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
+"which provides the relationship between the linear predictor and the 
mean of the " +
+"distribution function. Supported options: identity, log, inverse, 
logit, probit, " +
+"cloglog and sqrt.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+Link.fromName($(link))
+  } else {
+familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if ($(solver) == "irls") {
+  setDefault(maxIter -> 25)
+}
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+familyObj -> linkObj), s"Generalized Linear Regression with 
${$(family)} family " +
+s"does not support ${$(link)} link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalize

[GitHub] spark pull request: [SPARK-7106][MLlib][PySpark] Support model sav...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11321#discussion_r53908665
  
--- Diff: python/pyspark/mllib/fpm.py ---
@@ -40,6 +41,11 @@ class FPGrowthModel(JavaModelWrapper):
 >>> model = FPGrowth.train(rdd, 0.6, 2)
 >>> sorted(model.freqItemsets().collect())
 [FreqItemset(items=[u'a'], freq=4), FreqItemset(items=[u'c'], freq=3), 
...
+>>> model_path = temp_path + "/fpg_model"
--- End diff --

```/fpm``` is enough, because we only support model save/load under the old MLlib API.
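i.e., the doctest would follow the RDD-based API's pattern of passing the SparkContext explicitly; a sketch, assuming ```model```, ```sc``` and ```temp_path``` from the surrounding docstring:
```python
>>> model_path = temp_path + "/fpm"
>>> model.save(sc, model_path)
>>> sameModel = FPGrowthModel.load(sc, model_path)
>>> sorted(model.freqItemsets().collect()) == sorted(sameModel.freqItemsets().collect())
True
```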





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53909458
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution 
to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and 
gamma.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution 
function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
+"which provides the relationship between the linear predictor and the 
mean of the " +
+"distribution function. Supported options: identity, log, inverse, 
logit, probit, " +
+"cloglog and sqrt.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+Link.fromName($(link))
+  } else {
+familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if ($(solver) == "irls") {
+  setDefault(maxIter -> 25)
+}
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+familyObj -> linkObj), s"Generalized Linear Regression with 
${$(family)} family " +
+s"does not support ${$(link)} link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalize

[GitHub] spark pull request: [SPARK-13504] [SparkR] Add approxQuantile for ...

2016-02-25 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/11383

[SPARK-13504] [SparkR] Add approxQuantile for SparkR

## What changes were proposed in this pull request?
Add ```approxQuantile``` for SparkR.
## How was this patch tested?
unit tests



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-13504

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11383.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11383


commit 4f17adb6b18d53aef08233248d0b8bf0f294ddf4
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-02-26T03:53:46Z

Add approxQuantile for SparkR







[GitHub] spark pull request: [SPARK-13036][SPARK-13318][SPARK-13319] Add sa...

2016-02-26 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11203#discussion_r54218770
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -1330,6 +1448,21 @@ class StringIndexer(JavaEstimator, HasInputCol, 
HasOutputCol, HasHandleInvalid):
 >>> sorted(set([(i[0], str(i[1])) for i in itd.select(itd.id, 
itd.label2).collect()]),
 ... key=lambda x: x[0])
 [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'a'), (4, 'a'), (5, 'c')]
+>>> stringIndexerPath = temp_path + "/string-indexer"
+>>> stringIndexer.save(stringIndexerPath)
+>>> loadedIndexerModel = 
StringIndexer.load(stringIndexerPath).fit(stringIndDf)
--- End diff --

There is no need to train here. Like the others, check ```StringIndexer``` param equality, then save the model trained at L1323, load the model, and check model equality; a sketch follows below.
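A minimal sketch of that pattern (hypothetical; it assumes the ```stringIndexer``` estimator and the model fitted at L1323 are in scope, and that ```StringIndexerModel``` exposes a comparable attribute such as ```labels```):

```python
>>> stringIndexerPath = temp_path + "/string-indexer"
>>> stringIndexer.save(stringIndexerPath)
>>> loadedIndexer = StringIndexer.load(stringIndexerPath)
>>> loadedIndexer.getHandleInvalid() == stringIndexer.getHandleInvalid()
True
>>> modelPath = temp_path + "/string-indexer-model"
>>> model.save(modelPath)
>>> loadedModel = StringIndexerModel.load(modelPath)
>>> loadedModel.labels == model.labels
True
```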





[GitHub] spark pull request: [SPARK-13036][SPARK-13318][SPARK-13319] Add sa...

2016-02-26 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11203#discussion_r54217956
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -443,6 +477,12 @@ class HashingTF(JavaTransformer, HasInputCol, 
HasOutputCol, HasNumFeatures):
 >>> params = {hashingTF.numFeatures: 5, hashingTF.outputCol: "vector"}
 >>> hashingTF.transform(df, params).head().vector
 SparseVector(5, {2: 1.0, 3: 1.0, 4: 1.0})
+>>> hashingTFPath = temp_path + "/hashing-tf"
+>>> hashingTF.save(hashingTFPath)
+>>> loadedHashingTF = HashingTF.load(hashingTFPath)
+>>> param = loadedHashingTF.getParam("numFeatures")
+>>> loadedHashingTF.getOrDefault(param) == 
hashingTF.getOrDefault(param)
+True
--- End diff --

Could you use ```getNumFeatures``` like the other transformers in the doctest? It will make your test cleaner. ```HashingTF``` extends ```HasNumFeatures```, so it has this method.
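A sketch of the suggested check, reusing ```hashingTFPath``` from the quoted doctest:

```python
>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
True
```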





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-25 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r54206474
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -157,6 +157,12 @@ private[ml] class WeightedLeastSquares(
 private[ml] object WeightedLeastSquares {
 
   /**
+   * In order to take the normal equation approach efficiently, 
[[WeightedLeastSquares]]
+   * only supports the number of features is no more than 4096.
+   */
+  val MaxNumFeatures: Int = 4096
--- End diff --

OK, I will update it to ```MAX_NUM_FEATURES``` after collecting more 
comments. Thanks!





[GitHub] spark pull request: [SPARK-13545] [MLlib] [PySpark] Make MLlib LR'...

2016-02-28 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/11424

[SPARK-13545] [MLlib] [PySpark] Make MLlib LR's default parameters 
consistent in Scala and Python

## What changes were proposed in this pull request?
Make MLlib LR's default parameters consistent in Scala and Python.
## How was this patch tested?
No new tests, should pass current tests.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-13545

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11424.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11424


commit fc370c0c3fba12af42551d4d71043cb54e3fde71
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-02-29T05:32:58Z

Make MLlib LR's default parameters consistent in Scala and Python







[GitHub] spark pull request: [SPARK-13322] [ML] AFTSurvivalRegression suppo...

2016-02-25 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/11365

[SPARK-13322] [ML] AFTSurvivalRegression supports feature standardization

## What changes were proposed in this pull request?

AFTSurvivalRegression should support feature standardization; it will improve the convergence rate.

## How was this patch tested?
unit test.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-13322

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11365.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11365


commit 0e5efab562d4174908fab5d9d9a788c95fb183e0
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-02-25T06:27:52Z

AFTSurvivalRegression supports feature standardization

commit ae28544d9141c8ddbad57ee79873125edbb115ec
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-02-25T07:57:42Z

add test case: numerical stability of standardization







[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

2016-02-25 Thread yanboliang
Github user yanboliang closed the pull request at:

https://github.com/apache/spark/pull/11247





[GitHub] spark pull request: [SPARK-13372] [ML] Fix LogisticRegression when...

2016-02-25 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11247#issuecomment-188668664
  
@dbtsai I think you convinced me, and I have also checked the R glmnet implementation. The current behavior makes more sense, so I will close this PR. Thanks for your kind clarification!





[GitHub] spark pull request: [Minor] [ML] [Doc] Cleanup dots at the end of ...

2016-02-25 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11344#issuecomment-188778682
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-13490] [ML] ML LinearRegression should ...

2016-02-25 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/11367

[SPARK-13490] [ML] ML LinearRegression should cache standardization param 
value

## What changes were proposed in this pull request?
Like [SPARK-13132](https://issues.apache.org/jira/browse/SPARK-13132) for LogisticRegression, when LinearRegression is fit with L1 regularization, the inner functor passed to the quasi-Newton optimizer in ```org.apache.spark.ml.regression.LinearRegression#train``` makes repeated calls to ```$(standardization)```. This ultimately involves repeated string interpolation triggered by ```org.apache.spark.ml.param.Param#hashCode```. We should cache the value of ```standardization``` rather than re-fetch it from the ParamMap on every iteration.

## How was this patch tested?
No extra tests are added. It should pass all existing tests. 




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-13490

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11367.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11367


commit f9c79b136ac41bd91f482a4f948acb780f493516
Author: Yanbo Liang <yblia...@gmail.com>
Date:   2016-02-25T09:35:57Z

ML LinearRegression should cache standardization param value







[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10242#discussion_r54395410
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -291,6 +292,317 @@ def _create_model(self, java_model):
 return BisectingKMeansModel(java_model)
 
 
+class LDAModel(JavaModel):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+"""
+
+@since("2.0.0")
+def isDistributed(self):
+"""Indicates whether this instance is of type 
DistributedLDAModel"""
+return self._call_java("isDistributed")
+
+@since("2.0.0")
+def vocabSize(self):
+"""Vocabulary size (number of terms or terms in the vocabulary)"""
+return self._call_java("vocabSize")
+
+@since("2.0.0")
+def topicsMatrix(self):
+"""Inferred topics, where each topic is represented by a 
distribution over terms."""
+return self._call_java("topicsMatrix")
+
+@since("2.0.0")
+def logLikelihood(self, dataset):
+"""Calculates a lower bound on the log likelihood of the entire 
corpus."""
+return self._call_java("logLikelihood", dataset)
+
+@since("2.0.0")
+def logPerplexity(self, dataset):
+"""Calculate an upper bound on perplexity.  (Lower is better.)"""
+return self._call_java("logPerplexity", dataset)
+
+@since("2.0.0")
+def describeTopics(self, maxTermsPerTopic=10):
+"""Return the topics described by weighted terms.
+
+WARNING: If vocabSize and k are large, this can return a large 
object!
+
+:param maxTermsPerTopic: Maximum number of terms to collect for 
each topic.
+(default: vocabulary size)
+:return: Array over topics. Each topic is represented as a pair of 
matching arrays:
+(term indices, term weights in topic).
+Each topic's terms are sorted in order of decreasing weight.
+"""
+return self._call_java("describeTopics", maxTermsPerTopic)
+
+
+class DistributedLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 2.0.0
+"""
+def toLocal(self):
+return self._call_java("toLocal")
+
+
+class LocalLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 2.0.0
+"""
+pass
+
+
+class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, 
HasCheckpointInterval):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+
+>>> from pyspark.mllib.linalg import Vectors, SparseVector
+>>> from pyspark.ml.clustering import LDA
+>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \
+[2, SparseVector(2, {0: 1.0})],], ["id", "features"])
+>>> lda = LDA(k=2, seed=1, optimizer="em")
+>>> model = lda.fit(df)
+>>> model.isDistributed()
+True
+>>> localModel = model.toLocal()
+>>> localModel.isDistributed()
+False
+>>> model.vocabSize()
+2
+>>> model.describeTopics().show()
++-+---++
+|topic|termIndices| termWeights|
+ 

[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10242#discussion_r54394942
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -291,6 +292,317 @@ def _create_model(self, java_model):
 return BisectingKMeansModel(java_model)
 
 
+class LDAModel(JavaModel):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+"""
+
+@since("2.0.0")
+def isDistributed(self):
+"""Indicates whether this instance is of type 
DistributedLDAModel"""
+return self._call_java("isDistributed")
+
+@since("2.0.0")
+def vocabSize(self):
+"""Vocabulary size (number of terms or terms in the vocabulary)"""
+return self._call_java("vocabSize")
+
+@since("2.0.0")
+def topicsMatrix(self):
+"""Inferred topics, where each topic is represented by a 
distribution over terms."""
+return self._call_java("topicsMatrix")
+
+@since("2.0.0")
+def logLikelihood(self, dataset):
+"""Calculates a lower bound on the log likelihood of the entire 
corpus."""
+return self._call_java("logLikelihood", dataset)
+
+@since("2.0.0")
+def logPerplexity(self, dataset):
+"""Calculate an upper bound on perplexity.  (Lower is better.)"""
+return self._call_java("logPerplexity", dataset)
+
+@since("2.0.0")
+def describeTopics(self, maxTermsPerTopic=10):
+"""Return the topics described by weighted terms.
+
+WARNING: If vocabSize and k are large, this can return a large 
object!
+
+:param maxTermsPerTopic: Maximum number of terms to collect for 
each topic.
+(default: vocabulary size)
+:return: Array over topics. Each topic is represented as a pair of 
matching arrays:
+(term indices, term weights in topic).
+Each topic's terms are sorted in order of decreasing weight.
+"""
+return self._call_java("describeTopics", maxTermsPerTopic)
+
+
+class DistributedLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 2.0.0
+"""
+def toLocal(self):
+return self._call_java("toLocal")
+
+
+class LocalLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 2.0.0
+"""
+pass
+
+
+class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, 
HasCheckpointInterval):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+
+>>> from pyspark.mllib.linalg import Vectors, SparseVector
+>>> from pyspark.ml.clustering import LDA
+>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \
+[2, SparseVector(2, {0: 1.0})],], ["id", "features"])
+>>> lda = LDA(k=2, seed=1, optimizer="em")
+>>> model = lda.fit(df)
+>>> model.isDistributed()
+True
+>>> localModel = model.toLocal()
+>>> localModel.isDistributed()
+False
+>>> model.vocabSize()
+2
+>>> model.describeTopics().show()
++-+---++
+|topic|termIndices| termWeights|
+ 

[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10242#discussion_r54395971
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -291,6 +292,317 @@ def _create_model(self, java_model):
 return BisectingKMeansModel(java_model)
 
 
+class LDAModel(JavaModel):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+"""
--- End diff --

```.. versionadded:: 2.0.0``` for ```LDAModel```.





[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10242#discussion_r54394622
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -167,6 +167,200 @@ def getInitSteps(self):
 return self.getOrDefault(self.initSteps)
 
 
+class LDAModel(JavaModel):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+"""
+
+@since("1.7.0")
+def isDistributed(self):
+"""Indicates whether this instance is of type 
DistributedLDAModel"""
+return self._call_java("isDistributed")
+
+@since("1.7.0")
+def vocabSize(self):
+"""Vocabulary size (number of terms or terms in the vocabulary)"""
+return self._call_java("vocabSize")
+
+@since("1.7.0")
+def topicsMatrix(self):
+"""Inferred topics, where each topic is represented by a 
distribution over terms."""
+return self._call_java("topicsMatrix")
+
+@since("1.7.0")
+def logLikelihood(self, dataset):
+"""Calculates a lower bound on the log likelihood of the entire 
corpus."""
+return self._call_java("logLikelihood", dataset)
+
+@since("1.7.0")
+def logPerplexity(self, dataset):
+"""Calculate an upper bound on perplexity.  (Lower is better.)"""
+return self._call_java("logPerplexity", dataset)
+
+@since("1.7.0")
+def describeTopics(self, maxTermsPerTopic=10):
+"""Return the topics described by weighted terms.
+
+WARNING: If vocabSize and k are large, this can return a large 
object!
+
+:param maxTermsPerTopic: Maximum number of terms to collect for 
each topic.
+(default: vocabulary size)
+:return: Array over topics. Each topic is represented as a pair of 
matching arrays:
+(term indices, term weights in topic).
+Each topic's terms are sorted in order of decreasing weight.
+"""
+return self._call_java("describeTopics", maxTermsPerTopic)
+
+
+class DistributedLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 1.7.0
+"""
+def toLocal(self):
+return self._call_java("toLocal")
+
+
+class LocalLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 1.7.0
+"""
+pass
+
+
+class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, 
HasCheckpointInterval):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+
+>>> from pyspark.mllib.linalg import Vectors, SparseVector
+>>> from pyspark.ml.clustering import LDA
+>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \
--- End diff --

Here we usually make the continuation line start with ```...```; you can refer to [this example](https://github.com/apache/spark/blob/master/python/pyspark/ml/clustering.py#L58). Concretely, it would look like the sketch below.
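That is, the same ```createDataFrame``` call from the quoted diff, reflowed in doctest continuation style:

```python
>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],], ["id", "features"])
```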





[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10242#discussion_r54394752
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -291,6 +292,317 @@ def _create_model(self, java_model):
 return BisectingKMeansModel(java_model)
 
 
+class LDAModel(JavaModel):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+"""
+
+@since("2.0.0")
+def isDistributed(self):
+"""Indicates whether this instance is of type 
DistributedLDAModel"""
+return self._call_java("isDistributed")
+
+@since("2.0.0")
+def vocabSize(self):
+"""Vocabulary size (number of terms or terms in the vocabulary)"""
+return self._call_java("vocabSize")
+
+@since("2.0.0")
+def topicsMatrix(self):
+"""Inferred topics, where each topic is represented by a 
distribution over terms."""
+return self._call_java("topicsMatrix")
+
+@since("2.0.0")
+def logLikelihood(self, dataset):
+"""Calculates a lower bound on the log likelihood of the entire 
corpus."""
+return self._call_java("logLikelihood", dataset)
+
+@since("2.0.0")
+def logPerplexity(self, dataset):
+"""Calculate an upper bound on perplexity.  (Lower is better.)"""
+return self._call_java("logPerplexity", dataset)
+
+@since("2.0.0")
+def describeTopics(self, maxTermsPerTopic=10):
+"""Return the topics described by weighted terms.
+
+WARNING: If vocabSize and k are large, this can return a large 
object!
+
+:param maxTermsPerTopic: Maximum number of terms to collect for 
each topic.
+(default: vocabulary size)
+:return: Array over topics. Each topic is represented as a pair of 
matching arrays:
+(term indices, term weights in topic).
+Each topic's terms are sorted in order of decreasing weight.
+"""
+return self._call_java("describeTopics", maxTermsPerTopic)
+
+
+class DistributedLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 2.0.0
+"""
+def toLocal(self):
+return self._call_java("toLocal")
+
+
+class LocalLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 2.0.0
+"""
+pass
+
+
+class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, 
HasCheckpointInterval):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+
+>>> from pyspark.mllib.linalg import Vectors, SparseVector
+>>> from pyspark.ml.clustering import LDA
+>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \
+[2, SparseVector(2, {0: 1.0})],], ["id", "features"])
+>>> lda = LDA(k=2, seed=1, optimizer="em")
+>>> model = lda.fit(df)
+>>> model.isDistributed()
+True
+>>> localModel = model.toLocal()
+>>> localModel.isDistributed()
+False
+>>> model.vocabSize()
+2
+>>> model.describeTopics().show()
++-+---++
+|topic|termIndices| termWeights|
++-+--

[GitHub] spark pull request: [SPARK-11940][PYSPARK] Python API for ml.clust...

2016-02-29 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10242#discussion_r54395661
  
--- Diff: python/pyspark/ml/clustering.py ---
@@ -291,6 +292,317 @@ def _create_model(self, java_model):
 return BisectingKMeansModel(java_model)
 
 
+class LDAModel(JavaModel):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+"""
+
+@since("2.0.0")
+def isDistributed(self):
+"""Indicates whether this instance is of type 
DistributedLDAModel"""
+return self._call_java("isDistributed")
+
+@since("2.0.0")
+def vocabSize(self):
+"""Vocabulary size (number of terms or terms in the vocabulary)"""
+return self._call_java("vocabSize")
+
+@since("2.0.0")
+def topicsMatrix(self):
+"""Inferred topics, where each topic is represented by a 
distribution over terms."""
+return self._call_java("topicsMatrix")
+
+@since("2.0.0")
+def logLikelihood(self, dataset):
+"""Calculates a lower bound on the log likelihood of the entire 
corpus."""
+return self._call_java("logLikelihood", dataset)
+
+@since("2.0.0")
+def logPerplexity(self, dataset):
+"""Calculate an upper bound on perplexity.  (Lower is better.)"""
+return self._call_java("logPerplexity", dataset)
+
+@since("2.0.0")
+def describeTopics(self, maxTermsPerTopic=10):
+"""Return the topics described by weighted terms.
+
+WARNING: If vocabSize and k are large, this can return a large 
object!
+
+:param maxTermsPerTopic: Maximum number of terms to collect for 
each topic.
+(default: vocabulary size)
+:return: Array over topics. Each topic is represented as a pair of 
matching arrays:
+(term indices, term weights in topic).
+Each topic's terms are sorted in order of decreasing weight.
+"""
+return self._call_java("describeTopics", maxTermsPerTopic)
+
+
+class DistributedLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 2.0.0
+"""
+def toLocal(self):
+return self._call_java("toLocal")
+
+
+class LocalLDAModel(LDAModel):
+"""
+Model fitted by LDA.
+
+.. versionadded:: 2.0.0
+"""
+pass
+
+
+class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, 
HasCheckpointInterval):
+""" A clustering model derived from the LDA method.
+
+Latent Dirichlet Allocation (LDA), a topic model designed for text 
documents.
+Terminology
+- "word" = "term": an element of the vocabulary
+- "token": instance of a term appearing in a document
+- "topic": multinomial distribution over words representing some 
concept
+References:
+- Original LDA paper (journal version):
+Blei, Ng, and Jordan.  "Latent Dirichlet Allocation."  JMLR, 2003.
+
+>>> from pyspark.mllib.linalg import Vectors, SparseVector
+>>> from pyspark.ml.clustering import LDA
+>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \
+[2, SparseVector(2, {0: 1.0})],], ["id", "features"])
+>>> lda = LDA(k=2, seed=1, optimizer="em")
+>>> model = lda.fit(df)
+>>> model.isDistributed()
+True
+>>> localModel = model.toLocal()
+>>> localModel.isDistributed()
+False
+>>> model.vocabSize()
+2
+>>> model.describeTopics().show()
++-+---++
+|topic|termIndices| termWeights|
+ 

[GitHub] spark pull request: [SPARK-13506] [MLlib] Fix the wrong parameter ...

2016-02-26 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11387#issuecomment-189200629
  
LGTM cc @mengxr





[GitHub] spark pull request: [SPARK-13036][SPARK-13318][SPARK-13319] Add sa...

2016-02-26 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11203#issuecomment-189188086
  
@yinxusen Looks good overall, I left some inline comments. Thanks!





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11000#discussion_r53635317
  
--- Diff: python/pyspark/ml/regression.py ---
@@ -172,6 +172,16 @@ class IsotonicRegression(JavaEstimator, 
HasFeaturesCol, HasLabelCol, HasPredicti
 0.0
 >>> model.boundaries
 DenseVector([0.0, 1.0])
+>>> ir_path = temp_path + "/ir"
+>>> ir.save(ir_path)
+>>> ir2 = IsotonicRegression.load(ir_path)
+>>> ir2.getIsotonic()
+True
+>>> model_path = temp_path + "/ir_model"
+>>> model.save(model_path)
+>>> model2 = IsotonicRegressionModel.load(model_path)
+>>> model.boundaries == model2.boundaries
+True
--- End diff --

We should also check equality of ```predictions```.
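A sketch of the suggested extra check, in the same style as the surrounding doctest:

```python
>>> model.predictions == model2.predictions
True
```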





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-22 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11000#issuecomment-187221567
  
Looks good except for some minor issues.





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-22 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11000#issuecomment-187222806
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-13033][ML][PySpark] Add import/export f...

2016-02-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11000#discussion_r53635702
  
--- Diff: python/pyspark/ml/regression.py ---
@@ -690,6 +700,18 @@ class AFTSurvivalRegression(JavaEstimator, 
HasFeaturesCol, HasLabelCol, HasPredi
 |  0.0|(1,[],[])|   0.0|   1.0|
 +-+-+--+--+
 ...
+>>> aftsr_path = temp_path + "/aftsr"
+>>> aftsr.save(aftsr_path)
+>>> aftsr2 = AFTSurvivalRegression.load(aftsr_path)
+>>> aftsr2.getMaxIter()
+100
+>>> model_path = temp_path + "/aftsr_model"
+>>> model.save(model_path)
+>>> model2 = AFTSurvivalRegressionModel.load(model_path)
+>>> model.coefficients == model2.coefficients
+True
+>>> model.intercept == model2.intercept
+True
--- End diff --

We should also check equality of the model's ```scale```.
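For example, appended after the existing checks:

```python
>>> model.scale == model2.scale
True
```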





[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/11136#discussion_r53913710
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala
 ---
@@ -0,0 +1,565 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.regression
+
+import breeze.stats.distributions.{Gaussian => GD}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.PredictorParams
+import org.apache.spark.ml.feature.Instance
+import org.apache.spark.ml.optim._
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared._
+import org.apache.spark.ml.util.Identifiable
+import org.apache.spark.mllib.linalg.{BLAS, Vector}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{DataFrame, Row}
+import org.apache.spark.sql.functions._
+
+/**
+ * Params for Generalized Linear Regression.
+ */
+private[regression] trait GeneralizedLinearRegressionBase extends 
PredictorParams
+  with HasFitIntercept with HasMaxIter with HasTol with HasRegParam with 
HasWeightCol
+  with HasSolver with Logging {
+
+  /**
+   * Param for the name of family which is a description of the error 
distribution
+   * to be used in the model.
+   * Supported options: "gaussian", "binomial", "poisson" and "gamma".
+   * Default is "gaussian".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val family: Param[String] = new Param(this, "family",
+"The name of family which is a description of the error distribution 
to be used in the " +
+  "model. Supported options: gaussian(default), binomial, poisson and 
gamma.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedFamilyNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getFamily: String = $(family)
+
+  /**
+   * Param for the name of link function which provides the relationship
+   * between the linear predictor and the mean of the distribution 
function.
+   * Supported options: "identity", "log", "inverse", "logit", "probit", 
"cloglog" and "sqrt".
+   * @group param
+   */
+  @Since("2.0.0")
+  final val link: Param[String] = new Param(this, "link", "The name of 
link function " +
+"which provides the relationship between the linear predictor and the 
mean of the " +
+"distribution function. Supported options: identity, log, inverse, 
logit, probit, " +
+"cloglog and sqrt.",
+
ParamValidators.inArray[String](GeneralizedLinearRegression.supportedLinkNames.toArray))
+
+  /** @group getParam */
+  @Since("2.0.0")
+  def getLink: String = $(link)
+
+  import GeneralizedLinearRegression._
+  protected lazy val familyObj = Family.fromName($(family))
+  protected lazy val linkObj = if (isDefined(link)) {
+Link.fromName($(link))
+  } else {
+familyObj.defaultLink
+  }
+  protected lazy val familyAndLink = new FamilyAndLink(familyObj, linkObj)
+
+  @Since("2.0.0")
+  override def validateParams(): Unit = {
+if ($(solver) == "irls") {
+  setDefault(maxIter -> 25)
+}
+if (isDefined(link)) {
+  
require(GeneralizedLinearRegression.supportedFamilyAndLinkParis.contains(
+familyObj -> linkObj), s"Generalized Linear Regression with 
${$(family)} family " +
+s"does not support ${$(link)} link function.")
+}
+  }
+}
+
+/**
+ * :: Experimental ::
+ *
+ * Fit a Generalized Linear Model 
([[https://en.wikipedia.org/wiki/Generalize

[GitHub] spark pull request: [SPARK-12811] [ML] Estimator for Generalized L...

2016-02-24 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/11136#issuecomment-188165176
  
@mengxr This PR is ready for another pass. Thanks!





[GitHub] spark pull request: [SPARK-9835] [ML] IterativelyReweightedLeastSq...

2016-01-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10639#discussion_r50390322
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/glm/Families.scala ---
@@ -0,0 +1,138 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.glm
+
+import org.apache.spark.rdd.RDD
+
+/**
+ * A description of the error distribution and link function to be used in 
the model.
+ * @param link a link function instance
+ */
+private[ml] abstract class Family(val link: Link) extends Serializable {
--- End diff --

I think ```Families``` can be used by [SPARK-12811](https://issues.apache.org/jira/browse/SPARK-12811), which provides the Estimator interface for GLMs, so I moved it to a new package named ```glm```.
Here we have two ways to support GLMs:
* Implement ```reweightFunc``` for each ```Family```/```Link``` combination directly from the mathematical formula.
* Implement the ```Family``` framework as I have done, plus a factory method that produces the ```reweightFunc``` for a given family and link.

The former is more efficient to execute; the latter is easier to understand (see the sketch below). Looking forward to your comments.
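To make the trade-off concrete, here is a rough Python sketch of the two options (purely illustrative, not the actual Scala code; it uses the standard IRLS working response ```z = eta + (y - mu) * g'(mu)``` and working weight ```w = 1 / (Var(mu) * g'(mu)^2)```):

```python
import math

# Option 1: write the reweight function out by hand for one concrete
# case (Poisson family with log link), straight from the IRLS formulas.
def poisson_log_reweight(y, mu):
    eta = math.log(mu)        # linear predictor g(mu) = log(mu)
    z = eta + (y - mu) / mu   # working response, since g'(mu) = 1/mu
    w = mu                    # working weight, 1/(Var(mu) * g'(mu)^2)
    return z, w

# Option 2: a small Family/Link framework plus a factory method that
# derives the same reweight function from the variance and the link.
class LogLink:
    def link(self, mu):
        return math.log(mu)

    def deriv(self, mu):
        return 1.0 / mu

class PoissonFamily:
    def variance(self, mu):
        return mu

def make_reweight_func(family, link):
    def reweight(y, mu):
        eta = link.link(mu)
        z = eta + (y - mu) * link.deriv(mu)
        w = 1.0 / (family.variance(mu) * link.deriv(mu) ** 2)
        return z, w
    return reweight

# Both give the same result:
f = make_reweight_func(PoissonFamily(), LogLink())
assert f(3.0, 2.0) == poisson_log_reweight(3.0, 2.0)
```

Option 1 avoids the per-observation virtual calls; Option 2 covers every supported family/link pair with a single implementation.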

