[GitHub] spark issue #14830: [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14830
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72859/
Test FAILed.


[GitHub] spark issue #14830: [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14830
  
 Build triggered.


[GitHub] spark issue #14830: [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14830
  
Build finished. Test FAILed.


[GitHub] spark issue #14830: [SPARK-16992][PYSPARK][DOCS] import sort and autopep8 on...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/14830
  
Build started.


[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16699#discussion_r100975590
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -944,15 +981,27 @@ class GeneralizedLinearRegressionModel private[ml] (
   private lazy val familyAndLink = FamilyAndLink(this)
 
   override protected def predict(features: Vector): Double = {
-val eta = predictLink(features)
+if (!isSetOffsetCol(this)) {
--- End diff --

Done.


[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16699#discussion_r100975556
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -406,6 +435,14 @@ object GeneralizedLinearRegression extends DefaultParamsReadable[GeneralizedLine
 
   private[regression] val epsilon: Double = 1E-16
 
+  /** Checks whether weight column is set and nonempty */
+  private[regression] def isSetWeightCol(params: GeneralizedLinearRegressionBase): Boolean =
+    params.isSet(params.weightCol) && !params.getWeightCol.isEmpty
+
+  /** Checks whether offset column is set and nonempty */
+  private[regression] def isSetOffsetCol(params: GeneralizedLinearRegressionBase): Boolean =
+    params.isSet(params.offsetCol) && !params.getOffsetCol.isEmpty
--- End diff --

I adopted `params.getOffsetCol.nonEmpty`. 
As for `isDefined` vs. `isSet`: in my other PR #16344, @yanboliang suggested that `isSet` 
is more accurate, that we should be using it, and that we may need to change the existing 
`isDefined` usages at some point. For this reason, I'm using `isSet` (which I believe 
checks whether the param is explicitly set by the user).
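A minimal sketch of the distinction, using the pyspark mirror of the Params API (an 
illustration under assumed API, not code from this PR):

```python
from pyspark.ml.regression import GeneralizedLinearRegression

glr = GeneralizedLinearRegression()
glr.isDefined(glr.regParam)  # True: a default value (0.0) counts as defined
glr.isSet(glr.regParam)      # False: the user never set it explicitly

glr = GeneralizedLinearRegression(regParam=0.1)
glr.isSet(glr.regParam)      # True: now explicitly set by the user
```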


[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16699#discussion_r100975164
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
@@ -168,6 +179,7 @@ private[regression] trait GeneralizedLinearRegressionBase extends PredictorParam
 }
 
 val newSchema = super.validateAndTransformSchema(schema, fitting, featuresDataType)
+if (isSetOffsetCol(this)) SchemaUtils.checkNumericType(schema, $(offsetCol))
--- End diff --

@sethah This is a great point. The new commit does allow the offset to be missing when 
making predictions. I now check the validity of the offset only when it is set and 
available in the prediction set; otherwise, the offset is set to zero. Thanks for 
catching this and contributing the test.


[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16699#discussion_r100974912
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -798,77 +798,160 @@ class GeneralizedLinearRegressionSuite
 }
   }
 
-  test("glm summary: gaussian family with weight") {
+  test("generalized linear regression with offset") {
 /*
-   R code:
+  R code:
+  library(statmod)
 
-   A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
-   b <- c(17, 19, 23, 29)
-   w <- c(1, 2, 3, 4)
-   df <- as.data.frame(cbind(A, b))
- */
-val datasetWithWeight = Seq(
-  Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
-  Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
-  Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
-  Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
+  df <- as.data.frame(matrix(c(
+0.2, 1.0, 2.0, 0.0, 5.0,
+0.5, 2.1, 0.5, 1.0, 2.0,
+0.9, 0.4, 1.0, 2.0, 1.0,
+0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
+  families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
+  f1 <- V1 ~ -1 + V4 + V5
+  f2 <- V1 ~ V4 + V5
+  for (f in c(f1, f2)) {
+for (fam in families) {
+  model <- glm(f, df, family = fam, weights = V2, offset = V3)
+  print(as.vector(coef(model)))
+}
+  }
+  [1]  0.5169222 -0.334
+  [1]  0.9419107 -0.6864404
+  [1]  0.1812436 -0.6568422
+  [1] -0.2869094  0.7857710
+  [1] 0.1055254 0.2979113
+  [1] -0.05990345  0.53188982 -0.32118415
+  [1] -0.2147117  0.9911750 -0.6356096
+  [1] -1.5616130  0.6646470 -0.3192581
+  [1]  0.3390397 -0.3406099  0.6870259
+  [1] 0.3665034 0.1039416 0.1484616
+*/
+val dataset = Seq(
+  OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
+  OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
+  OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
+  OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
 ).toDF()
+
+val expected = Seq(
+  Vectors.dense(0, 0.5169222, -0.334),
+  Vectors.dense(0, 0.9419107, -0.6864404),
+  Vectors.dense(0, 0.1812436, -0.6568422),
+  Vectors.dense(0, -0.2869094, 0.785771),
+  Vectors.dense(0, 0.1055254, 0.2979113),
+  Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
+  Vectors.dense(-0.2147117, 0.991175, -0.6356096),
+  Vectors.dense(-1.561613, 0.664647, -0.3192581),
+  Vectors.dense(0.3390397, -0.3406099, 0.6870259),
+  Vectors.dense(0.3665034, 0.1039416, 0.1484616))
+
+import GeneralizedLinearRegression._
+
+var idx = 0
+
+for (fitIntercept <- Seq(false, true)) {
+  for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
+var trainer = new GeneralizedLinearRegression().setFamily(family)
--- End diff --

Changed.


[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread actuaryzhang
Github user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16699#discussion_r100974891
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/GeneralizedLinearRegressionSuite.scala ---
@@ -798,77 +798,160 @@ class GeneralizedLinearRegressionSuite
 }
   }
 
-  test("glm summary: gaussian family with weight") {
+  test("generalized linear regression with offset") {
 /*
-   R code:
+  R code:
+  library(statmod)
 
-   A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2)
-   b <- c(17, 19, 23, 29)
-   w <- c(1, 2, 3, 4)
-   df <- as.data.frame(cbind(A, b))
- */
-val datasetWithWeight = Seq(
-  Instance(17.0, 1.0, Vectors.dense(0.0, 5.0).toSparse),
-  Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)),
-  Instance(23.0, 3.0, Vectors.dense(2.0, 11.0)),
-  Instance(29.0, 4.0, Vectors.dense(3.0, 13.0))
+  df <- as.data.frame(matrix(c(
+0.2, 1.0, 2.0, 0.0, 5.0,
+0.5, 2.1, 0.5, 1.0, 2.0,
+0.9, 0.4, 1.0, 2.0, 1.0,
+0.7, 0.7, 0.0, 3.0, 3.0), 4, 5, byrow = TRUE))
+  families <- list(gaussian, binomial, poisson, Gamma, tweedie(1.5))
+  f1 <- V1 ~ -1 + V4 + V5
+  f2 <- V1 ~ V4 + V5
+  for (f in c(f1, f2)) {
+for (fam in families) {
+  model <- glm(f, df, family = fam, weights = V2, offset = V3)
+  print(as.vector(coef(model)))
+}
+  }
+  [1]  0.5169222 -0.334
+  [1]  0.9419107 -0.6864404
+  [1]  0.1812436 -0.6568422
+  [1] -0.2869094  0.7857710
+  [1] 0.1055254 0.2979113
+  [1] -0.05990345  0.53188982 -0.32118415
+  [1] -0.2147117  0.9911750 -0.6356096
+  [1] -1.5616130  0.6646470 -0.3192581
+  [1]  0.3390397 -0.3406099  0.6870259
+  [1] 0.3665034 0.1039416 0.1484616
+*/
+val dataset = Seq(
+  OffsetInstance(0.2, 1.0, 2.0, Vectors.dense(0.0, 5.0)),
+  OffsetInstance(0.5, 2.1, 0.5, Vectors.dense(1.0, 2.0)),
+  OffsetInstance(0.9, 0.4, 1.0, Vectors.dense(2.0, 1.0)),
+  OffsetInstance(0.7, 0.7, 0.0, Vectors.dense(3.0, 3.0))
 ).toDF()
+
+val expected = Seq(
+  Vectors.dense(0, 0.5169222, -0.334),
+  Vectors.dense(0, 0.9419107, -0.6864404),
+  Vectors.dense(0, 0.1812436, -0.6568422),
+  Vectors.dense(0, -0.2869094, 0.785771),
+  Vectors.dense(0, 0.1055254, 0.2979113),
+  Vectors.dense(-0.05990345, 0.53188982, -0.32118415),
+  Vectors.dense(-0.2147117, 0.991175, -0.6356096),
+  Vectors.dense(-1.561613, 0.664647, -0.3192581),
+  Vectors.dense(0.3390397, -0.3406099, 0.6870259),
+  Vectors.dense(0.3665034, 0.1039416, 0.1484616))
+
+import GeneralizedLinearRegression._
+
+var idx = 0
+
+for (fitIntercept <- Seq(false, true)) {
+  for (family <- Seq("gaussian", "binomial", "poisson", "gamma", "tweedie")) {
--- End diff --

I did implement this, but it seems that the order of the values in 
`GeneralizedLinearRegression.supportedFamilyNames` changes from test to test... 
I'm not sure why this happened, but since it's a minor issue, I just reverted it.
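(If the instability comes from iterating an unordered collection, hash-based iteration 
order would explain it; this is a guess, not verified against the PR. A quick 
illustration of the symptom and the usual fix:)

```python
# Iteration order of an unordered set can differ between interpreter runs
# (hash randomization), which makes test output order unstable.
families = {"gaussian", "binomial", "poisson", "gamma", "tweedie"}
print(list(families))    # order may vary from run to run
print(sorted(families))  # deterministic: sort before iterating in tests
```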


[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16699
  
**[Test build #72858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72858/testReport)** for PR 16699 at commit [`e95c25b`](https://github.com/apache/spark/commit/e95c25b73682669b65f194141ae08c56deb4d90c).


[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16699
  
Merged build started.


[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16699
  
 Merged build triggered.


[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16699
  
Merged build started.


[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16699
  
**[Test build #72857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72857/testReport)** for PR 16699 at commit [`90d68a6`](https://github.com/apache/spark/commit/90d68a67815aceae63eaad7345477a082bb2febd).


[GitHub] spark issue #16699: [SPARK-18710][ML] Add offset in GLM

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16699
  
 Merged build triggered.


[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16715
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72856/
Test PASSed.


[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16715
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16715
  
**[Test build #72856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72856/testReport)** for PR 16715 at commit [`c64d50b`](https://github.com/apache/spark/commit/c64d50bd5a11f0f284e0964dcfce5a9040d1be99).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100970965
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/LSH.scala ---
@@ -222,17 +222,18 @@ private[ml] abstract class LSHModel[T <: LSHModel[T]]
   }
 
   /**
-   * Join two dataset to approximately find all pairs of rows whose distance are smaller than
+   * Join two datasets to approximately find all pairs of rows whose distance are smaller than
    * the threshold. If the [[outputCol]] is missing, the method will transform the data; if the
    * [[outputCol]] exists, it will use the [[outputCol]]. This allows caching of the transformed
    * data when necessary.
    *
    * @param datasetA One of the datasets to join.
    * @param datasetB Another dataset to join.
    * @param threshold The threshold for the distance of row pairs.
-   * @param distCol Output column for storing the distance between each result row and the key.
+   * @param distCol Output column for storing the distance between each pair of rows.
    * @return A joined dataset containing pairs of rows. The original rows are in columns
-   * "datasetA" and "datasetB", and a distCol is added to show the distance of each pair.
+   * "datasetA" and "datasetB", and a distCol is added to show the distance between each
--- End diff --

a column "distCol"


[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100971720
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -755,6 +945,103 @@ def maxAbs(self):
 
 
 @inherit_doc
+class MinHashLSH(JavaEstimator, LSHParams, HasInputCol, HasOutputCol, HasSeed,
+                 JavaMLReadable, JavaMLWritable):
+
+    """
+    .. note:: Experimental
+
+    LSH class for Jaccard distance.
+    The input can be dense or sparse vectors, but it is more efficient if it is sparse.
+    For example, `Vectors.sparse(10, [(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements
+    in the space. This set contains elements 2, 3, and 5. Also, any input vector must have at
+    least 1 non-zero index, and all non-zero values are treated as binary "1" values.
+
+    .. seealso:: `Wikipedia on MinHash <https://en.wikipedia.org/wiki/MinHash>`_
+
+    >>> from pyspark.ml.linalg import Vectors
+    >>> from pyspark.sql.functions import col
+    >>> data = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),),
+    ...         (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),),
+    ...         (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)]
+    >>> df = spark.createDataFrame(data, ["id", "features"])
+    >>> mh = MinHashLSH(inputCol="features", outputCol="hashes", seed=12345)
+    >>> model = mh.fit(df)
+    >>> model.transform(df).head()
+    Row(id=0, features=SparseVector(6, {0: 1.0, 1: 1.0, 2: 1.0}), hashes=[DenseVector([-1638925...
+    >>> data2 = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),),
+    ...          (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),),
+    ...          (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)]
+    >>> df2 = spark.createDataFrame(data2, ["id", "features"])
+    >>> key = Vectors.sparse(6, [1, 2], [1.0, 1.0])
+    >>> model.approxNearestNeighbors(df2, key, 1).collect()
+    [Row(id=5, features=SparseVector(6, {1: 1.0, 2: 1.0, 4: 1.0}), hashes=[DenseVector([-163892...
+    >>> model.approxSimilarityJoin(df, df2, 0.6, distCol="JaccardDistance").select(
+    ...     col("datasetA.id").alias("idA"),
+    ...     col("datasetB.id").alias("idB"),
+    ...     col("JaccardDistance")).show()
+    +---+---+---------------+
+    |idA|idB|JaccardDistance|
+    +---+---+---------------+
+    |  1|  4|            0.5|
+    |  0|  5|            0.5|
+    +---+---+---------------+
+    ...
+    >>> mhPath = temp_path + "/mh"
+    >>> mh.save(mhPath)
+    >>> mh2 = MinHashLSH.load(mhPath)
+    >>> mh2.getOutputCol() == mh.getOutputCol()
+    True
+    >>> modelPath = temp_path + "/mh-model"
+    >>> model.save(modelPath)
+    >>> model2 = MinHashLSHModel.load(modelPath)
--- End diff --

Let's add an equality check here and for BRP. For example, for IDFModel we have:

`loadedModel.transform(df).head().idf == model.transform(df).head().idf`
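For MinHashLSH, reusing `df`, `model`, and `modelPath` from the docstring above, the 
added lines might look like this (a sketch; the exact assertion is up to the PR):

```python
>>> model2 = MinHashLSHModel.load(modelPath)
>>> model.transform(df).head().hashes == model2.transform(df).head().hashes
True
```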



[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100970887
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -37,38 +43,45 @@ object MinHashLSHExample {
   (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
   (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
   (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val dfB = spark.createDataFrame(Seq(
   (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
   (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
   (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))
 
 val mh = new MinHashLSH()
-  .setNumHashTables(3)
-  .setInputCol("keys")
-  .setOutputCol("values")
+  .setNumHashTables(5)
+  .setInputCol("features")
+  .setOutputCol("hashes")
 
 val model = mh.fit(dfA)
 
 // Feature Transformation
+println("The hashed dataset where hashed values are stored in the column 'hashes':")
 model.transform(dfA).show()
-// Cache the transformed columns
-val transformedA = model.transform(dfA).cache()
-val transformedB = model.transform(dfB).cache()
 
-// Approximate similarity join
-model.approxSimilarityJoin(dfA, dfB, 0.6).show()
-model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
-// Self Join
-model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < datasetB.id").show()
+// Compute the locality sensitive hashes for the input rows, then perform approximate
+// similarity join.
+// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
+// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
+println("Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:")
+model.approxSimilarityJoin(dfA, dfB, 0.6)
+  .select(col("datasetA.id").alias("idA"),
+    col("datasetB.id").alias("idB"),
+    col("distCol").alias("JaccardDistance")).show()
--- End diff --

Pass `distCol` as a method parameter instead of aliasing the default column afterwards.
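That is, something like the following, reusing `model`, `dfA`, `dfB`, and `col` from the 
example (pyspark form shown for illustration; the Scala overload takes `distCol` as its 
fourth argument):

```python
model.approxSimilarityJoin(dfA, dfB, 0.6, distCol="JaccardDistance") \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance")).show()
```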


[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100971229
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -21,9 +21,15 @@ package org.apache.spark.examples.ml
 // $example on$
 import org.apache.spark.ml.feature.MinHashLSH
 import org.apache.spark.ml.linalg.Vectors
+import org.apache.spark.sql.functions._
--- End diff --

just import col here and above


[GitHub] spark issue #16909: [SPARK-13450] Introduce ExternalAppendOnlyUnsafeRowArray...

2017-02-13 Thread zhzhan
Github user zhzhan commented on the issue:

https://github.com/apache/spark/pull/16909
  
@tejasapatil Do you want to fix the BufferedRowIterator for WholeStageCodegenExec as 
well? For an inner join, the LinkedList `currentRows` would cause the same issue, as it 
buffers the rows from the inner join and takes more memory (probably double, if the left 
and right sides are of similar size). They could also share a similar iterator data 
structure.


[GitHub] spark issue #16913: [SPARK-15531] [DEPLOY] Complement launcher JVM memory se...

2017-02-13 Thread Pashugan
Github user Pashugan commented on the issue:

https://github.com/apache/spark/pull/16913
  
There must be some misunderstanding. Could you have a look at my micro-patch? It has 
nothing to do with the driver and its options. In fact, it fixes the call to the 
"launcher library", which is used to fill a bash array that is in turn used to run the 
actual driver. With that in mind, my explanations above should hopefully become as clear 
as day. :)


[GitHub] spark issue #16921: [SPARK-19589][SQL] Removal of SQLGEN files

2017-02-13 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/16921
  
Thank you for doing this; it looks good to me.


[GitHub] spark pull request #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and un...

2017-02-13 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/16919


[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16919
  
LGTM. Merging to master!


[GitHub] spark pull request #16901: [SPARK-19565] Improve DAGScheduler tests.

2017-02-13 Thread jinxing64
Github user jinxing64 commented on a diff in the pull request:

https://github.com/apache/spark/pull/16901#discussion_r100968529
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala ---
@@ -2161,6 +2161,48 @@ class DAGSchedulerSuite extends SparkFunSuite with LocalSparkContext with Timeou
 }
   }
 
+  test("After fetching failed, success of old attempt of stage should be taken as valid.") {
+val rddA = new MyRDD(sc, 2, Nil)
+val shuffleDepA = new ShuffleDependency(rddA, new HashPartitioner(2))
+val shuffleIdA = shuffleDepA.shuffleId
+
+val rddB = new MyRDD(sc, 2, List(shuffleDepA))
+val shuffleDepB = new ShuffleDependency(rddB, new HashPartitioner(2))
+
+val rddC = new MyRDD(sc, 2, List(shuffleDepB))
+
+submit(rddC, Array(0, 1))
+assert(taskSets(0).stageId === 0 && taskSets(0).stageAttemptId === 0)
+
+complete(taskSets(0), Seq(
+  (Success, makeMapStatus("hostA", 2)),
+  (Success, makeMapStatus("hostA", 2))))
+
+// Fetch failed on hostA for task(partitionId=0) and success on hostB for task(partitionId=1)
+complete(taskSets(1), Seq(
+  (FetchFailed(makeBlockManagerId("hostA"), shuffleIdA, 0, 0,
+"Fetch failure of task: stageId=1, stageAttempt=0, partitionId=0"), null),
+  (Success, makeMapStatus("hostB", 2))))
+
+scheduler.resubmitFailedStages()
+assert(taskSets(2).stageId === 0 && taskSets(2).stageAttemptId === 1)
+complete(taskSets(2), Seq(
+  (Success, makeMapStatus("hostB", 2)),
+  (Success, makeMapStatus("hostB", 2))))
+
+assert(taskSets(3).stageId === 1 && taskSets(2).stageAttemptId === 1)
+runEvent(makeCompletionEvent(
+  taskSets(3).tasks(0), Success, makeMapStatus("hostB", 2)))
+
+// Thanks to the success from old attempt of stage(stageId=1), there's no pending
--- End diff --

Yes, the success should be moved. Sorry about that; I'll rectify it.


[GitHub] spark issue #16901: [SPARK-19565] Improve DAGScheduler tests.

2017-02-13 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16901
  
@kayousterhout 
I've refined accordingly. Sorry for the stupid mistake I made. Please take 
another look at this : )


[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-13 Thread jinxing64
Github user jinxing64 commented on the issue:

https://github.com/apache/spark/pull/16620
  
@kayousterhout 
I've refined accordingly; please take another look :)


[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16620
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16919
  
Merged build finished. Test PASSed.


[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16919
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72854/
Test PASSed.


[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16620
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72849/
Test PASSed.


[GitHub] spark issue #16620: [SPARK-19263] DAGScheduler should avoid sending conflict...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16620
  
**[Test build #72849 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72849/testReport)** for PR 16620 at commit [`ab8d13e`](https://github.com/apache/spark/commit/ab8d13efaf12182517d3b311d74b2f0a8d2fbef8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16919
  
**[Test build #72854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72854/testReport)** for PR 16919 at commit [`cab668e`](https://github.com/apache/spark/commit/cab668e28225ae7484e83a22d359bd5b962d9d31).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16715
  
**[Test build #72856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72856/testReport)** for PR 16715 at commit [`c64d50b`](https://github.com/apache/spark/commit/c64d50bd5a11f0f284e0964dcfce5a9040d1be99).


[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16715
  
Merged build started.


[GitHub] spark issue #16715: [Spark-18080][ML][PYTHON] Python API & Examples for Loca...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16715
  
 Merged build triggered.


[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966548
  
--- Diff: docs/ml-features.md ---
@@ -1558,6 +1558,15 @@ for more details on the API.
 
{% include_example java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
 
+
+
+
+Refer to the [BucketedRandomProjectionLSH Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.BucketedRandomProjectionLSH)
+for more details on the API.
+
+{% include_example python/ml/bucketed_random_projection_lsh.py %}
--- End diff --

Sorry, I forgot to retest after renaming the Python examples. Thanks for the information.


[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966555
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
         return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+    """
+    Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+    """
+
+    numHashTables = Param(Params._dummy(), "numHashTables", "number of hash tables, where " +
+                          "increasing number of hash tables lowers the false negative rate, " +
+                          "and decreasing it improves the running performance.",
+                          typeConverter=TypeConverters.toInt)
+
+    def __init__(self):
+        super(LSHParams, self).__init__()
+
+    @since("2.2.0")
+    def setNumHashTables(self, value):
+        """
+        Sets the value of :py:attr:`numHashTables`.
+        """
+        return self._set(numHashTables=value)
+
+    @since("2.2.0")
+    def getNumHashTables(self):
+        """
+        Gets the value of numHashTables or its default value.
+        """
+        return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+    """
+    Mixin for Locality Sensitive Hashing (LSH) models.
+    """
+
+    @since("2.2.0")
+    def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, distCol="distCol"):
+        """
+        Given a large dataset and an item, approximately find at most k items which have the
+        closest distance to the item. If the :py:attr:`outputCol` is missing, the method will
+        transform the data; if the :py:attr:`outputCol` exists, it will use that. This allows
+        caching of the transformed data when necessary.
+
+        .. note:: This method is experimental and will likely change behavior in the next release.
+
+        :param dataset: The dataset to search for nearest neighbors of the key.
+        :param key: Feature vector representing the item to search for.
+        :param numNearestNeighbors: The maximum number of nearest neighbors.
+        :param distCol: Output column for storing the distance between each result row and the key.
+                        Use "distCol" as default value if it's not specified.
+        :return: A dataset containing at most k items closest to the key. A distCol is added
+                 to show the distance between each row and the key.
+        """
+        return self._call_java("approxNearestNeighbors", dataset, key, numNearestNeighbors,
+                               distCol)
+
+    @since("2.2.0")
+    def approxSimilarityJoin(self, datasetA, datasetB, threshold, distCol="distCol"):
+        """
+        Join two datasets to approximately find all pairs of rows whose distance are smaller than
+        the threshold. If the :py:attr:`outputCol` is missing, the method will transform the data;
+        if the :py:attr:`outputCol` exists, it will use that. This allows caching of the
+        transformed data when necessary.
+
+        :param datasetA: One of the datasets to join.
+        :param datasetB: Another dataset to join.
+        :param threshold: The threshold for the distance of row pairs.
+        :param distCol: Output column for storing the distance between each result row and the key.
+                        Use "distCol" as default value if it's not specified.
+        :return: A joined dataset containing pairs of rows. The original rows are in columns
+                 "datasetA" and "datasetB", and a distCol is added to show the distance of
--- End diff --

Done.


[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966541
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -37,38 +38,44 @@ object MinHashLSHExample {
   (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
   (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
   (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val dfB = spark.createDataFrame(Seq(
   (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
   (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
   (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val key = Vectors.sparse(6, Seq((1, 1.0), (3, 1.0)))
 
 val mh = new MinHashLSH()
-  .setNumHashTables(3)
-  .setInputCol("keys")
-  .setOutputCol("values")
+  .setNumHashTables(5)
+  .setInputCol("features")
+  .setOutputCol("hashes")
 
 val model = mh.fit(dfA)
 
 // Feature Transformation
+println("The hashed dataset where hashed values are stored in the column 'hashes':")
 model.transform(dfA).show()
-// Cache the transformed columns
-val transformedA = model.transform(dfA).cache()
-val transformedB = model.transform(dfB).cache()
 
-// Approximate similarity join
-model.approxSimilarityJoin(dfA, dfB, 0.6).show()
-model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
-// Self Join
-model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < datasetB.id").show()
+// Compute the locality sensitive hashes for the input rows, then perform approximate
+// similarity join.
+// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
+// `model.approxSimilarityJoin(transformedA, transformedB, 0.6)`
+println("Approximately joining dfA and dfB on Jaccard distance smaller than 0.6:")
+model.approxSimilarityJoin(dfA, dfB, 0.6)
+  .select(col("datasetA.id").alias("idA"),
+    col("datasetB.id").alias("idB"),
+    col("distCol").alias("JaccardDistance")).show()
 
-// Approximate nearest neighbor search
+// Compute the locality sensitive hashes for the input rows, then perform approximate nearest
+// neighbor search.
+// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
+// `model.approxNearestNeighbors(transformedA, key, 2)`
+// It may return less than 2 rows because of lack of elements in the hash buckets.
--- End diff --

Done.


[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966552
  
--- Diff: examples/src/main/python/ml/bucketed_random_projection_lsh_example.py ---
@@ -0,0 +1,81 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+from __future__ import print_function
+
+# $example on$
+from pyspark.ml.feature import BucketedRandomProjectionLSH
+from pyspark.ml.linalg import Vectors
+from pyspark.sql.functions import col
+# $example off$
+from pyspark.sql import SparkSession
+
+"""
+An example demonstrating BucketedRandomProjectionLSH.
+Run with:
+  bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh_example.py
--- End diff --

Added in 4 places.


[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966561
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala ---
@@ -38,40 +39,45 @@ object BucketedRandomProjectionLSHExample {
   (1, Vectors.dense(1.0, -1.0)),
   (2, Vectors.dense(-1.0, -1.0)),
   (3, Vectors.dense(-1.0, 1.0))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val dfB = spark.createDataFrame(Seq(
   (4, Vectors.dense(1.0, 0.0)),
   (5, Vectors.dense(-1.0, 0.0)),
   (6, Vectors.dense(0.0, 1.0)),
   (7, Vectors.dense(0.0, -1.0))
-)).toDF("id", "keys")
+)).toDF("id", "features")
 
 val key = Vectors.dense(1.0, 0.0)
 
 val brp = new BucketedRandomProjectionLSH()
   .setBucketLength(2.0)
   .setNumHashTables(3)
-  .setInputCol("keys")
-  .setOutputCol("values")
+  .setInputCol("features")
+  .setOutputCol("hashes")
 
 val model = brp.fit(dfA)
 
 // Feature Transformation
+println("The hashed dataset where hashed values are stored in the column 'hashes':")
 model.transform(dfA).show()
-// Cache the transformed columns
-val transformedA = model.transform(dfA).cache()
-val transformedB = model.transform(dfB).cache()
 
-// Approximate similarity join
-model.approxSimilarityJoin(dfA, dfB, 1.5).show()
-model.approxSimilarityJoin(transformedA, transformedB, 1.5).show()
-// Self Join
-model.approxSimilarityJoin(dfA, dfA, 2.5).filter("datasetA.id < datasetB.id").show()
+// Compute the locality sensitive hashes for the input rows, then perform approximate
+// similarity join.
+// We could avoid computing hashes by passing in the already-transformed dataset, e.g.
+// `model.approxSimilarityJoin(transformedA, transformedB, 1.5)`
+println("Approximately joining dfA and dfB on Euclidean distance smaller than 1.5:")
+model.approxSimilarityJoin(dfA, dfB, 1.5)
+  .select(col("datasetA.id").alias("idA"),
+    col("datasetB.id").alias("idB"),
+    col("distCol").alias("EuclideanDistance")).show()
--- End diff --

Done in 6 places.
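For reference, a minimal PySpark sketch of the same join, assuming the `dfA`/`dfB` DataFrames and the fitted `model` from the Scala example above (column names as in the diff):

```python
from pyspark.sql.functions import col

# Show the hashed dataset; the hash values land in the 'hashes' column.
model.transform(dfA).show()

# Approximate join on Euclidean distance smaller than 1.5; the appended
# distance column defaults to 'distCol'.
model.approxSimilarityJoin(dfA, dfB, 1.5) \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("distCol").alias("EuclideanDistance")).show()
```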





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966545
  
--- Diff: examples/src/main/java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java ---
@@ -35,6 +35,8 @@
 import org.apache.spark.sql.types.Metadata;
 import org.apache.spark.sql.types.StructField;
 import org.apache.spark.sql.types.StructType;
+
+import static org.apache.spark.sql.functions.*;
--- End diff --

Done.





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966554
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose distance is smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
--- End diff --

Fixed.
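As the docstring notes, `distCol` only renames the appended distance column; a sketch under the same example setup (`dfA`, `dfB`, and a fitted `model` assumed):

```python
# Override the default 'distCol' name and read the distance back out.
pairs = model.approxSimilarityJoin(dfA, dfB, 1.5, distCol="EuclideanDistance")
pairs.select("datasetA.id", "datasetB.id", "EuclideanDistance").show()
```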





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966534
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
--- End diff --

Removed in 4 places.
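A quick sketch of the `LSHParams` accessors above on the concrete `BucketedRandomProjectionLSH` estimator (the `bucketLength` value is illustrative):

```python
from pyspark.ml.feature import BucketedRandomProjectionLSH

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0)
brp.setNumHashTables(3)  # more tables: lower false negative rate, slower runs
assert brp.getNumHashTables() == 3
```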





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966539
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.scala ---
@@ -111,8 +111,8 @@ class BucketedRandomProjectionLSHModel private[ml](
  * Euclidean distance metrics.
  *
  * The input is dense or sparse vectors, each of which represents a point 
in the Euclidean
- * distance space. The output will be vectors of configurable dimension. 
Hash values in the
- * same dimension are calculated by the same hash function.
+ * distance space. The output will be vectors of configurable dimension. 
Hash values in the same
--- End diff --

Reverted





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966530
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
--- End diff --

Done.
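A sketch of the contract the return doc describes: at most k rows come back, with `distCol` appended (assumes the `dfA` and fitted `model` from the earlier example):

```python
from pyspark.ml.linalg import Vectors

key = Vectors.dense([1.0, 0.0])
nearest = model.approxNearestNeighbors(dfA, key, 2)
assert nearest.count() <= 2             # at most k items
nearest.select("id", "distCol").show()  # distance to the key per row
```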





[GitHub] spark pull request #16715: [Spark-18080][ML][PYTHON] Python API & Examples f...

2017-02-13 Thread Yunni
Github user Yunni commented on a diff in the pull request:

https://github.com/apache/spark/pull/16715#discussion_r100966537
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -120,6 +122,198 @@ def getThreshold(self):
 return self.getOrDefault(self.threshold)
 
 
+class LSHParams(Params):
+"""
+Mixin for Locality Sensitive Hashing (LSH) algorithm parameters.
+"""
+
+numHashTables = Param(Params._dummy(), "numHashTables", "number of 
hash tables, where " +
+  "increasing number of hash tables lowers the 
false negative rate, " +
+  "and decreasing it improves the running 
performance.",
+  typeConverter=TypeConverters.toInt)
+
+def __init__(self):
+super(LSHParams, self).__init__()
+
+@since("2.2.0")
+def setNumHashTables(self, value):
+"""
+Sets the value of :py:attr:`numHashTables`.
+"""
+return self._set(numHashTables=value)
+
+@since("2.2.0")
+def getNumHashTables(self):
+"""
+Gets the value of numHashTables or its default value.
+"""
+return self.getOrDefault(self.numHashTables)
+
+
+class LSHModel(JavaModel):
+"""
+Mixin for Locality Sensitive Hashing (LSH) models.
+"""
+
+@since("2.2.0")
+def approxNearestNeighbors(self, dataset, key, numNearestNeighbors, 
distCol="distCol"):
+"""
+Given a large dataset and an item, approximately find at most k 
items which have the
+closest distance to the item. If the :py:attr:`outputCol` is 
missing, the method will
+transform the data; if the :py:attr:`outputCol` exists, it will 
use that. This allows
+caching of the transformed data when necessary.
+
+.. note:: This method is experimental and will likely change 
behavior in the next release.
+
+:param dataset: The dataset to search for nearest neighbors of the 
key.
+:param key: Feature vector representing the item to search for.
+:param numNearestNeighbors: The maximum number of nearest 
neighbors.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A dataset containing at most k items closest to the key. 
A distCol is added
+ to show the distance between each row and the key.
+"""
+return self._call_java("approxNearestNeighbors", dataset, key, 
numNearestNeighbors,
+   distCol)
+
+@since("2.2.0")
+def approxSimilarityJoin(self, datasetA, datasetB, threshold, 
distCol="distCol"):
+"""
+Join two datasets to approximately find all pairs of rows whose distance is smaller than
+the threshold. If the :py:attr:`outputCol` is missing, the method 
will transform the data;
+if the :py:attr:`outputCol` exists, it will use that. This allows 
caching of the
+transformed data when necessary.
+
+:param datasetA: One of the datasets to join.
+:param datasetB: Another dataset to join.
+:param threshold: The threshold for the distance of row pairs.
+:param distCol: Output column for storing the distance between 
each result row and the key.
+Use "distCol" as default value if it's not 
specified.
+:return: A joined dataset containing pairs of rows. The original 
rows are in columns
+"datasetA" and "datasetB", and a distCol is added to show 
the distance of
--- End diff --

Done.
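The caching behavior the docstring mentions can be sketched like this (same assumed `model`, `dfA`, `dfB`): pre-transform once, cache, and the join reuses the existing `outputCol` instead of re-hashing:

```python
# Hash once, cache, then join on the cached results.
transformedA = model.transform(dfA).cache()
transformedB = model.transform(dfB).cache()
model.approxSimilarityJoin(transformedA, transformedB, 1.5).show()
```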





[GitHub] spark issue #16672: [SPARK-19329][SQL]Reading from or writing to a datasourc...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16672
  
**[Test build #72855 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72855/testReport)**
 for PR 16672 at commit 
[`0d947a5`](https://github.com/apache/spark/commit/0d947a55a80ecc63eb15092c29b2c44aeeb197e5).





[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16919
  
**[Test build #72854 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72854/testReport)**
 for PR 16919 at commit 
[`cab668e`](https://github.com/apache/spark/commit/cab668e28225ae7484e83a22d359bd5b962d9d31).





[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16395
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16672: [SPARK-19329][SQL]Reading from or writing to a datasourc...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16672
  
 Merged build triggered.





[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16395
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72847/
Test PASSed.





[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16919
  
 Merged build triggered.





[GitHub] spark issue #16672: [SPARK-19329][SQL]Reading from or writing to a datasourc...

2017-02-13 Thread windpiger
Github user windpiger commented on the issue:

https://github.com/apache/spark/pull/16672
  
@gatorsmile I'd like to take that. 
https://issues.apache.org/jira/browse/SPARK-19583 Thanks~





[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16919
  
Merged build started.





[GitHub] spark issue #16672: [SPARK-19329][SQL]Reading from or writing to a datasourc...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16672
  
Merged build started.





[GitHub] spark issue #16395: [SPARK-17075][SQL] implemented filter estimation

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16395
  
**[Test build #72847 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72847/testReport)**
 for PR 16395 at commit 
[`662acc0`](https://github.com/apache/spark/commit/662acc005f3eeec0e8e475e06e57a6c7296c4a79).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16919: [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTa...

2017-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16919
  
test this please





[GitHub] spark issue #16921: [SPARK-19589][SQL] Removal of SQLGEN files

2017-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16921
  
cc @hvanhovell @jiangxb1987 





[GitHub] spark issue #15193: [SQL]RowBasedKeyValueBatch reuse valueRow too

2017-02-13 Thread ooq
Github user ooq commented on the issue:

https://github.com/apache/spark/pull/15193
  
@yaooqinn Do you have any benchmarks on the performance difference? I think 
pointTo() is pretty cheap. And does the patch pass the tests? I think valueRow 
is not updated correctly in your patch.





[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...

2017-02-13 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/16739
  
Let me rewrite the test cases in Scala.

```Scala
val df = spark.range(0, 1, 1, 5)
assert(df.rdd.getNumPartitions == 5)
assert(df.coalesce(3).rdd.getNumPartitions == 3)
assert(df.coalesce(6).rdd.getNumPartitions == 5)

val df1 = df.coalesce(3)
assert(df1.rdd.getNumPartitions == 3)
assert(df1.coalesce(6).rdd.getNumPartitions == 5)
assert(df1.coalesce(4).rdd.getNumPartitions == 4)
assert(df1.coalesce(2).rdd.getNumPartitions == 2)

val df2 = df.repartition(10)
assert(df2.rdd.getNumPartitions == 10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
```

The question is why the second assertion gives `5` instead of `10`. If we run explain, we get the following plan:
```
== Parsed Logical Plan ==
Repartition 13, false
+- Repartition 10, true
   +- Range (0, 1, step=1, splits=Some(5))

== Analyzed Logical Plan ==
id: bigint
Repartition 13, false
+- Repartition 10, true
   +- Range (0, 1, step=1, splits=Some(5))

== Optimized Logical Plan ==
Repartition 13, false
+- Range (0, 1, step=1, splits=Some(5))

== Physical Plan ==
Coalesce 13
+- *Range (0, 1, step=1, splits=Some(5))
```

Ok... `Repartition 10, true` is removed by our Optimizer rule 
`CollapseRepartition`. It is a bug, I think. Your question is valid. Let me fix 
it. 
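The same checks are easy to reproduce from PySpark; a minimal sketch, assuming a running `SparkSession` named `spark`:

```python
# coalesce can only lower the partition count; asking for more is a no-op.
df = spark.range(0, 1, 1, 5)                 # 5 initial partitions
assert df.rdd.getNumPartitions() == 5
assert df.coalesce(3).rdd.getNumPartitions() == 3
assert df.coalesce(6).rdd.getNumPartitions() == 5   # capped at current count

df2 = df.repartition(10)                     # repartition shuffles, so it can grow
assert df2.rdd.getNumPartitions() == 10
```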





[GitHub] spark issue #16915: [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN s...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16915
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72846/
Test PASSed.





[GitHub] spark issue #16915: [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN s...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16915
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16915: [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN s...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16915
  
**[Test build #72846 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72846/testReport)**
 for PR 16915 at commit 
[`3dd57fd`](https://github.com/apache/spark/commit/3dd57fd7017e173cb00a53280a41783be634d6fe).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16920: [MINOR][DOCS] Add jira url in pull request description

2017-02-13 Thread uncleGen
Github user uncleGen commented on the issue:

https://github.com/apache/spark/pull/16920
  
cc @srowen 





[GitHub] spark pull request #16672: [SPARK-19329][SQL]Reading from or writing to a da...

2017-02-13 Thread windpiger
Github user windpiger commented on a diff in the pull request:

https://github.com/apache/spark/pull/16672#discussion_r100963830
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala ---
@@ -1816,4 +1816,127 @@ class DDLSuite extends QueryTest with SharedSQLContext with BeforeAndAfterEach {
   }
 }
   }
+
+  test("insert data to a data source table which has a not existed 
location should succeed") {
+withTable("t") {
+  withTempDir { dir =>
+spark.sql(
+  s"""
+  |CREATE TABLE t(a string, b int)
+  |USING parquet
+  |OPTIONS(path "file:${dir.getCanonicalPath}")
+   """.stripMargin)
+val table = 
spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
+val expectedPath = s"file:${dir.getAbsolutePath.stripSuffix("/")}"
+assert(table.location.stripSuffix("/") == expectedPath)
+
+dir.delete
+val tableLocFile = new File(table.location.stripPrefix("file:"))
+assert(!tableLocFile.exists)
+spark.sql("INSERT INTO TABLE t SELECT 'c', 1")
+assert(tableLocFile.exists)
+checkAnswer(spark.table("t"), Row("c", 1) :: Nil)
+
+Utils.deleteRecursively(dir)
+assert(!tableLocFile.exists)
+spark.sql("INSERT OVERWRITE TABLE t SELECT 'c', 1")
+assert(tableLocFile.exists)
+checkAnswer(spark.table("t"), Row("c", 1) :: Nil)
+
+val newDir = dir.getAbsolutePath.stripSuffix("/") + "/x"
+val newDirFile = new File(newDir)
+spark.sql(s"ALTER TABLE t SET LOCATION '$newDir'")
+spark.sessionState.catalog.refreshTable(TableIdentifier("t"))
+
+val table1 = 
spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
+assert(table1.location == newDir)
+assert(!newDirFile.exists)
+
+spark.sql("INSERT INTO TABLE t SELECT 'c', 1")
+assert(newDirFile.exists)
+checkAnswer(spark.table("t"), Row("c", 1) :: Nil)
+  }
+}
+  }
+
+  test("insert into a data source table with no existed partition location 
should succeed") {
+withTable("t") {
+  withTempDir { dir =>
+spark.sql(
+  s"""
+  |CREATE TABLE t(a int, b int, c int, d int)
+  |USING parquet
+  |PARTITIONED BY(a, b)
+  |LOCATION "file:${dir.getCanonicalPath}"
+   """.stripMargin)
+val table = 
spark.sessionState.catalog.getTableMetadata(TableIdentifier("t"))
+val expectedPath = s"file:${dir.getAbsolutePath.stripSuffix("/")}"
+assert(table.location.stripSuffix("/") == expectedPath)
+
+spark.sql("INSERT INTO TABLE t PARTITION(a=1, b=2) SELECT 3, 4")
+checkAnswer(spark.table("t"), Row(3, 4, 1, 2) :: Nil)
+
+val partLoc = new File(s"${dir.getAbsolutePath}/a=1")
+Utils.deleteRecursively(partLoc)
+assert(!partLoc.exists())
+// insert overwrite into a partition which location has been 
deleted.
+spark.sql("INSERT OVERWRITE TABLE t PARTITION(a=1, b=2) SELECT 7, 
8")
+assert(partLoc.exists())
+checkAnswer(spark.table("t"), Row(7, 8, 1, 2) :: Nil)
+
+// TODO:insert into a partition after alter the partition location 
by alter command
--- End diff --

I found there is a bug in this situation, and I created a JIRA:
https://issues.apache.org/jira/browse/SPARK-19577

Shall we just forbid this situation or fix it?





[GitHub] spark issue #16921: [SPARK-19589][SQL] Removal of SQLGEN files

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16921
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16921: [SPARK-19589][SQL] Removal of SQLGEN files

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16921
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72848/
Test PASSed.





[GitHub] spark issue #16920: [MINOR][DOCS] Add jira url in pull request description

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16920
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16920: [MINOR][DOCS] Add jira url in pull request description

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16920
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72845/
Test FAILed.





[GitHub] spark issue #16920: [MINOR][DOCS] Add jira url in pull request description

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16920
  
**[Test build #72845 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72845/testReport)**
 for PR 16920 at commit 
[`e61125a`](https://github.com/apache/spark/commit/e61125acb60944e48a5be4b8218ae925e1b543b6).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16921: [SPARK-19589][SQL] Removal of SQLGEN files

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16921
  
**[Test build #72848 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72848/testReport)**
 for PR 16921 at commit 
[`e59cf4c`](https://github.com/apache/spark/commit/e59cf4c4d4d324475c7588317928b9dbfab84193).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16922
  
**[Test build #72852 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72852/testReport)**
 for PR 16922 at commit 
[`c5e46fb`](https://github.com/apache/spark/commit/c5e46fb88dbd8f6c06829f47d4cc34b131bc6472).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72852/
Test PASSed.





[GitHub] spark issue #16674: [SPARK-19331][SQL][TESTS] Improve the test coverage of S...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16674
  
**[Test build #72853 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72853/testReport)**
 for PR 16674 at commit 
[`4872918`](https://github.com/apache/spark/commit/48729186467e9b9d19ef86e1b635a9746735bfa0).





[GitHub] spark issue #16674: [SPARK-19331][SQL][TESTS] Improve the test coverage of S...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16674
  
 Merged build triggered.





[GitHub] spark issue #16674: [SPARK-19331][SQL][TESTS] Improve the test coverage of S...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16674
  
Merged build started.





[GitHub] spark issue #16818: [SPARK-19451][SQL][Core] Underlying integer overflow in ...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16818
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72844/
Test PASSed.





[GitHub] spark issue #16818: [SPARK-19451][SQL][Core] Underlying integer overflow in ...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16818
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16818: [SPARK-19451][SQL][Core] Underlying integer overflow in ...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16818
  
**[Test build #72844 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72844/testReport)**
 for PR 16818 at commit 
[`7ae4e48`](https://github.com/apache/spark/commit/7ae4e4845b5049ed5df68b57c340cf4c347f9d5e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class ValuePreceding(value: Int) extends FrameBoundary `
  * `case class ValueFollowing(value: Int) extends FrameBoundary `





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16922
  
**[Test build #72852 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72852/testReport)**
 for PR 16922 at commit 
[`c5e46fb`](https://github.com/apache/spark/commit/c5e46fb88dbd8f6c06829f47d4cc34b131bc6472).





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Merged build started.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
 Merged build triggered.





[GitHub] spark pull request #16918: [SPARK-19584] [SS] [DOCS] update structured strea...

2017-02-13 Thread zsxwing
Github user zsxwing commented on a diff in the pull request:

https://github.com/apache/spark/pull/16918#discussion_r100961000
  
--- Diff: docs/structured-streaming-kafka-integration.md ---
@@ -119,6 +119,124 @@ ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
 
 
 
+### Creating a Kafka Source Batch
+If you have a use case that is better suited to batch processing,
+you can create a Dataset/DataFrame for a defined range of offsets.
+
+
+
+{% highlight scala %}
+
+// Subscribe to 1 topic defaults to the earliest and latest offsets
+val ds1 = spark
+  .read
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1")
+  .load()
+ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
+  .as[(String, String)]
+
+// Subscribe to multiple topics, specifying explicit Kafka offsets
+val ds2 = spark
+  .read
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1,topic2")
+  .option("startingOffsets", 
"""{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
+  .option("endingOffsets", 
"""{"topic1":{"0":50,"1":-1},"topic2":{"0":-1}}""")
+  .load()
+ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
+  .as[(String, String)]
+
+// Subscribe to a pattern, at the earliest and latest offsets
+val ds3 = spark
+  .read
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribePattern", "topic.*")
+  .option("startingOffsets", "earliest")
+  .option("endingOffsets", "latest")
+  .load()
+ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
+  .as[(String, String)]
+
+{% endhighlight %}
+
+
+{% highlight java %}
+
+// Subscribe to 1 topic defaults to the earliest and latest offsets
+Dataset ds1 = spark
+  .read()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1")
+  .load();
+ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
+
+// Subscribe to multiple topics, specifying explicit Kafka offsets
+Dataset ds2 = spark
+  .read()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribe", "topic1,topic2")
+  .option("startingOffsets", 
"{\"topic1\":{\"0\":23,\"1\":-2},\"topic2\":{\"0\":-2}}")
+  .option("endingOffsets", 
"{\"topic1\":{\"0\":50,\"1\":-1},\"topic2\":{\"0\":-1}}")
+  .load();
+ds2.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
+
+// Subscribe to a pattern, at the earliest and latest offsets
+Dataset ds3 = spark
+  .read()
+  .format("kafka")
+  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
+  .option("subscribePattern", "topic.*")
+  .option("startingOffsets", "earliest")
+  .option("endingOffsets", "latest")
+  .load();
+ds3.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");
+
+{% endhighlight %}
+
+
+{% highlight python %}
+
+# Subscribe to 1 topic defaults to the earliest and latest offsets
+ds1 = spark \
+  .read
--- End diff --

You need to add `\` to all lines.
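For clarity, this is how the first Python snippet reads once every line carries the continuation (hosts and topic are the doc's placeholders):

```python
ds1 = spark \
  .read \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .load()
```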





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16922
  
**[Test build #72851 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72851/testReport)**
 for PR 16922 at commit 
[`9ce7cb8`](https://github.com/apache/spark/commit/9ce7cb867c8e1986bc75500fcbc4c5f0e4103f06).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72851/
Test FAILed.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
 Merged build triggered.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16922
  
**[Test build #72851 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72851/testReport)**
 for PR 16922 at commit 
[`9ce7cb8`](https://github.com/apache/spark/commit/9ce7cb867c8e1986bc75500fcbc4c5f0e4103f06).





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Merged build started.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16922
  
**[Test build #72850 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72850/testReport)** for PR 16922 at commit [`25bdc0f`](https://github.com/apache/spark/commit/25bdc0f09f763b993ff78cb6f86a4a567eae4872).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72850/
Test FAILed.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16922
  
**[Test build #72850 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72850/testReport)** for PR 16922 at commit [`25bdc0f`](https://github.com/apache/spark/commit/25bdc0f09f763b993ff78cb6f86a4a567eae4872).





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
Merged build started.





[GitHub] spark issue #16922: [SPARK-19590][pyspark][ML] Update the document for Quant...

2017-02-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16922
  
 Merged build triggered.
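The truncated subject line refers to SPARK-19590, an update to the PySpark `QuantileDiscretizer` documentation (the class name is inferred from the truncated "Quant..."; treat it as an assumption). For context, a minimal usage sketch follows; the data and column names are illustrative.

```python
# Illustrative QuantileDiscretizer usage; data and column names are made up.
from pyspark.sql import SparkSession
from pyspark.ml.feature import QuantileDiscretizer

spark = SparkSession.builder.appName("QuantileDiscretizerExample").getOrCreate()

df = spark.createDataFrame(
    [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)], ["id", "hour"])

# Bin the continuous "hour" column into 3 quantile-based buckets.
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result")
discretizer.fit(df).transform(df).show()

spark.stop()
```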




