[GitHub] spark pull request #15620: [SPARK-18091] [SQL] Deep if expressions cause Gen...

2016-11-30 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/15620#discussion_r90397578
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala ---
@@ -64,19 +64,74 @@ case class If(predicate: Expression, trueValue: Expression, falseValue: Expressi
     val trueEval = trueValue.genCode(ctx)
     val falseEval = falseValue.genCode(ctx)
 
-    ev.copy(code = s"""
-      ${condEval.code}
-      boolean ${ev.isNull} = false;
-      ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};
-      if (!${condEval.isNull} && ${condEval.value}) {
-        ${trueEval.code}
-        ${ev.isNull} = ${trueEval.isNull};
-        ${ev.value} = ${trueEval.value};
-      } else {
-        ${falseEval.code}
-        ${ev.isNull} = ${falseEval.isNull};
-        ${ev.value} = ${falseEval.value};
-      }""")
+    // place generated code of condition, true value and false value in separate methods if
+    // their code combined is large
+    val combinedLength = condEval.code.length + trueEval.code.length + falseEval.code.length
+    val generatedCode = if (combinedLength > 1024 &&
+      // Split these expressions only if they are created from a row object
+      (ctx.INPUT_ROW != null && ctx.currentVars == null)) {
+
+      val (condFuncName, condGlobalIsNull, condGlobalValue) =
+        createAndAddFunction(ctx, condEval, predicate.dataType, "evalIfCondExpr")
+      val (trueFuncName, trueGlobalIsNull, trueGlobalValue) =
+        createAndAddFunction(ctx, trueEval, trueValue.dataType, "evalIfTrueExpr")
+      val (falseFuncName, falseGlobalIsNull, falseGlobalValue) =
+        createAndAddFunction(ctx, falseEval, falseValue.dataType, "evalIfFalseExpr")
+      s"""
+        $condFuncName(${ctx.INPUT_ROW});
+        boolean ${ev.isNull} = false;
+        ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};
+        if (!$condGlobalIsNull && $condGlobalValue) {
+          $trueFuncName(${ctx.INPUT_ROW});
+          ${ev.isNull} = $trueGlobalIsNull;
+          ${ev.value} = $trueGlobalValue;
+        } else {
+          $falseFuncName(${ctx.INPUT_ROW});
+          ${ev.isNull} = $falseGlobalIsNull;
+          ${ev.value} = $falseGlobalValue;
+        }
+      """
+    }
+    else {
+      s"""
+        ${condEval.code}
+        boolean ${ev.isNull} = false;
+        ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};
+        if (!${condEval.isNull} && ${condEval.value}) {
+          ${trueEval.code}
+          ${ev.isNull} = ${trueEval.isNull};
+          ${ev.value} = ${trueEval.value};
+        } else {
+          ${falseEval.code}
+          ${ev.isNull} = ${falseEval.isNull};
+          ${ev.value} = ${falseEval.value};
+        }
+      """
+    }
+
+    ev.copy(code = generatedCode)
+  }
+
+  private def createAndAddFunction(ctx: CodegenContext,
--- End diff --

the code style is still wrong, see https://github.com/apache/spark/pull/15620#discussion_r90185562
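
For context, a minimal sketch of what `createAndAddFunction` plausibly does, inferred from its call sites in the diff above; the body is an assumption built from the standard `CodegenContext` helpers (`freshName`, `addMutableState`, `addNewFunction`), not the PR's actual implementation:

// Hedged sketch: move an expression's generated evaluation into its own
// method, exposing the result through global (mutable-state) variables.
private def createAndAddFunction(
    ctx: CodegenContext,
    ev: ExprCode,
    dataType: DataType,
    baseFuncName: String): (String, String, String) = {
  val globalIsNull = ctx.freshName(baseFuncName + "IsNull")
  val globalValue = ctx.freshName(baseFuncName + "Value")
  // Global variables carry the result across the method-call boundary.
  ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;")
  ctx.addMutableState(ctx.javaType(dataType), globalValue,
    s"$globalValue = ${ctx.defaultValue(dataType)};")
  val funcName = ctx.freshName(baseFuncName)
  ctx.addNewFunction(funcName,
    s"""
       |private void $funcName(InternalRow ${ctx.INPUT_ROW}) {
       |  ${ev.code.trim}
       |  $globalIsNull = ${ev.isNull};
       |  $globalValue = ${ev.value};
       |}
     """.stripMargin)
  (funcName, globalIsNull, globalValue)
}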





[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...

2016-11-30 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/15975
  
thanks, merging to master!

Since https://github.com/apache/spark/pull/15868 has not been backported to 2.1, this PR conflicts with 2.1. @gatorsmile, can you send a backport PR? Thanks.





[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16069
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69453/





[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16069
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16069
  
**[Test build #69453 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69453/consoleFull)** for PR 16069 at commit [`62f0ddb`](https://github.com/apache/spark/commit/62f0ddb44da7711b1066923419762da8b3628780).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching...

2016-11-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15975





[GitHub] spark issue #16093: [SPARK-18663][SQL] Simplify CountMinSketch aggregate imp...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16093
  
**[Test build #69464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69464/consoleFull)** for PR 16093 at commit [`b2985c4`](https://github.com/apache/spark/commit/b2985c4d817b416e434342e952fabf0ee37b9879).





[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16080
  
**[Test build #69465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69465/consoleFull)** for PR 16080 at commit [`5ee6489`](https://github.com/apache/spark/commit/5ee6489cdd1c22a1071c4fb6c7e4c4af126c9d50).





[GitHub] spark issue #16056: [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16056
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16056: [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16056
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69454/





[GitHub] spark issue #16056: [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16056
  
**[Test build #69454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69454/consoleFull)** for PR 16056 at commit [`f7b1aa0`](https://github.com/apache/spark/commit/f7b1aa05a64bc8efc43f6e932c1fcb06f18866f7).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395065
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an important class of hashing techniques, commonly used in clustering, approximate nearest neighbor search, and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions ("LSH families") to hash data points into buckets, so that data points which are close to each other are in the same bucket with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of an LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance function on `M`, an LSH family is a family of functions `h` that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the two features are hashed into the same bucket but are far away in distance, and a false negative if the two features are close in distance but are not hashed into the same bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions) is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined bucket length. The bucket length can be used to control the average size of hash buckets: a larger bucket length means a higher probability for features to be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in `spark.ml` for Jaccard distance, where input features are sets of natural numbers. The Jaccard distance of two sets is defined by the cardinality of their intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of those elements in the set. While both dense and sparse vectors are supported, sparse vectors are typically recommended for efficiency. For example, `Vectors.sparse(10, Array[(Int, Double)]((2, 1.0), (3, 1.0), (5, 1.0)))` means there are 10 elements in the space, and the set contains elements 2, 3, and 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero entry.
+
+
+
+
+Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+
+
+
+Refer to the [MinHash Java docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}
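
As an aside, the three formulas quoted above are easy to sanity-check by hand. The following toy Scala sketch (plain Scala, not spark.ml code; every name in it is invented for illustration) computes each of them directly:

import scala.util.Random

// Bucketed random projection: h(x) = floor((x . v) / r)
def brpHash(x: Array[Double], v: Array[Double], r: Double): Int = {
  val dot = x.zip(v).map { case (xi, vi) => xi * vi }.sum
  math.floor(dot / r).toInt
}

// MinHash: h(A) = min over a in A of g(a), for a random hash function g
def minHash(set: Set[Int], g: Int => Int): Int = set.map(g).min

// Jaccard distance: d(A, B) = 1 - |A intersect B| / |A union B|
def jaccardDistance(a: Set[Int], b: Set[Int]): Double =
  1.0 - a.intersect(b).size.toDouble / a.union(b).size

val rng = new Random(42)
val raw = Array.fill(3)(rng.nextGaussian())
val norm = math.sqrt(raw.map(x => x * x).sum)
val v = raw.map(_ / norm)                      // normalized random unit vector

println(brpHash(Array(1.0, 2.0, 3.0), v, r = 2.0))
println(minHash(Set(2, 3, 5), a => (1 + 3 * a) % 7))
// {2,3,5} vs {1,3,5}: intersection size 2, union size 4, so d = 1 - 2/4 = 0.5
println(jaccardDistance(Set(2, 3, 5), Set(1, 3, 5)))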

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395345
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90394053
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90394630
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.MinHashLSH
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object ApproxSimilarityJoinExample {
+  def main(args: Array[String]): Unit = {
+// Creates a SparkSession
+val spark = SparkSession
+  .builder
+  .appName("ApproxSimilarityJoinExample")
+  .getOrCreate()
+
+// $example on$
+val dfA = spark.createDataFrame(Seq(
+  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
+  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
+)).toDF("id", "keys")
+
+val dfB = spark.createDataFrame(Seq(
+  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
+  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
+  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
+)).toDF("id", "keys")
+
+val mh = new MinHashLSH()
+  .setNumHashTables(5)
+  .setInputCol("keys")
+  .setOutputCol("values")
+
+val model = mh.fit(dfA)
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+
+// Cache the transformed columns
+val transformedA = model.transform(dfA)
+val transformedB = model.transform(dfB)
+model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
+
+// Self Join
+model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < datasetB.id").show()
--- End diff --

Just a note: will `approxSimilarityJoin` return duplicates? If so, we should think about removing them automatically.
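
As a hedged sketch of the manual workaround (reusing `model` and `dfA` from the example above, and assuming the join output exposes `datasetA.id` and `datasetB.id` as the example's own filter does), a caller can keep each unordered pair once:

// Keep each unordered pair of the self-join exactly once; the strict '<'
// also drops the trivial (a, a) matches.
val selfJoined = model.approxSimilarityJoin(dfA, dfA, 0.6)
selfJoined.filter("datasetA.id < datasetB.id").show()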





[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395495
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395294
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395451
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90394871
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala ---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.MinHashLSH
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object MinHashLSHExample {
--- End diff --

This and the MinHash transformation example are almost the same. Perhaps we can remove the transformation example, and adjust the user guide section to refer (and link) to the code examples for `MinHashLSH` and `BucketedRandomProjectionLSH`.





[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90393584
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
+Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
--- End diff --

Should be updated to `MinHashLSH`?



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90395459
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.

[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90394571
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.MinHashLSH
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object ApproxSimilarityJoinExample {
+  def main(args: Array[String]): Unit = {
+// Creates a SparkSession
+val spark = SparkSession
+  .builder
+  .appName("ApproxSimilarityJoinExample")
+  .getOrCreate()
+
+// $example on$
+val dfA = spark.createDataFrame(Seq(
+  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
+  (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
+)).toDF("id", "keys")
+
+val dfB = spark.createDataFrame(Seq(
+  (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
+  (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
+  (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
+)).toDF("id", "keys")
+
+val mh = new MinHashLSH()
+  .setNumHashTables(5)
+  .setInputCol("keys")
+  .setOutputCol("values")
+
+val model = mh.fit(dfA)
+model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+
+// Cache the transformed columns
--- End diff --

This mentions caching but doesn't cache. 
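For reference, a minimal sketch of the presumably intended fix (an assumption, not the committed change): cache the transformed datasets and feed them to the join so the hash columns are computed once:

```
// Hypothetical fix: materialize the MinHash output columns once,
// then run the approximate join on the cached datasets.
val transformedA = model.transform(dfA).cache()
val transformedB = model.transform(dfB).cache()
model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
```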


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90393279
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an important class of hashing techniques, commonly used in clustering, approximate nearest neighbor search, and outlier detection on large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the two features are hashed into the same bucket even though they are far apart in distance, and a false negative if the two features are close in distance but are not hashed into the same bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects feature vectors onto a random unit vector and divides the projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a random unit vector and `r` is a user-defined bucket length. The bucket length can be used to control the average size of hash buckets. A larger bucket length means a higher probability for features to be hashed into the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example 
scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+
+
+
+Refer to the [RandomProjection Java 
docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
--- End diff --

Same here
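On the substance of the quoted section: the bucketing formula can be sketched in Scala as follows (illustrative only, not the actual `spark.ml` implementation; `bucketHash` is a hypothetical helper):

```
import org.apache.spark.ml.linalg.Vector

// h(x) = floor((x . v) / r): project x onto the random unit vector v,
// then bucket the projection using the user-defined bucket length r.
def bucketHash(x: Vector, v: Vector, r: Double): Long = {
  val projection = (0 until x.size).map(i => x(i) * v(i)).sum
  math.floor(projection / r).toLong
}
```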


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/15795#discussion_r90393263
  
--- Diff: docs/ml-features.md ---
@@ -1478,3 +1478,139 @@ for more details on the API.
 {% include_example python/ml/chisq_selector_example.py %}
 
 
+
+# Locality Sensitive Hashing
+[Locality Sensitive Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an important class of hashing techniques, commonly used in clustering, approximate nearest neighbor search, and outlier detection on large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH 
families) to hash data points into buckets, so that the data points which are 
close to each other are in the same buckets with high probability, while data 
points that are far away from each other are very likely in different buckets. 
A formal definition of LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance 
function on `M`, an LSH family is a family of functions `h` that satisfy the 
following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the two features are hashed into the same bucket even though they are far apart in distance, and a false negative if the two features are close in distance but are not hashed into the same bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random 
Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions)
 is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance 
is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects feature vectors onto a random unit vector and divides the projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a random unit vector and `r` is a user-defined bucket length. The bucket length can be used to control the average size of hash buckets. A larger bucket length means a higher probability for features to be hashed into the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, 
and supports both sparse and dense vectors.
+
+
+
+
+Refer to the [RandomProjection Scala 
docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
--- End diff --

This Scaladoc link should be for `BucketedRandomProjection` now


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] Follow up PR to Close "kryo auto...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16096
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69455/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] Follow up PR to Close "kryo auto...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16096
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] Follow up PR to Close "kryo auto...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16096
  
**[Test build #69455 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69455/consoleFull)**
 for PR 16096 at commit 
[`76e0143`](https://github.com/apache/spark/commit/76e01432398295f8c48606dbba4847eede3815a2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16095: [SPARK-18666][Web UI] Remove the codes checking deprecat...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16095
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16095: [SPARK-18666][Web UI] Remove the codes checking deprecat...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16095
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69452/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16095: [SPARK-18666][Web UI] Remove the codes checking deprecat...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16095
  
**[Test build #69452 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69452/consoleFull)**
 for PR 16095 at commit 
[`1bf8528`](https://github.com/apache/spark/commit/1bf8528f248aa9ed4908ebdd922d247cf610e88e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16097: [SPARK-18665] set job to "ERROR" when job is canceled

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16097
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16097: [SPARK-18665] set job to "ERROR" when job is canceled

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16097
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69462/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16097: [SPARK-18665] set job to "ERROR" when job is canceled

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16097
  
**[Test build #69462 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69462/consoleFull)**
 for PR 16097 at commit 
[`8b9322d`](https://github.com/apache/spark/commit/8b9322d1f8421a1868d8d39472d1b6f3681b4de3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16080
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69459/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16080
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16097: [SPARK-18665] set job to "ERROR" when job is canc...

2016-11-30 Thread cenyuhai
Github user cenyuhai closed the pull request at:

https://github.com/apache/spark/pull/16097


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16080
  
**[Test build #69459 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69459/consoleFull)**
 for PR 16080 at commit 
[`198d273`](https://github.com/apache/spark/commit/198d2734936fdadb39bdc77dc223b4ee41c660ba).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16037
  
**[Test build #69463 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69463/consoleFull)**
 for PR 16037 at commit 
[`d7ebc7d`](https://github.com/apache/spark/commit/d7ebc7df63b89d7ba7cd7e3a688089749c393082).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16037
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69463/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16037
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16097: [SPARK-18665] set job to "ERROR" when job is canceled

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16097
  
**[Test build #69462 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69462/consoleFull)**
 for PR 16097 at commit 
[`8b9322d`](https://github.com/apache/spark/commit/8b9322d1f8421a1868d8d39472d1b6f3681b4de3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16037
  
**[Test build #69463 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69463/consoleFull)**
 for PR 16037 at commit 
[`d7ebc7d`](https://github.com/apache/spark/commit/d7ebc7df63b89d7ba7cd7e3a688089749c393082).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16097: [SPARK-18665] set job to "ERROR" when job is canc...

2016-11-30 Thread cenyuhai
GitHub user cenyuhai opened a pull request:

https://github.com/apache/spark/pull/16097

[SPARK-18665]  set job to "ERROR" when job is canceled

## What changes were proposed in this pull request?
Set the job state to "ERROR" when the job is canceled.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cenyuhai/spark SPARK-18665

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16097.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16097


commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7
Author: 岑玉海 <261810...@qq.com>
Date:   2016-08-21T03:55:04Z

Merge pull request #8 from apache/master

merge latest code to my fork

commit b6b0d0a41c1aa59bc97a0aa438619d903b78b108
Author: 岑玉海 <261810...@qq.com>
Date:   2016-09-06T03:03:08Z

Merge pull request #9 from apache/master

Merge latest code to my fork

commit abd7924eab25b6dfdfd78c23a78dadcb3b9fbe1e
Author: 岑玉海 <261810...@qq.com>
Date:   2016-09-08T17:10:12Z

Merge pull request #10 from apache/master

Merge latest code to my fork

commit 4b460e218244cdb0884e73c5fca29cc43b516972
Author: cenyuhai 
Date:   2016-09-15T09:25:24Z

Merge remote-tracking branch 'remotes/apache/master'

commit 22cb0a6f6f60ffae4a449727959cdd2940699f8e
Author: 岑玉海 <261810...@qq.com>
Date:   2016-12-01T06:09:42Z

Merge pull request #12 from apache/master

Merge latest code to my branch

commit 8b9322d1f8421a1868d8d39472d1b6f3681b4de3
Author: cenyuhai <261810...@qq.com>
Date:   2016-12-01T06:36:26Z

set statement state to error after user canceled job




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16037#discussion_r90391752
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +239,25 @@ object LBFGS extends Logging {
   val bcW = data.context.broadcast(w)
   val localGradient = gradient
 
-  val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 
0.0))(
-  seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, 
features)) =>
-val l = localGradient.compute(
-  features, label, bcW.value, grad)
-(grad, loss + l)
-  },
-  combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), 
(grad2, loss2)) =>
+  // Given (current accumulated gradient, current loss) and (label, 
features)
+  // tuples, updates the current gradient and current loss
+  val seqOp = (c: (Vector, Double), v: (Double, Vector)) =>
+(c, v) match {
+  case ((grad, loss), (label, features)) =>
+val denseGrad = grad.toDense
+val l = localGradient.compute(features, label, bcW.value, 
denseGrad)
+(Vectors.dense(denseGrad.values), loss + l)
--- End diff --

I can compile with my example code...


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...

2016-11-30 Thread MLnick
Github user MLnick commented on the issue:

https://github.com/apache/spark/pull/16037
  
ok to test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #9276: [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add an Exchange...

2016-11-30 Thread dreamworks007
Github user dreamworks007 commented on the issue:

https://github.com/apache/spark/pull/9276
  
@yhuai , could you please let us know whether there are any known issues / limitations with this feature? Has it been tested on some large jobs?

We are also considering automatically determining shuffle partitions, and happened to see this PR; we are therefore interested in exploring this feature a little to see if we could productionize it for all jobs (by default).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15915: [SPARK-18485][CORE] Underlying integer overflow when cre...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15915
  
**[Test build #69461 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69461/consoleFull)**
 for PR 15915 at commit 
[`7c3e2c7`](https://github.com/apache/spark/commit/7c3e2c7cb2b45c0481f5272d8ee1f32249095dec).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15915: [SPARK-18485][CORE] Underlying integer overflow when cre...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15915
  
**[Test build #69460 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69460/consoleFull)**
 for PR 15915 at commit 
[`fffd5f5`](https://github.com/apache/spark/commit/fffd5f5150ae1108bfe790112ce1a42d030d5576).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] backport to branch-2.0

2016-11-30 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/16096
  
Can you use a meaningful title & description instead of saying backport? 
This is important because it becomes part of the commit history.

You can put the backport message in the body. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...

2016-11-30 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16037#discussion_r90388974
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +239,25 @@ object LBFGS extends Logging {
   val bcW = data.context.broadcast(w)
   val localGradient = gradient
 
-  val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 
0.0))(
-  seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, 
features)) =>
-val l = localGradient.compute(
-  features, label, bcW.value, grad)
-(grad, loss + l)
-  },
-  combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), 
(grad2, loss2)) =>
+  // Given (current accumulated gradient, current loss) and (label, 
features)
+  // tuples, updates the current gradient and current loss
+  val seqOp = (c: (Vector, Double), v: (Double, Vector)) =>
+(c, v) match {
+  case ((grad, loss), (label, features)) =>
+val denseGrad = grad.toDense
+val l = localGradient.compute(features, label, bcW.value, 
denseGrad)
+(Vectors.dense(denseGrad.values), loss + l)
--- End diff --

Is this really necessary? 
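If it isn't, a sketch of the simpler alternative (my reading of the suggestion, reusing `localGradient`, `bcW`, and the tuple types from the quoted diff; `compute` accumulates into `denseGrad` in place):

```
val seqOp = (c: (Vector, Double), v: (Double, Vector)) =>
  (c, v) match {
    case ((grad, loss), (label, features)) =>
      val denseGrad = grad.toDense
      val l = localGradient.compute(features, label, bcW.value, denseGrad)
      // denseGrad was updated in place, so return it directly rather
      // than re-wrapping its values array in a new vector.
      (denseGrad, loss + l)
  }
```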


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #13706: [SPARK-15988] [SQL] Implement DDL commands: Create/Drop ...

2016-11-30 Thread lshmouse
Github user lshmouse commented on the issue:

https://github.com/apache/spark/pull/13706
  
@lianhuiwang 
I think the problem is that there is no need to check whether macroFunction is resolved. Data types may be cast dynamically according to the SQL data type.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...

2016-11-30 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/16076
  
@jkbradley I addressed the other comments except for [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291), since I think it's a SparkR-related issue and should be listed in the [SparkR section of the migration guide](http://spark.apache.org/docs/latest/sparkr.html#migration-guide). I'm preparing a PR to update the SparkR migration guide. Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16076: [SPARK-18324][ML][DOC] Update ML programming and ...

2016-11-30 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16076#discussion_r90386957
  
--- Diff: docs/ml-guide.md ---
@@ -60,152 +60,37 @@ MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future 
releases,
 and the migration guide below will explain all changes between releases.
 
-## From 1.6 to 2.0
+## From 2.0 to 2.1
 
 ### Breaking changes
 
-There were several breaking changes in Spark 2.0, which are outlined below.
-
-**Linear algebra classes for DataFrame-based APIs**
-
-Spark's linear algebra dependencies were moved to a new project, 
`mllib-local` 
-(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). 
-As part of this change, the linear algebra classes were copied to a new 
package, `spark.ml.linalg`. 
-The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` 
classes, 
-leading to a few breaking changes, predominantly in various model classes 
-(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for 
a full list).
-
-**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the 
previous package `spark.mllib.linalg`.
-
-_Converting vectors and matrices_
-
-While most pipeline components support backward compatibility for loading, 
-some existing `DataFrames` and pipelines in Spark versions prior to 2.0, 
that contain vector or matrix 
-columns, may need to be migrated to the new `spark.ml` vector and matrix 
types. 
-Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to 
`spark.ml.linalg` types
-(and vice versa) can be found in `spark.mllib.util.MLUtils`.
-
-There are also utility methods available for converting single instances 
of 
-vectors and matrices. Use the `asML` method on a `mllib.linalg.Vector` / 
`mllib.linalg.Matrix`
-for converting to `ml.linalg` types, and 
-`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML` 
-for converting to `mllib.linalg` types.
-
-
-
-
-{% highlight scala %}
-import org.apache.spark.mllib.util.MLUtils
-
-// convert DataFrame columns
-val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
-val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-// convert a single vector or matrix
-val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
-val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
-{% endhighlight %}
-
-Refer to the [`MLUtils` Scala 
docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further 
detail.
-
-
-
-
-{% highlight java %}
-import org.apache.spark.mllib.util.MLUtils;
-import org.apache.spark.sql.Dataset;
-
-// convert DataFrame columns
-Dataset<Row> convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF);
-Dataset<Row> convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF);
-// convert a single vector or matrix
-org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML();
-org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML();
-{% endhighlight %}
-
-Refer to the [`MLUtils` Java 
docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.
-
-
-
-
-{% highlight python %}
-from pyspark.mllib.util import MLUtils
-
-# convert DataFrame columns
-convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF)
-convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF)
-# convert a single vector or matrix
-mlVec = mllibVec.asML()
-mlMat = mllibMat.asML()
-{% endhighlight %}
-
-Refer to the [`MLUtils` Python 
docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further 
detail.
-
-
-
 **Deprecated methods removed**
 
-Several deprecated methods were removed in the `spark.mllib` and 
`spark.ml` packages:
-
-* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator`
-* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml`
-* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as 
`DeveloperApi`)
-* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these 
functions are available on `RDD`s directly, and were marked as `DeveloperApi`)
-* `defaultStategy` in `mllib.tree.configuration.Strategy`
-* `build` in `mllib.tree.Node`
-* libsvm loaders for multiclass and load/save labeledData methods in 
`mllib.util.MLUtils`
-
-A full list of breaking changes can be found at 
[SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810).
+* `setLabelCol` in `feature.ChiSqSelectorModel`
+* `numTrees` in `classification.RandomForestClassificationModel` (This now 
refers to the Param called `numTrees`)
+* `numTrees` in 

[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16080
  
**[Test build #69459 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69459/consoleFull)**
 for PR 16080 at commit 
[`198d273`](https://github.com/apache/spark/commit/198d2734936fdadb39bdc77dc223b4ee41c660ba).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...

2016-11-30 Thread weiqingy
Github user weiqingy commented on the issue:

https://github.com/apache/spark/pull/16069
  
I will update the PR to fix the deprecation warnings in 
project/SparkBuild.scala.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16076
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69456/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16093: [SPARK-18663][SQL] Simplify CountMinSketch aggregate imp...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16093
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69451/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16076
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16076
  
**[Test build #69456 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69456/consoleFull)**
 for PR 16076 at commit 
[`e784901`](https://github.com/apache/spark/commit/e784901f1e2b1cd10c15537efa28077f8e67a768).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16093: [SPARK-18663][SQL] Simplify CountMinSketch aggregate imp...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16093
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16093: [SPARK-18663][SQL] Simplify CountMinSketch aggregate imp...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16093
  
**[Test build #69451 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69451/consoleFull)**
 for PR 16093 at commit 
[`9f07563`](https://github.com/apache/spark/commit/9f075633896677b2dbcbd37c93a60544e39a0fea).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16084: [SPARK-18654][SQL] Remove unreachable patterns in makeRo...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16084
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69450/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16084: [SPARK-18654][SQL] Remove unreachable patterns in makeRo...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16084
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16084: [SPARK-18654][SQL] Remove unreachable patterns in makeRo...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16084
  
**[Test build #69450 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69450/consoleFull)**
 for PR 16084 at commit 
[`3535acf`](https://github.com/apache/spark/commit/3535acf4f84a9057f4bbb88a81e4fff5f5167c0d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16083: [SPARK-18097][SQL] Add exception catch to handle ...

2016-11-30 Thread thomastechs
Github user thomastechs commented on a diff in the pull request:

https://github.com/apache/spark/pull/16083#discussion_r90386133
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -189,14 +189,18 @@ case class DropTableCommand(
 if (!catalog.isTemporaryTable(tableName) && 
catalog.tableExists(tableName)) {
   // If the command DROP VIEW is to drop a table or DROP TABLE is to 
drop a view
   // issue an exception.
-  catalog.getTableMetadata(tableName).tableType match {
-case CatalogTableType.VIEW if !isView =>
-  throw new AnalysisException(
-"Cannot drop a view with DROP TABLE. Please use DROP VIEW 
instead")
-case o if o != CatalogTableType.VIEW && isView =>
-  throw new AnalysisException(
-s"Cannot drop a table with DROP VIEW. Please use DROP TABLE 
instead")
-case _ =>
+  try {
+catalog.getTableMetadata(tableName).tableType match {
+  case CatalogTableType.VIEW if !isView =>
+throw new AnalysisException(
+  "Cannot drop a view with DROP TABLE. Please use DROP VIEW 
instead")
+  case o if o != CatalogTableType.VIEW && isView =>
+throw new AnalysisException(
+  s"Cannot drop a table with DROP VIEW. Please use DROP TABLE 
instead")
+  case _ =>
+}
+  } catch {
+  case e: QueryExecutionException => log.warn(e.toString, e)
--- End diff --

@gatorsmile and @Davies: in that case, any suggestions for how to mock the corrupted-metadata scenario?
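One hypothetical approach (an assumption, not an existing Spark test helper): stub the catalog with Mockito so that the metadata lookup throws, then verify that DROP TABLE still completes:

```
import org.mockito.Mockito.{mock, when}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.SessionCatalog
import org.apache.spark.sql.execution.QueryExecutionException

// Simulate corrupted metadata: getTableMetadata fails while
// isTemporaryTable and tableExists succeed, so DropTableCommand
// reaches the new catch block.
val catalog = mock(classOf[SessionCatalog])
val ident = TableIdentifier("t")
when(catalog.isTemporaryTable(ident)).thenReturn(false)
when(catalog.tableExists(ident)).thenReturn(true)
when(catalog.getTableMetadata(ident))
  .thenThrow(new QueryExecutionException("corrupted table metadata"))
```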


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16000: [SPARK-18537][Web UI]Add a REST api to spark streaming

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16000
  
**[Test build #69458 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69458/consoleFull)**
 for PR 16000 at commit 
[`651dc67`](https://github.com/apache/spark/commit/651dc679b865603be677ca9d30b975ce5c3c5df0).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16062: [SPARK-18629][SQL] Fix numPartition of JDBCSuite Testcas...

2016-11-30 Thread weiqingy
Github user weiqingy commented on the issue:

https://github.com/apache/spark/pull/16062
  
@srowen Thanks for the review.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16089
  
**[Test build #69457 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69457/consoleFull)**
 for PR 16089 at commit 
[`56667bd`](https://github.com/apache/spark/commit/56667bd86c1dbb52fb47134042e5a529241a0637).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...

2016-11-30 Thread weiqingy
Github user weiqingy commented on the issue:

https://github.com/apache/spark/pull/16069
  
Thanks, @dongjoon-hyun. Yes, good catch. I have updated the description.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...

2016-11-30 Thread NathanHowell
Github user NathanHowell commented on a diff in the pull request:

https://github.com/apache/spark/pull/16089#discussion_r90385252
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala
 ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.{OutputStream, OutputStreamWriter}
+import java.nio.charset.{Charset, StandardCharsets}
+
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.compress._
+import org.apache.hadoop.mapreduce.JobContext
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
+import org.apache.hadoop.util.ReflectionUtils
+
+private[spark] object CodecStreams {
--- End diff --

Looks that way, I've removed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16076
  
**[Test build #69456 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69456/consoleFull)**
 for PR 16076 at commit 
[`e784901`](https://github.com/apache/spark/commit/e784901f1e2b1cd10c15537efa28077f8e67a768).





[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-11-30 Thread NathanHowell
Github user NathanHowell commented on the issue:

https://github.com/apache/spark/pull/16089
  
Doh, forgot to run the Hive tests. Should be fixed now.





[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...

2016-11-30 Thread weiqingy
Github user weiqingy commented on the issue:

https://github.com/apache/spark/pull/16069
  
@srowen Thanks for the information.

For the sbt update, I think the only file that currently needs to change is `project/build.properties`. The [Jenkins build console output](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69365/consoleFull) shows that `sbt 0.13.13` was downloaded and used in that build. Correct me if I missed anything.
```
Attempting to fetch sbt
Launching sbt from build/sbt-launch-0.13.13.jar
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option 
MaxPermSize=512m; support was removed in 8.0
Getting org.scala-sbt sbt 0.13.13 ...
```
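For reference, the whole sbt bump is a one-line edit to that file (a sketch, assuming the standard `sbt.version` key the launcher reads):
```
# project/build.properties
sbt.version=0.13.13
```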
I have updated Zinc and the Maven plugins. `mvn versions:display-plugin-updates` shows that the following plugin updates are available:
```
[INFO]   maven-assembly-plugin ............................... 2.6 -> 3.0.0
[INFO]   maven-compiler-plugin ............................. 3.5.1 -> 3.6.0
[INFO]   maven-jar-plugin .................................... 2.6 -> 3.0.2
[INFO]   maven-javadoc-plugin ............................ 2.10.3 -> 2.10.4
[INFO]   maven-source-plugin ................................. 2.4 -> 3.0.1
[INFO]   org.codehaus.mojo:build-helper-maven-plugin ........ 1.10 -> 1.12
[INFO]   org.codehaus.mojo:exec-maven-plugin .............. 1.4.0 -> 1.5.0
```
Also, in `Building Spark Project External Flume Sink 2.1.0-SNAPSHOT`,
```
The following plugin updates are available:
[INFO]   org.apache.avro:avro-maven-plugin ................ 1.7.7 -> 1.8.1
```
```
 
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <configuration>
    <outputDirectory>${project.basedir}/target/scala-${scala.binary.version}/src_managed/main/compiled_avro</outputDirectory>
  </configuration>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>idl-protocol</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```
If we want to update `avro-maven-plugin` from 1.7.7 to 1.8.1, we need to change the value of `<avro.version>1.7.7</avro.version>` in the `spark-parent_2.11` pom file. That will affect some dependencies, e.g.
```

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
  <scope>${hadoop.deps.scope}</scope>
</dependency>
```
```

<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-ipc</artifactId>
  <classifier>tests</classifier>
  <version>${avro.version}</version>
  <scope>test</scope>
</dependency>
```
...

So `avro-maven-plugin` was not updated in this PR.





[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] backport to branch-2.0

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16096
  
**[Test build #69455 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69455/consoleFull)**
 for PR 16096 at commit 
[`76e0143`](https://github.com/apache/spark/commit/76e01432398295f8c48606dbba4847eede3815a2).





[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...

2016-11-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/16069
  
nit: in the PR description, should it be 0.3.9 instead of 0.13.9?
> zinc: 0.13.9 -> 0.3.11,





[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...

2016-11-30 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/16086
  
@gatorsmile would it be possible for you to review this? You might be familiar with Kanji.





[GitHub] spark pull request #16096: [SPARK-18617][BACKPORT] backport to branch-2.0

2016-11-30 Thread uncleGen
GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/16096

[SPARK-18617][BACKPORT] backport to branch-2.0

## What changes were proposed in this pull request?

Backport #16052 to branch-2.0, with the incremental update in #16091.

## How was this patch tested?

new unit test

cc @zsxwing @rxin


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark branch-2.0-SPARK-18617

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16096


commit 76e01432398295f8c48606dbba4847eede3815a2
Author: uncleGen 
Date:   2016-12-01T05:07:59Z

backport to branch-2.0







[GitHub] spark issue #16056: [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16056
  
**[Test build #69454 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69454/consoleFull)**
 for PR 16056 at commit 
[`f7b1aa0`](https://github.com/apache/spark/commit/f7b1aa05a64bc8efc43f6e932c1fcb06f18866f7).





[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16069
  
**[Test build #69453 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69453/consoleFull)**
 for PR 16069 at commit 
[`62f0ddb`](https://github.com/apache/spark/commit/62f0ddb44da7711b1066923419762da8b3628780).





[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...

2016-11-30 Thread koertkuipers
Github user koertkuipers commented on the issue:

https://github.com/apache/spark/pull/15918
  
If we do a flag, I would also prefer that the current implicits be narrower when the flag is not set, if possible.
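
For context, here is a minimal sketch of the narrow, explicit opt-in that exists today (the type name is hypothetical); the flag discussed here would decide whether the implicits reach any wider than this:
```
import org.apache.spark.sql.{Encoder, Encoders}

// Hypothetical type that the built-in implicit encoders do not cover.
class OpaqueThing(val payload: Any) extends Serializable

// Explicit, narrow opt-in: Kryo is used only where the user asks for it.
implicit val opaqueEnc: Encoder[OpaqueThing] = Encoders.kryo[OpaqueThing]
```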





[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16086
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69446/
Test PASSed.





[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16086
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16086
  
**[Test build #69446 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69446/consoleFull)**
 for PR 16086 at commit 
[`350e1ae`](https://github.com/apache/spark/commit/350e1ae6058014febf3b793f64fd7912c5cc814c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16095: [SPARK-18666][Web UI] Remove the codes checking deprecat...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16095
  
**[Test build #69452 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69452/consoleFull)**
 for PR 16095 at commit 
[`1bf8528`](https://github.com/apache/spark/commit/1bf8528f248aa9ed4908ebdd922d247cf610e88e).





[GitHub] spark issue #16021: [SPARK-18593][SQL] JDBCRDD returns incorrect results for...

2016-11-30 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/16021
  
Hi, @rxin .

Not all of the 6 commits are cherry-pickable. In particular, the last 2 commits have inevitable conflicts due to some wide commits (about `import order` and `syntax` cleanups).

But I found the following clean cherry-pick sequence for the first 4 commits.

```
$ git cherry-pick -x 28112657ea5919451291c21b4b8e1eb3db0ec8d4
$ git cherry-pick -x 0f6936b5f1c9b0be1c33b98ffb62a72ae0c3e2a8
$ git cherry-pick -x 7f443a6879fa33ca8adb682bd85df2d56fb5fcda
$ git cherry-pick -x 2aad2d372469aaf2773876cae98ef002fef03aa3
$ git cherry-pick -x 554d840a9ade79722c96972257435a05e2aa9d88
$ git cherry-pick -x 8c1b867cee816d0943184c7b485cd11e255d8130
$ git cherry-pick -x 5c2682b0c8fd2aeae2af1adb716ee0d5f8b85135
$ git cherry-pick -x ad5b7cfcca7a5feb83b9ed94b6e725c6d789579b
$ git cherry-pick -x 94f7a12b3c8e4a6ecd969893e562feb7ffba4c24
$ git diff HEAD~9 --stat

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala | 102 +++-
sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala    |  68 -
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala |  15 +
3 files changed, 139 insertions(+), 46 deletions(-)
```
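
An aside: assuming a clean working tree, a quick way to probe whether a single commit applies cleanly before running the whole sequence:

```
$ git cherry-pick --no-commit 28112657ea5919451291c21b4b8e1eb3db0ec8d4
$ git reset --hard HEAD   # discard the probe once it is confirmed clean
```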

All are related commits in the above three files. If you don't mind, could you cherry-pick the above? The following are their titles.

1. [SPARK-12236][SQL] JDBC filter tests all pass if filters are not really 
pushed down.
2. [SPARK-12249][SQL] JDBC non-equality comparison operator not pushed down.
3. [SPARK-12314][SQL] isnull operator not pushed down for JDBC datasource.
4. [SPARK-12315][SQL] isnotnull operator not pushed down for JDBC 
datasource.
5. Style fix for the previous 3 JDBC filter push down commits. (@rxin)
6. [SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions 
7. [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE 
push-down filters for JDBC
8. [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown 
for JDBCRDD and add few filters (Liang-Chi Hsieh)
9. [SPARK-10180][SQL] JDBC datasource are not processing EqualNullSafe 
filter 

After you cherry-pick these, I will create two PRs for the remaining commits with inevitable conflicts.






[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16086
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69443/
Test PASSed.





[GitHub] spark pull request #16095: [UI] Remove the codes checking deprecated config ...

2016-11-30 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/16095

[UI] Remove the codes checking deprecated config spark.sql.unsafe.enabled

## What changes were proposed in this pull request?

`spark.sql.unsafe.enabled` has been deprecated since 2.0, but there is still code in the UI that checks it. We should remove the check and clean up the code.
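
For illustration, the kind of vestigial check being deleted looks roughly like this; this is a sketch, not the exact Spark UI source (it assumes a `SparkConf` named `conf` in scope):

```
// Sketch only: the config no longer controls anything at runtime,
// so gating UI output on it is dead logic.
val displayPeakExecutionMemory =
  conf.getBoolean("spark.sql.unsafe.enabled", defaultValue = true)
```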

## How was this patch tested?

Changes to the related existing unit tests.




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 remove-deprecated-config-code

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/16095.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16095


commit 1bf8528f248aa9ed4908ebdd922d247cf610e88e
Author: Liang-Chi Hsieh 
Date:   2016-12-01T04:50:32Z

Remove the codes checking deprecated config spark.sql.unsafe.enabled.







[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15975
  
Merged build finished. Test PASSed.





[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15975
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69444/
Test PASSed.





[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15975
  
**[Test build #69444 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69444/consoleFull)**
 for PR 15975 at commit 
[`728c103`](https://github.com/apache/spark/commit/728c103fc10d5118eff4ff5bf9372da8557ecf60).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16077: [SPARK-18643][SPARKR] SparkR hangs at session start when...

2016-11-30 Thread yanboliang
Github user yanboliang commented on the issue:

https://github.com/apache/spark/pull/16077
  
LGTM





[GitHub] spark issue #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss up orig...

2016-11-30 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/15994
  
Hi @windpiger, it seems something has gone wrong. Would you try to rebase this, please?
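
For reference, a typical flow for that (assuming the `upstream` remote points at apache/spark and `<your-branch>` is this PR's branch):

```
$ git fetch upstream
$ git rebase upstream/master
$ git push --force-with-lease origin <your-branch>
```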





[GitHub] spark issue #16068: [SPARK-18637][SQL]Stateful UDF should be considered as n...

2016-11-30 Thread zhzhan
Github user zhzhan commented on the issue:

https://github.com/apache/spark/pull/16068
  
@hvanhovell Thanks for looking at this. We have a large number of UDFs with this issue. For example, a UDF may give different results under different partitioning/sorting, but the UDF is pushed down before the partition/sort, resulting in unexpected behavior. I will work on finding some test cases for it.
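
A hedged sketch of the failure mode (hypothetical UDF, not one of ours): a UDF that keeps per-instance state returns values that depend on which rows reach it in which order, so evaluating it before the intended partition/sort changes the answer.

```
import org.apache.spark.sql.functions.udf

// Each task mutates its own deserialized copy of `seen`, so the value a
// given row receives depends on partitioning and row order.
var seen = 0L
val rowCounter = udf { () => seen += 1; seen }
```

Pushing `rowCounter()` above or below a repartition/sort then yields a different column, which is exactly the hazard described here.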





[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...

2016-11-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15910





[GitHub] spark issue #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss up orig...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15994
  
Build finished. Test PASSed.





[GitHub] spark issue #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss up orig...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/15994
  
**[Test build #69442 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69442/consoleFull)**
 for PR 15994 at commit 
[`4c9f3a0`](https://github.com/apache/spark/commit/4c9f3a0aa96adcf64ecd5350f291f5a212128e30).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.





[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16089
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16089
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69449/
Test FAILed.





[GitHub] spark issue #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss up orig...

2016-11-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/15994
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69442/
Test PASSed.





[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...

2016-11-30 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16089
  
**[Test build #69449 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69449/consoleFull)**
 for PR 16089 at commit 
[`298e507`](https://github.com/apache/spark/commit/298e507d5c42328de610d6109afb11076aadfb96).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.




