[GitHub] spark pull request #15620: [SPARK-18091] [SQL] Deep if expressions cause Gen...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/15620#discussion_r90397578

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala ---

```diff
@@ -64,19 +64,74 @@ case class If(predicate: Expression, trueValue: Expression, falseValue: Expressi
     val trueEval = trueValue.genCode(ctx)
     val falseEval = falseValue.genCode(ctx)
-    ev.copy(code = s"""
-      ${condEval.code}
-      boolean ${ev.isNull} = false;
-      ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};
-      if (!${condEval.isNull} && ${condEval.value}) {
-        ${trueEval.code}
-        ${ev.isNull} = ${trueEval.isNull};
-        ${ev.value} = ${trueEval.value};
-      } else {
-        ${falseEval.code}
-        ${ev.isNull} = ${falseEval.isNull};
-        ${ev.value} = ${falseEval.value};
-      }""")
+    // place generated code of condition, true value and false value in separate methods if
+    // their code combined is large
+    val combinedLength = condEval.code.length + trueEval.code.length + falseEval.code.length
+    val generatedCode = if (combinedLength > 1024 &&
+      // Split these expressions only if they are created from a row object
+      (ctx.INPUT_ROW != null && ctx.currentVars == null)) {
+
+      val (condFuncName, condGlobalIsNull, condGlobalValue) =
+        createAndAddFunction(ctx, condEval, predicate.dataType, "evalIfCondExpr")
+      val (trueFuncName, trueGlobalIsNull, trueGlobalValue) =
+        createAndAddFunction(ctx, trueEval, trueValue.dataType, "evalIfTrueExpr")
+      val (falseFuncName, falseGlobalIsNull, falseGlobalValue) =
+        createAndAddFunction(ctx, falseEval, falseValue.dataType, "evalIfFalseExpr")
+      s"""
+        $condFuncName(${ctx.INPUT_ROW});
+        boolean ${ev.isNull} = false;
+        ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};
+        if (!$condGlobalIsNull && $condGlobalValue) {
+          $trueFuncName(${ctx.INPUT_ROW});
+          ${ev.isNull} = $trueGlobalIsNull;
+          ${ev.value} = $trueGlobalValue;
+        } else {
+          $falseFuncName(${ctx.INPUT_ROW});
+          ${ev.isNull} = $falseGlobalIsNull;
+          ${ev.value} = $falseGlobalValue;
+        }
+      """
+    } else {
+      s"""
+        ${condEval.code}
+        boolean ${ev.isNull} = false;
+        ${ctx.javaType(dataType)} ${ev.value} = ${ctx.defaultValue(dataType)};
+        if (!${condEval.isNull} && ${condEval.value}) {
+          ${trueEval.code}
+          ${ev.isNull} = ${trueEval.isNull};
+          ${ev.value} = ${trueEval.value};
+        } else {
+          ${falseEval.code}
+          ${ev.isNull} = ${falseEval.isNull};
+          ${ev.value} = ${falseEval.value};
+        }
+      """
+    }
+
+    ev.copy(code = generatedCode)
+  }
+
+  private def createAndAddFunction(ctx: CodegenContext,
```
--- End diff --

the code style is still wrong, see https://github.com/apache/spark/pull/15620#discussion_r90185562

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
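The split decision in the diff above can be sketched as follows. This is a hypothetical, simplified model in Python: `ExprCode`, `should_split`, and `SPLIT_THRESHOLD` are illustrative stand-ins, not Spark's actual codegen classes; only the `combinedLength > 1024` check mirrors the patch.

```python
# Simplified model of the patch's split decision: when the combined
# generated Java code for the condition, true branch and false branch
# grows large, each piece is emitted as its own method instead of being
# inlined into one giant method.
from dataclasses import dataclass

SPLIT_THRESHOLD = 1024  # same constant as the `combinedLength > 1024` check

@dataclass
class ExprCode:
    code: str     # generated Java source for the sub-expression
    is_null: str  # variable name holding the null flag
    value: str    # variable name holding the value

def should_split(cond: ExprCode, true_val: ExprCode, false_val: ExprCode) -> bool:
    """Mirror of the patch's length check: split only when the combined
    generated code exceeds the threshold."""
    combined = len(cond.code) + len(true_val.code) + len(false_val.code)
    return combined > SPLIT_THRESHOLD

small = ExprCode("boolean c = i.getBoolean(0);", "cNull", "c")
huge = ExprCode("x += 1;\n" * 200, "tNull", "t")
print(should_split(small, small, small))  # False: small code stays inline
print(should_split(huge, small, small))   # True: large code moves into methods
```

The motivation, per the JIRA title, is that deeply nested `If` expressions otherwise inline everything into one generated method, which can exceed the JVM's 64 KB bytecode limit per method.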
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/15975

thanks, merging to master! Since https://github.com/apache/spark/pull/15868 is not backported to 2.1, this PR conflicts with 2.1, @gatorsmile can you send a backport PR? thanks.
[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16069

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69453/
[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16069

Merged build finished. Test PASSed.
[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16069

**[Test build #69453 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69453/consoleFull)** for PR 16069 at commit [`62f0ddb`](https://github.com/apache/spark/commit/62f0ddb44da7711b1066923419762da8b3628780).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15975
[GitHub] spark issue #16093: [SPARK-18663][SQL] Simplify CountMinSketch aggregate imp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16093

**[Test build #69464 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69464/consoleFull)** for PR 16093 at commit [`b2985c4`](https://github.com/apache/spark/commit/b2985c4d817b416e434342e952fabf0ee37b9879).
[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16080

**[Test build #69465 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69465/consoleFull)** for PR 16080 at commit [`5ee6489`](https://github.com/apache/spark/commit/5ee6489cdd1c22a1071c4fb6c7e4c4af126c9d50).
[GitHub] spark issue #16056: [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16056

Merged build finished. Test PASSed.
[GitHub] spark issue #16056: [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16056

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69454/
[GitHub] spark issue #16056: [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16056

**[Test build #69454 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69454/consoleFull)** for PR 16056 at commit [`f7b1aa0`](https://github.com/apache/spark/commit/f7b1aa05a64bc8efc43f6e932c1fcb06f18866f7).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395065

--- Diff: docs/ml-features.md ---

```diff
@@ -1478,3 +1478,139 @@ for more details on the API.
 
 {% include_example python/ml/chisq_selector_example.py %}
+
+# Locality Sensitive Hashing
+[Locality Sensitive Hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an important class of hashing techniques, which is commonly used in clustering, approximate nearest neighbor search and outlier detection with large datasets.
+
+The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that data points which are close to each other are in the same bucket with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of an LSH family is as follows:
+
+In a metric space `(M, d)`, where `M` is a set and `d` is a distance function on `M`, an LSH family is a family of functions `h` that satisfy the following properties:
+`\[
+\forall p, q \in M,\\
+d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\
+d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2
+\]`
+This LSH family is called `(r1, r2, p1, p2)`-sensitive.
+
+In this section, we call a pair of input features a false positive if the two features are hashed into the same hash bucket but are far apart in distance, and a false negative if the two features are close in distance but are not in the same hash bucket.
+
+## Bucketed Random Projection for Euclidean Distance
+
+[Bucketed Random Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions) is the LSH family in `spark.ml` for Euclidean distance. The Euclidean distance is defined as follows:
+`\[
+d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}
+\]`
+Its LSH family projects features onto a random unit vector and divides the projected results into hash buckets:
+`\[
+h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor
+\]`
+where `v` is a normalized random unit vector and `r` is a user-defined bucket length. The bucket length can be used to control the average size of hash buckets. A larger bucket length means a higher probability for features to be in the same bucket.
+
+Bucketed Random Projection accepts arbitrary vectors as input features, and supports both sparse and dense vectors.
+
+Refer to the [RandomProjection Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
+
+Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %}
+
+## MinHash for Jaccard Distance
+[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in `spark.ml` for Jaccard distance, where input features are sets of natural numbers. The Jaccard distance of two sets is defined by the cardinality of their intersection and union:
+`\[
+d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|}
+\]`
+As its LSH family, MinHash applies a random hash function `g` to each element in the set and takes the minimum of all hashed values:
+`\[
+h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a))
+\]`
+
+The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of that element in the set. While both dense and sparse vectors are supported, sparse vectors are typically recommended for efficiency. For example, `Vectors.sparse(10, Array[(2, 1.0), (3, 1.0), (5, 1.0)])` means there are 10 elements in the space, and this set contains elem 2, elem 3 and elem 5. All non-zero values are treated as binary "1" values.
+
+**Note:** Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero entry.
+
+Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
+
+Refer to the [MinHash Java docs](api/java/org/apache/spark/ml/feature/MinHash.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}
```
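As a rough illustration of the two hash families described in this diff, here is a small self-contained sketch in plain Python. This is not the `spark.ml` implementation: `brp_bucket`, `min_hash`, the affine hash `g(e) = (a*e + b) mod prime`, and all constants are arbitrary choices for the example.

```python
import math

def brp_bucket(x, v, r):
    """Bucketed random projection: h(x) = floor((x . v) / r),
    where v is a random unit vector and r the bucket length."""
    dot = sum(a * b for a, b in zip(x, v))
    return math.floor(dot / r)

def jaccard_distance(a, b):
    """d(A, B) = 1 - |A n B| / |A u B| for two sets."""
    return 1.0 - len(a & b) / len(a | b)

def min_hash(s, a, b, prime=2038074743):
    """MinHash: apply an affine hash g to each element and keep the minimum."""
    return min((a * e + b) % prime for e in s)

print(brp_bucket([1.0, 2.0], [0.6, 0.8], 1.0))  # dot = 0.6 + 1.6 = 2.2 -> bucket 2
print(jaccard_distance({2, 3, 5}, {3, 5, 7}))   # 1 - 2/4 = 0.5
print(min_hash({2, 3, 5}, a=3, b=7))            # min(13, 16, 22) = 13
```

A larger `r` in `brp_bucket` widens each bucket, matching the doc's note that a larger bucket length raises the probability of two features colliding.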
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395345

--- Diff: docs/ml-features.md --- (same hunk as above)
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90394053

--- Diff: docs/ml-features.md --- (same hunk as above)
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90394630

--- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala ---

```diff
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+// scalastyle:off println
+package org.apache.spark.examples.ml
+
+// $example on$
+import org.apache.spark.ml.feature.MinHashLSH
+import org.apache.spark.ml.linalg.Vectors
+// $example off$
+import org.apache.spark.sql.SparkSession
+
+object ApproxSimilarityJoinExample {
+  def main(args: Array[String]): Unit = {
+    // Creates a SparkSession
+    val spark = SparkSession
+      .builder
+      .appName("ApproxSimilarityJoinExample")
+      .getOrCreate()
+
+    // $example on$
+    val dfA = spark.createDataFrame(Seq(
+      (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+      (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
+      (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
+    )).toDF("id", "keys")
+
+    val dfB = spark.createDataFrame(Seq(
+      (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))),
+      (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))),
+      (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0))))
+    )).toDF("id", "keys")
+
+    val mh = new MinHashLSH()
+      .setNumHashTables(5)
+      .setInputCol("keys")
+      .setOutputCol("values")
+
+    val model = mh.fit(dfA)
+    model.approxSimilarityJoin(dfA, dfB, 0.6).show()
+
+    // Cache the transformed columns
+    val transformedA = model.transform(dfA)
+    val transformedB = model.transform(dfB)
+    model.approxSimilarityJoin(transformedA, transformedB, 0.6).show()
+
+    // Self Join
+    model.approxSimilarityJoin(dfA, dfA, 0.6).filter("datasetA.id < datasetB.id").show()
```
--- End diff --

Just a note - will `approxSimilarityJoin` return duplicates? We should think about removing them automatically then?
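The duplicates question can be seen with a plain-Python sketch of a naive similarity self-join (no Spark; `jaccard_distance` and `naive_self_join` are hypothetical helpers for illustration): every qualifying pair comes back in both orders, which is what a filter like `datasetA.id < datasetB.id` removes.

```python
def jaccard_distance(a, b):
    # d(A, B) = 1 - |A n B| / |A u B|
    return 1.0 - len(a & b) / len(a | b)

def naive_self_join(ds, threshold):
    # All ordered pairs of distinct ids within the distance threshold,
    # so each close pair appears twice: (a, b) and (b, a).
    return [(i, j) for i, a in ds.items() for j, b in ds.items()
            if i != j and jaccard_distance(a, b) < threshold]

ds = {0: {0, 1, 2}, 1: {2, 3, 4}, 2: {0, 2, 4}}
pairs = naive_self_join(ds, 0.9)
deduped = [(i, j) for i, j in pairs if i < j]  # keep one order per pair
print(len(pairs))    # 6: every close pair appears in both orders
print(len(deduped))  # 3: one copy per pair after the id filter
```

Whether Spark's `approxSimilarityJoin` itself emits both orders for a self-join is exactly what the reviewer is asking; the sketch only shows why such a filter is the usual dedup idiom.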
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395495 --- Diff: docs/ml-features.md --- @@ -1478,3 +1478,139 @@ for more details on the API. {% include_example python/ml/chisq_selector_example.py %} + +# Locality Sensitive Hashing +[Locality Sensitive Hashing(LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) is an important class of hashing techniques, which is commonly used in clustering, approximate nearest neighbor search and outlier detection with large datasets. + +The general idea of LSH is to use a family of functions (we call them LSH families) to hash data points into buckets, so that the data points which are close to each other are in the same buckets with high probability, while data points that are far away from each other are very likely in different buckets. A formal definition of LSH family is as follows: + +In a metric space `(M, d)`, where `M` is a set and `d` is a distance function on `M`, an LSH family is a family of functions `h` that satisfy the following properties: +`\[ +\forall p, q \in M,\\ +d(p,q) < r1 \Rightarrow Pr(h(p)=h(q)) \geq p1\\ +d(p,q) > r2 \Rightarrow Pr(h(p)=h(q)) \leq p2 +\]` +This LSH family is called `(r1, r2, p1, p2)`-sensitive. + +In this section, we call a pair of input features a false positive if the two features are hashed into the same hash bucket but they are far away in distance, and we define false negative as the pair of features when their distance are close but they are not in the same hash bucket. + +## Bucketed Random Projection for Euclidean Distance + +[Bucketed Random Projection](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions) is the LSH family in `spark.ml` for Euclidean distance. 
The Euclidean distance is defined as follows: +`\[ +d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2} +\]` +Its LSH family projects features onto a random unit vector and divide the projected results to hash buckets: +`\[ +h(\mathbf{x}) = \lfloor \frac{\mathbf{x} \cdot \mathbf{v}}{r} \rfloor +\]` +where `v` is a normalized random unit vector and `r` is user-defined bucket length. The bucket length can be used to control the average size of hash buckets. A larger bucket length means higher probability for features to be in the same bucket. + +Bucketed Random Projection accepts arbitrary vectors as input features, and supports both sparse and dense vectors. + + + + +Refer to the [RandomProjection Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection) +for more details on the API. + +{% include_example scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %} + + + + +Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaBucketedRandomProjectionLSHExample.java %} + + + +## MinHash for Jaccard Distance +[MinHash](https://en.wikipedia.org/wiki/MinHash) is the LSH family in `spark.ml` for Jaccard distance where input features are sets of natural numbers. Jaccard distance of two sets is defined by the cardinality of their intersection and union: +`\[ +d(\mathbf{A}, \mathbf{B}) = 1 - \frac{|\mathbf{A} \cap \mathbf{B}|}{|\mathbf{A} \cup \mathbf{B}|} +\]` +As its LSH family, MinHash applies a random hash function `g` to each elements in the set and take the minimum of all hashed values: +`\[ +h(\mathbf{A}) = \min_{a \in \mathbf{A}}(g(a)) +\]` + +The input sets for MinHash are represented as binary vectors, where the vector indices represent the elements themselves and the non-zero values in the vector represent the presence of that element in the set. 
While both dense and sparse vectors are supported, sparse vectors are typically recommended for efficiency. For example, `Vectors.sparse(10, Seq((2, 1.0), (3, 1.0), (5, 1.0)))` means there are 10 elements in the space. This set contains elements 2, 3, and 5. All non-zero values are treated as binary "1" values. + +**Note:** Empty sets cannot be transformed by MinHash, which means any input vector must have at least 1 non-zero entry. + + + + +Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash) +for more details on the API. + +{% include_example scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %} + + + + +Refer to the [MinHash Java docs](api/java/org/apache/spark/ml/feature/MinHash.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaMinHashLSHExample.java %}
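The two hash families in the quoted guide can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the `spark.ml` implementation: the hash-coefficient scheme `g(a) = (a*coef + bias) mod prime`, the prime, and the toy inputs are assumptions chosen for the example.

```python
import math
import random

def brp_hash(x, v, r):
    """Bucketed Random Projection: h(x) = floor((x . v) / r),
    where v is a random unit vector and r is the bucket length."""
    projection = sum(xi * vi for xi, vi in zip(x, v))
    return math.floor(projection / r)

def min_hash(elements, num_hashes=3, prime=2038074743, seed=0):
    """MinHash: h(A) = min over a in A of g(a), for several random
    hash functions g(a) = (a*coef + bias) mod prime (assumed scheme)."""
    rng = random.Random(seed)
    funcs = [(rng.randrange(1, prime), rng.randrange(prime))
             for _ in range(num_hashes)]
    return [min((a * e + b) % prime for e in elements) for a, b in funcs]

# A larger bucket length r maps more projections into the same bucket:
coarse = brp_hash([1.0, 2.0], [0.6, 0.8], 4.0)  # floor(2.2 / 4) = 0
fine = brp_hash([1.0, 2.0], [0.6, 0.8], 1.0)    # floor(2.2 / 1) = 2

# Identical sets always collide under MinHash; similar sets often do.
sig_a = min_hash({2, 3, 5})
sig_b = min_hash({2, 3, 5})
```

Here `{2, 3, 5}` corresponds to the guide's sparse-vector example: the set containing elements 2, 3, and 5 of a 10-element space.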
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395294 --- Diff: docs/ml-features.md ---
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395451 --- Diff: docs/ml-features.md ---
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90394871 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/MinHashLSHExample.scala --- @@ -0,0 +1,54 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +// scalastyle:off println +package org.apache.spark.examples.ml + +// $example on$ +import org.apache.spark.ml.feature.MinHashLSH +import org.apache.spark.ml.linalg.Vectors +// $example off$ +import org.apache.spark.sql.SparkSession + +object MinHashLSHExample { --- End diff -- This and the min hash transformation example are almost the same. Perhaps we can remove the transformation example, and adjust the user guide section to refer (and link) to the code examples for `MinHashLSH` and `BucketedRandomProjectionLSH` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90393584 --- Diff: docs/ml-features.md --- +Refer to the [MinHash Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHash) --- End diff -- Should be updated to `MinHashLSH`?
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90395459 --- Diff: docs/ml-features.md ---
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90394571 --- Diff: examples/src/main/scala/org/apache/spark/examples/ml/ApproxSimilarityJoinExample.scala --- @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +// scalastyle:off println +package org.apache.spark.examples.ml + +// $example on$ +import org.apache.spark.ml.feature.MinHashLSH +import org.apache.spark.ml.linalg.Vectors +// $example off$ +import org.apache.spark.sql.SparkSession + +object ApproxSimilarityJoinExample { + def main(args: Array[String]): Unit = { +// Creates a SparkSession +val spark = SparkSession + .builder + .appName("ApproxSimilarityJoinExample") + .getOrCreate() + +// $example on$ +val dfA = spark.createDataFrame(Seq( + (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))), + (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))), + (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0)))) +)).toDF("id", "keys") + +val dfB = spark.createDataFrame(Seq( + (3, Vectors.sparse(6, Seq((1, 1.0), (3, 1.0), (5, 1.0)))), + (4, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (5, 1.0)))), + (5, Vectors.sparse(6, Seq((1, 1.0), (2, 1.0), (4, 1.0)))) +)).toDF("id", "keys") + +val mh = new MinHashLSH() + .setNumHashTables(5) + .setInputCol("keys") + .setOutputCol("values") + +val model = mh.fit(dfA) +model.approxSimilarityJoin(dfA, dfB, 0.6).show() + +// Cache the transformed columns --- End diff -- This mentions caching but doesn't cache.
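The distance that `approxSimilarityJoin` approximates in the example above is the Jaccard distance over the sets of active vector indices. A brute-force sketch in plain Python, using the same three-row datasets as the quoted example (illustrative only; the real method avoids comparing every pair by joining on hash buckets):

```python
def jaccard_distance(a, b):
    """d(A, B) = 1 - |A intersect B| / |A union B|."""
    return 1.0 - len(a & b) / len(a | b)

def brute_force_join(rows_a, rows_b, threshold):
    """Exact counterpart of an approximate similarity join: return all
    cross-dataset pairs whose Jaccard distance is below the threshold."""
    return [(ia, ib, jaccard_distance(sa, sb))
            for ia, sa in rows_a
            for ib, sb in rows_b
            if jaccard_distance(sa, sb) < threshold]

# Sets of active indices taken from the example's sparse vectors.
df_a = [(0, {0, 1, 2}), (1, {2, 3, 4}), (2, {0, 2, 4})]
df_b = [(3, {1, 3, 5}), (4, {2, 3, 5}), (5, {1, 2, 4})]

# Same 0.6 threshold as approxSimilarityJoin(dfA, dfB, 0.6).
pairs = brute_force_join(df_a, df_b, 0.6)
# Each matched pair shares 2 of 4 distinct elements: distance 0.5.
```

On this data, the pairs within threshold are (0, 5), (1, 4), (1, 5), and (2, 5), each at distance 0.5; all other pairs are at 0.8 or 1.0.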
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90393279 --- Diff: docs/ml-features.md --- +Refer to the [RandomProjection Java docs](api/java/org/apache/spark/ml/feature/RandomProjection.html) --- End diff -- Same here
[GitHub] spark pull request #15795: [SPARK-18081] Add user guide for Locality Sensiti...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/15795#discussion_r90393263 --- Diff: docs/ml-features.md --- +Refer to the [RandomProjection Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RandomProjection) --- End diff -- This Scaladoc link should be for `BucketedRandomProjection` now
[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] Follow up PR to Close "kryo auto...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16096 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69455/ Test PASSed.
[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] Follow up PR to Close "kryo auto...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16096 Merged build finished. Test PASSed.
[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] Follow up PR to Close "kryo auto...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16096 **[Test build #69455 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69455/consoleFull)** for PR 16096 at commit [`76e0143`](https://github.com/apache/spark/commit/76e01432398295f8c48606dbba4847eede3815a2). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16095: [SPARK-18666][Web UI] Remove the codes checking deprecat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16095 Merged build finished. Test PASSed.
[GitHub] spark issue #16095: [SPARK-18666][Web UI] Remove the codes checking deprecat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16095 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69452/ Test PASSed.
[GitHub] spark issue #16095: [SPARK-18666][Web UI] Remove the codes checking deprecat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16095 **[Test build #69452 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69452/consoleFull)** for PR 16095 at commit [`1bf8528`](https://github.com/apache/spark/commit/1bf8528f248aa9ed4908ebdd922d247cf610e88e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16097: [SPARK-18665] set job to "ERROR" when job is canceled
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16097 Merged build finished. Test PASSed.
[GitHub] spark issue #16097: [SPARK-18665] set job to "ERROR" when job is canceled
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16097 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69462/ Test PASSed.
[GitHub] spark issue #16097: [SPARK-18665] set job to "ERROR" when job is canceled
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16097 **[Test build #69462 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69462/consoleFull)** for PR 16097 at commit [`8b9322d`](https://github.com/apache/spark/commit/8b9322d1f8421a1868d8d39472d1b6f3681b4de3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16080 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69459/ Test FAILed.
[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16080 Merged build finished. Test FAILed.
[GitHub] spark pull request #16097: [SPARK-18665] set job to "ERROR" when job is canc...
Github user cenyuhai closed the pull request at: https://github.com/apache/spark/pull/16097
[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16080 **[Test build #69459 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69459/consoleFull)** for PR 16080 at commit [`198d273`](https://github.com/apache/spark/commit/198d2734936fdadb39bdc77dc223b4ee41c660ba). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16037 **[Test build #69463 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69463/consoleFull)** for PR 16037 at commit [`d7ebc7d`](https://github.com/apache/spark/commit/d7ebc7df63b89d7ba7cd7e3a688089749c393082). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16037 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69463/ Test FAILed.
[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16037 Merged build finished. Test FAILed.
[GitHub] spark issue #16097: [SPARK-18665] set job to "ERROR" when job is canceled
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16097 **[Test build #69462 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69462/consoleFull)** for PR 16097 at commit [`8b9322d`](https://github.com/apache/spark/commit/8b9322d1f8421a1868d8d39472d1b6f3681b4de3).
[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16037 **[Test build #69463 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69463/consoleFull)** for PR 16037 at commit [`d7ebc7d`](https://github.com/apache/spark/commit/d7ebc7df63b89d7ba7cd7e3a688089749c393082).
[GitHub] spark pull request #16097: [SPARK-18665] set job to "ERROR" when job is canc...
GitHub user cenyuhai opened a pull request: https://github.com/apache/spark/pull/16097 [SPARK-18665] set job to "ERROR" when job is canceled ## What changes were proposed in this pull request? Set the job state to "ERROR" when the job is canceled. You can merge this pull request into a Git repository by running: $ git pull https://github.com/cenyuhai/spark SPARK-18665 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16097.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16097 commit 869eaaf23f79eefbc6a8ff7a7b9efbc4a9f8c6b7 Author: cenyuhai <261810...@qq.com> Date: 2016-08-21T03:55:04Z Merge pull request #8 from apache/master merge latest code to my fork commit b6b0d0a41c1aa59bc97a0aa438619d903b78b108 Author: cenyuhai <261810...@qq.com> Date: 2016-09-06T03:03:08Z Merge pull request #9 from apache/master Merge latest code to my fork commit abd7924eab25b6dfdfd78c23a78dadcb3b9fbe1e Author: cenyuhai <261810...@qq.com> Date: 2016-09-08T17:10:12Z Merge pull request #10 from apache/master Merge latest code to my fork commit 4b460e218244cdb0884e73c5fca29cc43b516972 Author: cenyuhai Date: 2016-09-15T09:25:24Z Merge remote-tracking branch 'remotes/apache/master' commit 22cb0a6f6f60ffae4a449727959cdd2940699f8e Author: cenyuhai <261810...@qq.com> Date: 2016-12-01T06:09:42Z Merge pull request #12 from apache/master Merge latest code to my branch commit 8b9322d1f8421a1868d8d39472d1b6f3681b4de3 Author: cenyuhai <261810...@qq.com> Date: 2016-12-01T06:36:26Z set statement state to error after user canceled job
[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16037#discussion_r90391752 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala --- @@ -241,16 +239,25 @@ object LBFGS extends Logging { val bcW = data.context.broadcast(w) val localGradient = gradient - val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))( - seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) => -val l = localGradient.compute( - features, label, bcW.value, grad) -(grad, loss + l) - }, - combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) => + // Given (current accumulated gradient, current loss) and (label, features) + // tuples, updates the current gradient and current loss + val seqOp = (c: (Vector, Double), v: (Double, Vector)) => +(c, v) match { + case ((grad, loss), (label, features)) => +val denseGrad = grad.toDense +val l = localGradient.compute(features, label, bcW.value, denseGrad) +(Vectors.dense(denseGrad.values), loss + l) --- End diff -- I can compile with my example code...
[GitHub] spark issue #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge vector...
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/16037 ok to test
[GitHub] spark issue #9276: [SPARK-9858][SPARK-9859][SPARK-9861][SQL] Add an Exchange...
Github user dreamworks007 commented on the issue: https://github.com/apache/spark/pull/9276 @yhuai, could you please let us know whether there are any known issues or limitations with this feature? Has this feature been tested on some large jobs? We are also considering automatically determining shuffle partitions, and happened to see this PR, so we are interested in exploring this feature a bit to see if we could productionize it for all jobs (by default).
[GitHub] spark issue #15915: [SPARK-18485][CORE] Underlying integer overflow when cre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15915 **[Test build #69461 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69461/consoleFull)** for PR 15915 at commit [`7c3e2c7`](https://github.com/apache/spark/commit/7c3e2c7cb2b45c0481f5272d8ee1f32249095dec).
[GitHub] spark issue #15915: [SPARK-18485][CORE] Underlying integer overflow when cre...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15915 **[Test build #69460 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69460/consoleFull)** for PR 15915 at commit [`fffd5f5`](https://github.com/apache/spark/commit/fffd5f5150ae1108bfe790112ce1a42d030d5576).
[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] backport to branch-2.0
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16096 Can you use a meaningful title & description instead of saying backport? This is important because it becomes part of the commit history. You can put the backport message in the body. Thanks.
[GitHub] spark pull request #16037: [SPARK-18471][MLLIB] In LBFGS, avoid sending huge...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/16037#discussion_r90388974 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala --- @@ -241,16 +239,25 @@ object LBFGS extends Logging { val bcW = data.context.broadcast(w) val localGradient = gradient - val (gradientSum, lossSum) = data.treeAggregate((Vectors.zeros(n), 0.0))( - seqOp = (c, v) => (c, v) match { case ((grad, loss), (label, features)) => -val l = localGradient.compute( - features, label, bcW.value, grad) -(grad, loss + l) - }, - combOp = (c1, c2) => (c1, c2) match { case ((grad1, loss1), (grad2, loss2)) => + // Given (current accumulated gradient, current loss) and (label, features) + // tuples, updates the current gradient and current loss + val seqOp = (c: (Vector, Double), v: (Double, Vector)) => +(c, v) match { + case ((grad, loss), (label, features)) => +val denseGrad = grad.toDense +val l = localGradient.compute(features, label, bcW.value, denseGrad) +(Vectors.dense(denseGrad.values), loss + l) --- End diff -- Is this really necessary?
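For readers following the diff, the seqOp/combOp aggregation it modifies can be sketched in plain Python (a stand-in, not Spark code; the squared-error gradient, the weights, and all names here are hypothetical illustrations of how `treeAggregate` folds and merges partial results):

```python
from functools import reduce

weights = [0.5, -0.25]  # hypothetical model weights (stand-in for bcW.value)

def seq_op(acc, example):
    """Fold one (label, features) example into the (gradient, loss)
    accumulator, mutating the dense gradient buffer in place, analogous
    to localGradient.compute in the diff."""
    grad, loss = acc
    label, features = example
    pred = sum(w * f for w, f in zip(weights, features))
    err = pred - label
    for i, f in enumerate(features):
        grad[i] += err * f
    return grad, loss + 0.5 * err * err

def comb_op(a, b):
    """Merge two partial (gradient, loss) accumulators."""
    (g1, l1), (g2, l2) = a, b
    return [x + y for x, y in zip(g1, g2)], l1 + l2

data = [(1.0, [1.0, 2.0]), (0.0, [0.5, 0.5])]

# Fold each "partition" (here: single example) with seq_op, then merge
# the partial results with comb_op, as treeAggregate would.
partials = [seq_op(([0.0, 0.0], 0.0), ex) for ex in data]
grad_sum, loss_sum = reduce(comb_op, partials)
print(grad_sum, loss_sum)
```

The reviewer's question concerns the `grad.toDense` / `Vectors.dense(denseGrad.values)` round trip in the Scala code; the sketch above simply keeps the accumulator dense throughout.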
[GitHub] spark issue #13706: [SPARK-15988] [SQL] Implement DDL commands: Create/Drop ...
Github user lshmouse commented on the issue: https://github.com/apache/spark/pull/13706 @lianhuiwang I think the problem is that there is no need to check whether macroFunction is resolved. The data type may be cast dynamically according to the SQL data type.
[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/16076 @jkbradley I addressed the other comments except for [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291), since I think it's a SparkR-related issue and should be listed in the [SparkR section of the migration guide](http://spark.apache.org/docs/latest/sparkr.html#migration-guide). I'm preparing a PR to update the SparkR migration guide. Thanks.
[GitHub] spark pull request #16076: [SPARK-18324][ML][DOC] Update ML programming and ...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/16076#discussion_r90386957 --- Diff: docs/ml-guide.md --- @@ -60,152 +60,37 @@ MLlib is under active development. The APIs marked `Experimental`/`DeveloperApi` may change in future releases, and the migration guide below will explain all changes between releases. -## From 1.6 to 2.0 +## From 2.0 to 2.1 ### Breaking changes -There were several breaking changes in Spark 2.0, which are outlined below. - -**Linear algebra classes for DataFrame-based APIs** - -Spark's linear algebra dependencies were moved to a new project, `mllib-local` -(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). -As part of this change, the linear algebra classes were copied to a new package, `spark.ml.linalg`. -The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes, -leading to a few breaking changes, predominantly in various model classes -(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list). - -**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`. - -_Converting vectors and matrices_ - -While most pipeline components support backward compatibility for loading, -some existing `DataFrames` and pipelines in Spark versions prior to 2.0, that contain vector or matrix -columns, may need to be migrated to the new `spark.ml` vector and matrix types. -Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to `spark.ml.linalg` types -(and vice versa) can be found in `spark.mllib.util.MLUtils`. - -There are also utility methods available for converting single instances of -vectors and matrices. Use the `asML` method on a `mllib.linalg.Vector` / `mllib.linalg.Matrix` -for converting to `ml.linalg` types, and -`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML` -for converting to `mllib.linalg` types. 
- - - - -{% highlight scala %} -import org.apache.spark.mllib.util.MLUtils - -// convert DataFrame columns -val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) -val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) -// convert a single vector or matrix -val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML -val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML -{% endhighlight %} - -Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail. - - - - -{% highlight java %} -import org.apache.spark.mllib.util.MLUtils; -import org.apache.spark.sql.Dataset; - -// convert DataFrame columns -Dataset convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF); -Dataset convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); -// convert a single vector or matrix -org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML(); -org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML(); -{% endhighlight %} - -Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail. - - - - -{% highlight python %} -from pyspark.mllib.util import MLUtils - -# convert DataFrame columns -convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) -convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) -# convert a single vector or matrix -mlVec = mllibVec.asML() -mlMat = mllibMat.asML() -{% endhighlight %} - -Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further detail. 
- - - **Deprecated methods removed** -Several deprecated methods were removed in the `spark.mllib` and `spark.ml` packages: - -* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator` -* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml` -* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as `DeveloperApi`) -* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these functions are available on `RDD`s directly, and were marked as `DeveloperApi`) -* `defaultStategy` in `mllib.tree.configuration.Strategy` -* `build` in `mllib.tree.Node` -* libsvm loaders for multiclass and load/save labeledData methods in `mllib.util.MLUtils` - -A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810). +* `setLabelCol` in `feature.ChiSqSelectorModel` +* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`) +* `numTrees` in
[GitHub] spark issue #16080: [SPARK-18647][SQL] do not put provider in table properti...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16080 **[Test build #69459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69459/consoleFull)** for PR 16080 at commit [`198d273`](https://github.com/apache/spark/commit/198d2734936fdadb39bdc77dc223b4ee41c660ba).
[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/16069 I will update the PR to fix the deprecation warnings in project/SparkBuild.scala.
[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16076 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69456/ Test PASSed.
[GitHub] spark issue #16093: [SPARK-18663][SQL] Simplify CountMinSketch aggregate imp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16093 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69451/ Test FAILed.
[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16076 Merged build finished. Test PASSed.
[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16076 **[Test build #69456 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69456/consoleFull)** for PR 16076 at commit [`e784901`](https://github.com/apache/spark/commit/e784901f1e2b1cd10c15537efa28077f8e67a768). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16093: [SPARK-18663][SQL] Simplify CountMinSketch aggregate imp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16093 Merged build finished. Test FAILed.
[GitHub] spark issue #16093: [SPARK-18663][SQL] Simplify CountMinSketch aggregate imp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16093 **[Test build #69451 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69451/consoleFull)** for PR 16093 at commit [`9f07563`](https://github.com/apache/spark/commit/9f075633896677b2dbcbd37c93a60544e39a0fea). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16084: [SPARK-18654][SQL] Remove unreachable patterns in makeRo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16084 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69450/ Test PASSed.
[GitHub] spark issue #16084: [SPARK-18654][SQL] Remove unreachable patterns in makeRo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16084 Merged build finished. Test PASSed.
[GitHub] spark issue #16084: [SPARK-18654][SQL] Remove unreachable patterns in makeRo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16084 **[Test build #69450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69450/consoleFull)** for PR 16084 at commit [`3535acf`](https://github.com/apache/spark/commit/3535acf4f84a9057f4bbb88a81e4fff5f5167c0d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #16083: [SPARK-18097][SQL] Add exception catch to handle ...
Github user thomastechs commented on a diff in the pull request: https://github.com/apache/spark/pull/16083#discussion_r90386133

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -189,14 +189,18 @@ case class DropTableCommand(
     if (!catalog.isTemporaryTable(tableName) && catalog.tableExists(tableName)) {
       // If the command DROP VIEW is to drop a table or DROP TABLE is to drop a view
       // issue an exception.
-      catalog.getTableMetadata(tableName).tableType match {
-        case CatalogTableType.VIEW if !isView =>
-          throw new AnalysisException(
-            "Cannot drop a view with DROP TABLE. Please use DROP VIEW instead")
-        case o if o != CatalogTableType.VIEW && isView =>
-          throw new AnalysisException(
-            s"Cannot drop a table with DROP VIEW. Please use DROP TABLE instead")
-        case _ =>
+      try {
+        catalog.getTableMetadata(tableName).tableType match {
+          case CatalogTableType.VIEW if !isView =>
+            throw new AnalysisException(
+              "Cannot drop a view with DROP TABLE. Please use DROP VIEW instead")
+          case o if o != CatalogTableType.VIEW && isView =>
+            throw new AnalysisException(
+              s"Cannot drop a table with DROP VIEW. Please use DROP TABLE instead")
+          case _ =>
+        }
+      } catch {
+        case e: QueryExecutionException => log.warn(e.toString, e)
--- End diff --

@gatorsmile and @Davies ; In that case, any suggestions to mock the metadata corrupt scenario?
[GitHub] spark issue #16000: [SPARK-18537][Web UI]Add a REST api to spark streaming
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16000 **[Test build #69458 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69458/consoleFull)** for PR 16000 at commit [`651dc67`](https://github.com/apache/spark/commit/651dc679b865603be677ca9d30b975ce5c3c5df0).
[GitHub] spark issue #16062: [SPARK-18629][SQL] Fix numPartition of JDBCSuite Testcas...
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/16062 @srowen Thanks for the review.
[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16089 **[Test build #69457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69457/consoleFull)** for PR 16089 at commit [`56667bd`](https://github.com/apache/spark/commit/56667bd86c1dbb52fb47134042e5a529241a0637).
[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/16069 Thanks, @dongjoon-hyun. Yes, good catch. I have updated the description.
[GitHub] spark pull request #16089: [SPARK-18658][SQL] Write text records directly to...
Github user NathanHowell commented on a diff in the pull request: https://github.com/apache/spark/pull/16089#discussion_r90385252

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CodecStreams.scala ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import java.io.{OutputStream, OutputStreamWriter}
+import java.nio.charset.{Charset, StandardCharsets}
+
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.io.compress._
+import org.apache.hadoop.mapreduce.JobContext
+import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
+import org.apache.hadoop.util.ReflectionUtils
+
+private[spark] object CodecStreams {
--- End diff --

Looks that way, I've removed it.
[GitHub] spark issue #16076: [SPARK-18324][ML][DOC] Update ML programming and migrati...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16076 **[Test build #69456 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69456/consoleFull)** for PR 16076 at commit [`e784901`](https://github.com/apache/spark/commit/e784901f1e2b1cd10c15537efa28077f8e67a768).
[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...
Github user NathanHowell commented on the issue: https://github.com/apache/spark/pull/16089 Doh, forgot to run the Hive tests. Should be fixed now.
[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...
Github user weiqingy commented on the issue: https://github.com/apache/spark/pull/16069 @srowen Thanks for the information. For the sbt update, I think the only file that currently needs to change is `project/build.properties`. The [Jenkins build console output](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69365/consoleFull) shows `sbt 0.13.13` has been downloaded and used in that build. Correct me if I miss anything.
```
Attempting to fetch sbt
Launching sbt from build/sbt-launch-0.13.13.jar
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
Getting org.scala-sbt sbt 0.13.13 ...
```
I have updated Zinc and the Maven plugins. `mvn versions:display-plugin-updates` shows the following plugin updates are available:
```
[INFO]   maven-assembly-plugin ... 2.6 -> 3.0.0
[INFO]   maven-compiler-plugin ... 3.5.1 -> 3.6.0
[INFO]   maven-jar-plugin ... 2.6 -> 3.0.2
[INFO]   maven-javadoc-plugin ... 2.10.3 -> 2.10.4
[INFO]   maven-source-plugin ... 2.4 -> 3.0.1
[INFO]   org.codehaus.mojo:build-helper-maven-plugin ... 1.10 -> 1.12
[INFO]   org.codehaus.mojo:exec-maven-plugin ... 1.4.0 -> 1.5.0
```
Also, in `Building Spark Project External Flume Sink 2.1.0-SNAPSHOT`:
```
The following plugin updates are available:
[INFO]   org.apache.avro:avro-maven-plugin ... 1.7.7 -> 1.8.1
```
```
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <configuration>
    <outputDirectory>${project.basedir}/target/scala-${scala.binary.version}/src_managed/main/compiled_avro</outputDirectory>
  </configuration>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>idl-protocol</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```
If we want to update `avro-maven-plugin` from 1.7.7 to 1.8.1, we need to change the `avro.version` property (currently `1.7.7`) in the `spark-parent_2.11` pom file. That will affect some dependencies, e.g.
```
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>${avro.version}</version>
  <scope>${hadoop.deps.scope}</scope>
</dependency>
```
```
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-ipc</artifactId>
  <classifier>tests</classifier>
  <version>${avro.version}</version>
  <scope>test</scope>
</dependency>
```
So avro-maven-plugin was not updated in this PR.
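To make the sbt part of the upgrade concrete: for an sbt 0.13.x project the launcher version is driven by a single property file, so the whole change is one line. A minimal sketch, assuming the standard sbt project layout (the version is the one discussed in this PR):

```shell
# The sbt launcher script reads project/build.properties and downloads the
# matching sbt-launch jar on the next build; no other tracked file changes.
mkdir -p project
echo 'sbt.version=0.13.13' > project/build.properties
cat project/build.properties
```

Nothing else needs to be edited for the launcher bump itself; plugin and Zinc versions live elsewhere (pom files and `build/mvn`).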
[GitHub] spark issue #16096: [SPARK-18617][BACKPORT] backport to branch-2.0
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16096 **[Test build #69455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69455/consoleFull)** for PR 16096 at commit [`76e0143`](https://github.com/apache/spark/commit/76e01432398295f8c48606dbba4847eede3815a2).
[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16069 nit. In PR description, 0.3.9 instead of 0.13.9?

> zinc: 0.13.9 -> 0.3.11,
[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/16086 @gatorsmile would it be possible to review this? You would be familiar with Kanji?
[GitHub] spark pull request #16096: [SPARK-18617][BACKPORT] backport to branch-2.0
GitHub user uncleGen opened a pull request: https://github.com/apache/spark/pull/16096 [SPARK-18617][BACKPORT] backport to branch-2.0

## What changes were proposed in this pull request?

backport #16052 to branch-2.0 with incremental update in #16091

## How was this patch tested?

new unit test

cc @zsxwing @rxin

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uncleGen/spark branch-2.0-SPARK-18617

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16096.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16096

commit 76e01432398295f8c48606dbba4847eede3815a2
Author: uncleGen
Date: 2016-12-01T05:07:59Z

    backport to branch-2.0
[GitHub] spark issue #16056: [SPARK-18623][SQL] Add `returnNullable` to `StaticInvoke...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16056 **[Test build #69454 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69454/consoleFull)** for PR 16056 at commit [`f7b1aa0`](https://github.com/apache/spark/commit/f7b1aa05a64bc8efc43f6e932c1fcb06f18866f7).
[GitHub] spark issue #16069: [WIP][SPARK-18638][BUILD] Upgrade sbt, Zinc, and Maven p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16069 **[Test build #69453 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69453/consoleFull)** for PR 16069 at commit [`62f0ddb`](https://github.com/apache/spark/commit/62f0ddb44da7711b1066923419762da8b3628780).
[GitHub] spark issue #15918: [SPARK-18122][SQL][WIP]Fallback to Kryo for unsupported ...
Github user koertkuipers commented on the issue: https://github.com/apache/spark/pull/15918 if we do a flag i would also prefer it if the current implicits are more narrow if the flag is not set, if possible.
[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16086 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69446/ Test PASSed.
[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16086 Merged build finished. Test PASSed.
[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16086 **[Test build #69446 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69446/consoleFull)** for PR 16086 at commit [`350e1ae`](https://github.com/apache/spark/commit/350e1ae6058014febf3b793f64fd7912c5cc814c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16095: [SPARK-18666][Web UI] Remove the codes checking deprecat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16095 **[Test build #69452 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69452/consoleFull)** for PR 16095 at commit [`1bf8528`](https://github.com/apache/spark/commit/1bf8528f248aa9ed4908ebdd922d247cf610e88e).
[GitHub] spark issue #16021: [SPARK-18593][SQL] JDBCRDD returns incorrect results for...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16021 Hi, @rxin. All 6 commits are not cherry-pickable. Especially, the last 2 commits have inevitable conflicts due to some wide commits (about `import order` and `syntax` cleanups). But I can find the following clean cherry-pick sequence:
```
$ git cherry-pick -x 28112657ea5919451291c21b4b8e1eb3db0ec8d4
$ git cherry-pick -x 0f6936b5f1c9b0be1c33b98ffb62a72ae0c3e2a8
$ git cherry-pick -x 7f443a6879fa33ca8adb682bd85df2d56fb5fcda
$ git cherry-pick -x 2aad2d372469aaf2773876cae98ef002fef03aa3
$ git cherry-pick -x 554d840a9ade79722c96972257435a05e2aa9d88
$ git cherry-pick -x 8c1b867cee816d0943184c7b485cd11e255d8130
$ git cherry-pick -x 5c2682b0c8fd2aeae2af1adb716ee0d5f8b85135
$ git cherry-pick -x ad5b7cfcca7a5feb83b9ed94b6e725c6d789579b
$ git cherry-pick -x 94f7a12b3c8e4a6ecd969893e562feb7ffba4c24
$ git diff HEAD~9 --stat
 sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala | 102 +++-
 sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala | 68 -
 sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala | 15 +
 3 files changed, 139 insertions(+), 46 deletions(-)
```
All are related commits in the above three files. If you don't mind, could you cherry-pick the above? The followings are the titles of them.

1. [SPARK-12236][SQL] JDBC filter tests all pass if filters are not really pushed down.
2. [SPARK-12249][SQL] JDBC non-equality comparison operator not pushed down.
3. [SPARK-12314][SQL] isnull operator not pushed down for JDBC datasource.
4. [SPARK-12315][SQL] isnotnull operator not pushed down for JDBC datasource.
5. Style fix for the previous 3 JDBC filter push down commits. (@rxin)
6. [SPARK-12446][SQL] Add unit tests for JDBCRDD internal functions
7. [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Support AND/OR/IN/LIKE push-down filters for JDBC
8. [SPARK-12409][SPARK-12387][SPARK-12391][SQL] Refactor filter pushdown for JDBCRDD and add few filters (Liang-Chi Hsieh)
9. [SPARK-10180][SQL] JDBC datasource are not processing EqualNullSafe filter

After your cherry-picking, I will create two PRs for the remaining inevitable-conflict commits.
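The clean-versus-conflict triage described above can be probed without committing to anything. A minimal sketch of a helper for that workflow; the function name and approach are illustrative, not part of Spark's tooling:

```shell
# Hypothetical helper: attempt a cherry-pick with --no-commit, report whether
# it applied cleanly, then restore the branch to its previous state either way.
# Usage: check_cherry_pick <commit-sha>
check_cherry_pick() {
  if git cherry-pick --no-commit "$1" >/dev/null 2>&1; then
    echo "clean: $1"
  else
    echo "conflict: $1"
  fi
  # Abort any in-progress pick (no-op after a clean --no-commit pick),
  # then discard whatever was staged or left in the working tree.
  git cherry-pick --abort >/dev/null 2>&1 || true
  git reset --hard -q HEAD
}
```

Running it over a candidate list of shas gives the same clean/conflict partition the comment above worked out by hand, without leaving the branch dirty.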
[GitHub] spark issue #16086: [SPARK-18653][SQL] Fix incorrect space padding for unico...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16086 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69443/ Test PASSed.
[GitHub] spark pull request #16095: [UI] Remove the codes checking deprecated config ...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/16095 [UI] Remove the codes checking deprecated config spark.sql.unsafe.enabled

## What changes were proposed in this pull request?

`spark.sql.unsafe.enabled` has been deprecated since 2.0, but there is still code in the UI that checks it. We should remove that code and clean this up.

## How was this patch tested?

Changes to related existing unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 remove-deprecated-config-code

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16095.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #16095

commit 1bf8528f248aa9ed4908ebdd922d247cf610e88e
Author: Liang-Chi Hsieh
Date: 2016-12-01T04:50:32Z

    Remove the codes checking deprecated config spark.sql.unsafe.enabled.
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15975 Merged build finished. Test PASSed.
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15975 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69444/ Test PASSed.
[GitHub] spark issue #15975: [SPARK-18538] [SQL] Fix Concurrent Table Fetching Using ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15975 **[Test build #69444 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69444/consoleFull)** for PR 15975 at commit [`728c103`](https://github.com/apache/spark/commit/728c103fc10d5118eff4ff5bf9372da8557ecf60). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16077: [SPARK-18643][SPARKR] SparkR hangs at session start when...
Github user yanboliang commented on the issue: https://github.com/apache/spark/pull/16077 LGTM
[GitHub] spark issue #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss up orig...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15994 Hi @windpiger, it seems something has gone wrong. Would you try rebasing this, please?
[GitHub] spark issue #16068: [SPARK-18637][SQL]Stateful UDF should be considered as n...
Github user zhzhan commented on the issue: https://github.com/apache/spark/pull/16068 @hvanhovell Thanks for looking at this. We have a large number of UDFs affected by this issue. For example, a UDF may give different results depending on the partition/sort order, but the UDF is pushed down below the partition/sort, resulting in unexpected behavior. I will work on finding some test cases for it.
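The pushdown hazard described above can be illustrated with a small, self-contained sketch (hypothetical, not code from the PR): a stateful function whose output depends on the order in which it sees rows produces different results when evaluated before versus after a sort, which is why such a UDF must be treated as non-deterministic and kept out of pushdown.

```python
# Hypothetical sketch: a stateful "UDF" that numbers rows as it sees them.
# Its output depends on evaluation order, so evaluating it before vs. after
# a sort yields different results -- the behavior described in the comment.

def make_row_number_udf():
    """Stateful UDF: returns an incrementing counter per invocation."""
    counter = {"n": 0}
    def udf(_row):
        counter["n"] += 1
        return counter["n"]
    return udf

rows = [("b", 2), ("a", 1), ("c", 3)]

# Plan A: sort first, then apply the UDF (what the user expects).
udf_a = make_row_number_udf()
after_sort = [(r[0], udf_a(r)) for r in sorted(rows)]

# Plan B: the optimizer pushes the UDF below the sort.
udf_b = make_row_number_udf()
tagged = [(r, udf_b(r)) for r in rows]
before_sort = [(r[0], n) for r, n in sorted(tagged)]

print(after_sort)   # counters follow sorted order
print(before_sort)  # counters follow original input order
```

The two plans disagree on which counter value each key receives, even though for a deterministic UDF they would be equivalent, which is exactly why pushdown is only safe for deterministic expressions.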
[GitHub] spark pull request #15910: [SPARK-18476][SPARKR][ML]:SparkR Logistic Regress...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15910
[GitHub] spark issue #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss up orig...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15994 Build finished. Test PASSed.
[GitHub] spark issue #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss up orig...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15994 **[Test build #69442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69442/consoleFull)** for PR 15994 at commit [`4c9f3a0`](https://github.com/apache/spark/commit/4c9f3a0aa96adcf64ecd5350f291f5a212128e30). * This patch passes all tests. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16089 Merged build finished. Test FAILed.
[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16089 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69449/ Test FAILed.
[GitHub] spark issue #15994: [SPARK-18555][SQL]DataFrameNaFunctions.fill miss up orig...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15994 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/69442/ Test PASSed.
[GitHub] spark issue #16089: [SPARK-18658][SQL] Write text records directly to a File...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16089 **[Test build #69449 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/69449/consoleFull)** for PR 16089 at commit [`298e507`](https://github.com/apache/spark/commit/298e507d5c42328de610d6109afb11076aadfb96). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.