[GitHub] spark pull request #23234: [SPARK-26233][SQL][BACKPORT-2.2] CheckOverflow wh...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/23234

[SPARK-26233][SQL][BACKPORT-2.2] CheckOverflow when encoding a decimal value

## What changes were proposed in this pull request?

When we encode a Decimal from an external source, we don't check for overflow. That check is useful not only to enforce that the value can be represented in the specified range, but also because it changes the underlying data to the right precision/scale. Since our code generation assumes that a decimal has exactly the same precision and scale as its data type, failing to enforce this can lead to corrupted output/results when there are subsequent transformations.

## How was this patch tested?

Added UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-26233_2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23234.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23234

commit 930c51029b845c74357305e7ec30a4f2e6ea748a
Author: Marco Gaido
Date: 2018-12-04T18:33:27Z

    [SPARK-26233][SQL] CheckOverflow when encoding a decimal value

    When we encode a Decimal from external source we don't check for overflow. That method is useful not only in order to enforce that we can represent the correct value in the specified range, but it also changes the underlying data to the right precision/scale. Since in our code generation we assume that a decimal has exactly the same precision and scale of its data type, missing to enforce it can lead to corrupted output/results when there are subsequent transformations.

    added UT

    Closes #23210 from mgaido91/SPARK-26233.

    Authored-by: Marco Gaido
    Signed-off-by: Dongjoon Hyun

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
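To make the overflow check described in this pull request concrete, here is a minimal sketch in plain Scala. It does not use Spark: `toPrecision` is a hypothetical helper built on `java.math.BigDecimal`, and the HALF_UP rounding mode is an assumption. It only illustrates the two effects named in the description: rejecting values that do not fit the declared precision, and rescaling values that do fit so the data matches the declared scale.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object DecimalOverflowSketch {
  // Hypothetical helper (not Spark's API): fit `value` into DECIMAL(precision, scale).
  // Returns None on overflow, otherwise the value rescaled to exactly `scale` digits.
  def toPrecision(value: JBigDecimal, precision: Int, scale: Int): Option[JBigDecimal] = {
    // Normalize the scale first; HALF_UP is an assumption made for this sketch.
    val rescaled = value.setScale(scale, RoundingMode.HALF_UP)
    // After rescaling, the total number of digits must not exceed the declared precision.
    if (rescaled.precision() > precision) None else Some(rescaled)
  }

  def main(args: Array[String]): Unit = {
    // Fits DECIMAL(5, 2); the stored value is rescaled to two fractional digits.
    println(toPrecision(new JBigDecimal("1.5"), 5, 2).map(_.toPlainString))
    // Needs seven digits at scale 2, so it overflows DECIMAL(5, 2).
    println(toPrecision(new JBigDecimal("12345.678"), 5, 2).map(_.toPlainString))
  }
}
```

Skipping a step like this leaves a value such as `1.5` stored with scale 1 under a type that declares scale 2, which is the kind of mismatch the description says can corrupt later transformations.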
[GitHub] spark issue #23234: [SPARK-26233][SQL][BACKPORT-2.2] CheckOverflow when enco...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23234 cc @cloud-fan @dongjoon-hyun
[GitHub] spark issue #23232: [SPARK-26233][SQL][BACKPORT-2.4] CheckOverflow when enco...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23232 **[Test build #99716 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99716/testReport)** for PR 23232 at commit [`821db48`](https://github.com/apache/spark/commit/821db4854c0e685aac3168da75a1c839681dbfc4).
[GitHub] spark issue #23233: [SPARK-26233][SQL][BACKPORT-2.3] CheckOverflow when enco...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23233 cc @cloud-fan @dongjoon-hyun
[GitHub] spark pull request #23233: [SPARK-26233][SQL][BACKPORT-2.3] CheckOverflow wh...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/23233

[SPARK-26233][SQL][BACKPORT-2.3] CheckOverflow when encoding a decimal value

## What changes were proposed in this pull request?

When we encode a Decimal from an external source, we don't check for overflow. That check is useful not only to enforce that the value can be represented in the specified range, but also because it changes the underlying data to the right precision/scale. Since our code generation assumes that a decimal has exactly the same precision and scale as its data type, failing to enforce this can lead to corrupted output/results when there are subsequent transformations.

## How was this patch tested?

Added UT.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-26233_2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23233.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23233

commit a1e77445c2675137fbcddf73181c47469f159dbf
Author: Marco Gaido
Date: 2018-12-04T18:33:27Z

    [SPARK-26233][SQL] CheckOverflow when encoding a decimal value

    When we encode a Decimal from external source we don't check for overflow. That method is useful not only in order to enforce that we can represent the correct value in the specified range, but it also changes the underlying data to the right precision/scale. Since in our code generation we assume that a decimal has exactly the same precision and scale of its data type, missing to enforce it can lead to corrupted output/results when there are subsequent transformations.

    added UT

    Closes #23210 from mgaido91/SPARK-26233.

    Authored-by: Marco Gaido
    Signed-off-by: Dongjoon Hyun
[GitHub] spark issue #23232: [SPARK-26233][SQL][BACKPORT-2.4] CheckOverflow when enco...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23232 cc @cloud-fan @dongjoon-hyun
[GitHub] spark pull request #23232: [SPARK-26233][SQL][BACKPORT-2.4] CheckOverflow wh...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/23232

[SPARK-26233][SQL][BACKPORT-2.4] CheckOverflow when encoding a decimal value

When we encode a Decimal from an external source, we don't check for overflow. That check is useful not only to enforce that the value can be represented in the specified range, but also because it changes the underlying data to the right precision/scale. Since our code generation assumes that a decimal has exactly the same precision and scale as its data type, failing to enforce this can lead to corrupted output/results when there are subsequent transformations.

Added UT.

Closes #23210 from mgaido91/SPARK-26233.

Authored-by: Marco Gaido
Signed-off-by: Dongjoon Hyun

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-26233_2.4

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23232.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #23232

commit 821db4854c0e685aac3168da75a1c839681dbfc4
Author: Marco Gaido
Date: 2018-12-04T18:33:27Z

    [SPARK-26233][SQL] CheckOverflow when encoding a decimal value

    When we encode a Decimal from external source we don't check for overflow. That method is useful not only in order to enforce that we can represent the correct value in the specified range, but it also changes the underlying data to the right precision/scale. Since in our code generation we assume that a decimal has exactly the same precision and scale of its data type, missing to enforce it can lead to corrupted output/results when there are subsequent transformations.

    added UT

    Closes #23210 from mgaido91/SPARK-26233.

    Authored-by: Marco Gaido
    Signed-off-by: Dongjoon Hyun
[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23229 I don't think it's worth renaming the variable in a standalone PR. Let's do that when we fix some code around here, or let other people fix it later.
[GitHub] spark issue #23224: [MINOR][SQL][TEST] WholeStageCodegen metrics should be t...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23224 Can we file a JIRA? I think it's not minor.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23230 Oh, the original one was 3.0. Although this doc change can go to branch-2.4 alone as well, let me revert it in branch-2.4 for management simplicity.
[GitHub] spark pull request #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEnc...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23230
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23159 Merged build finished. Test PASSed.
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23159 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5760/ Test PASSed.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23230 Merged to master and branch-2.4.
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Merged build finished. Test PASSed.
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23196 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5758/ Test PASSed.
[GitHub] spark issue #23159: [SPARK-26191][SQL] Control truncation of Spark plans via...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23159 **[Test build #99715 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99715/testReport)** for PR 23159 at commit [`e0aa626`](https://github.com/apache/spark/commit/e0aa626c886976489348a6c0179d160bbe3252da).
[GitHub] spark issue #23196: [SPARK-26243][SQL] Use java.time API for parsing timesta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23196 **[Test build #99714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99714/testReport)** for PR 23196 at commit [`07fcf46`](https://github.com/apache/spark/commit/07fcf4666a96928c8096db7a131e6514013679f0).
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22957 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5759/ Test PASSed.
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22957 Merged build finished. Test PASSed.
[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22957 **[Test build #99713 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99713/testReport)** for PR 22957 at commit [`e4f617f`](https://github.com/apache/spark/commit/e4f617fc7e47d7c49f3d773ac2d91c5508c0a239).
[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/23229 ok to test
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5757/ Test PASSed.
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23231 Merged build finished. Test PASSed.
[GitHub] spark issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed confi...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23213 How about `wholeStage=false, factoryMode=CODE_ONLY`? I think it's different from `wholeStage=false, factoryMode=NO_CODEGEN`.
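The configuration matrix being discussed can be written out as data. The sketch below is an illustration under assumptions, not the suite's actual code: the config keys `spark.sql.codegen.wholeStage` and `spark.sql.codegen.factoryMode` and the value names `CODEGEN_ONLY` and `NO_CODEGEN` are recalled from Spark SQL and should be checked against `SQLConf`. The comment's `CODE_ONLY` is assumed to mean `CODEGEN_ONLY`.

```scala
object CodegenConfigMatrix {
  // Config keys as recalled from Spark SQL; treat them as assumptions and verify against SQLConf.
  val WholeStageKey = "spark.sql.codegen.wholeStage"
  val FactoryModeKey = "spark.sql.codegen.factoryMode"

  // Every combination of the two settings, as plain key/value maps.
  def combinations: Seq[Map[String, String]] =
    for {
      wholeStage  <- Seq("true", "false")
      factoryMode <- Seq("CODEGEN_ONLY", "NO_CODEGEN")
    } yield Map(WholeStageKey -> wholeStage, FactoryModeKey -> factoryMode)

  def main(args: Array[String]): Unit =
    combinations.foreach { conf =>
      println(conf.map { case (k, v) => s"$k=$v" }.mkString(", "))
    }
}
```

Enumerating the pairs this way makes it easy to see that `wholeStage=false` appears with both factory modes, which are the two cases the comment distinguishes.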
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23231 **[Test build #99712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99712/testReport)** for PR 23231 at commit [`453d60f`](https://github.com/apache/spark/commit/453d60f42b99de621a7ee3fab6bc6138fc20ed05).
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Merged build finished. Test PASSed.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99710/ Test PASSed.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23230 **[Test build #99710 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99710/testReport)** for PR 23230 at commit [`5c7f6be`](https://github.com/apache/spark/commit/5c7f6be3c52e39924953f613d13225e32e8a63f9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23224: [MINOR][SQL][TEST] WholeStageCodegen metrics should be t...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23224 **[Test build #99711 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99711/testReport)** for PR 23224 at commit [`021728c`](https://github.com/apache/spark/commit/021728ccc70cf971592c560cfc5492dedbdc362a).
[GitHub] spark issue #23224: [MINOR][SQL][TEST] WholeStageCodegen metrics should be t...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/23224 retest this please
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99706/ Test PASSed.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Merged build finished. Test PASSed.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23230 **[Test build #99706 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99706/testReport)** for PR 23230 at commit [`63b7183`](https://github.com/apache/spark/commit/63b71834b101c800973b73490640a44e507306d1). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Merged build finished. Test PASSed.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23230 **[Test build #99710 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99710/testReport)** for PR 23230 at commit [`5c7f6be`](https://github.com/apache/spark/commit/5c7f6be3c52e39924953f613d13225e32e8a63f9).
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23231 Merged build finished. Test PASSed.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5756/ Test PASSed.
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5755/ Test PASSed.
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23231 **[Test build #99709 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99709/testReport)** for PR 23231 at commit [`f5ed812`](https://github.com/apache/spark/commit/f5ed81279d95b765ccf11752e09e3e66230b047a). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class OneHotEncoderEstimator @Since("2.3.0") (@Since("2.3.0") override val uid: String)` * `class OneHotEncoderEstimator(JavaEstimator, HasInputCols, HasOutputCols, HasHandleInvalid,`
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23231 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99709/ Test FAILed.
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23231 Merged build finished. Test FAILed.
[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/23229 ok to test
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99705/ Test PASSed.
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Merged build finished. Test PASSed.
[GitHub] spark pull request #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as a...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23231#discussion_r239011539

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.spark.annotation.Since
+import org.apache.spark.ml.Estimator
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.Dataset
+import org.apache.spark.sql.types.StructType
+
+/**
+ * A one-hot encoder that maps a column of category indices to a column of binary vectors, with
+ * at most a single one-value per row that indicates the input category index.
+ * For example with 5 categories, an input value of 2.0 would map to an output vector of
+ * `[0.0, 0.0, 1.0, 0.0]`.
+ * The last category is not included by default (configurable via `dropLast`),
+ * because it makes the vector entries sum up to one, and hence linearly dependent.
+ * So an input value of 4.0 maps to `[0.0, 0.0, 0.0, 0.0]`.
+ *
+ * @note This is different from scikit-learn's OneHotEncoder, which keeps all categories.
+ * The output vectors are sparse.
+ *
+ * When `handleInvalid` is configured to 'keep', an extra "category" indicating invalid values is
+ * added as last category. So when `dropLast` is true, invalid values are encoded as all-zeros
+ * vector.
+ *
+ * @note When encoding multi-column by using `inputCols` and `outputCols` params, input/output cols
+ * come in pairs, specified by the order in the arrays, and each pair is treated independently.
+ *
+ * @note `OneHotEncoderEstimator` is renamed to `OneHotEncoder` in 3.0.0. This
+ * `OneHotEncoderEstimator` is kept as an alias and will be removed in further version.
+ *
+ * @see `StringIndexer` for converting categorical values into category indices
+ */
+@Since("2.3.0")
--- End diff --

These since tags are from original OneHotEncoderEstimator.
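The `dropLast` behavior described in the quoted Scaladoc can be checked with a small standalone sketch. The `encode` function below is hypothetical and is not the Spark ML implementation: it returns a dense `Array[Double]` for readability, whereas the Scaladoc says the real output vectors are sparse, and it ignores `handleInvalid` entirely.

```scala
object OneHotSketch {
  // Hypothetical dense one-hot encoding of a category index.
  // With dropLast = true the vector has numCategories - 1 slots and the
  // last category is represented by the all-zeros vector.
  def encode(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    val size = if (dropLast) numCategories - 1 else numCategories
    val vec = Array.fill(size)(0.0)
    // The dropped last category leaves every slot at zero.
    if (index < size) vec(index) = 1.0
    vec
  }

  def main(args: Array[String]): Unit = {
    println(encode(2, 5).mkString("[", ", ", "]")) // the Scaladoc's example for input 2.0
    println(encode(4, 5).mkString("[", ", ", "]")) // the Scaladoc's example for input 4.0
  }
}
```

With 5 categories, `encode(2, 5)` sets only the third of four slots and `encode(4, 5)` leaves all four at zero, matching the two examples in the Scaladoc.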
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23231 **[Test build #99709 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99709/testReport)** for PR 23231 at commit [`f5ed812`](https://github.com/apache/spark/commit/f5ed81279d95b765ccf11752e09e3e66230b047a).
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23230 **[Test build #99705 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99705/testReport)** for PR 23230 at commit [`c84886a`](https://github.com/apache/spark/commit/c84886aef9a53d0d58ca4f0f68ece57ee80f88c8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/23231 cc @srowen @dbtsai
[GitHub] spark pull request #23196: [SPARK-26243][SQL] Use java.time API for parsing ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/23196#discussion_r239010321

--- Diff: sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala ---
@@ -49,8 +49,8 @@ class HiveCompatibilitySuite extends HiveQueryFileTest with BeforeAndAfter {
   override def beforeAll() {
     super.beforeAll()
     TestHive.setCacheTables(true)
-    // Timezone is fixed to America/Los_Angeles for those timezone sensitive tests (timestamp_*)
-    TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
+    // Timezone is fixed to GMT for those timezone sensitive tests (timestamp_*)
--- End diff --

Our current approach for converting dates is inconsistent in a few places, for example:
- `UTF8String` -> `num days` uses hardcoded `GMT` and ignores the SQL config: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L493
- `String` -> `java.util.Date` ignores Spark's time zone settings and uses the system time zone: https://github.com/apache/spark/blob/f982ca07e80074bdc1e3b742c5e21cf368e4ede2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L186
- In many places, even when a function accepts a timeZone parameter, it is not passed, so the default time zone is used (**not from config but from TimeZone.getDefault()**). For example: https://github.com/apache/spark/blob/36edbac1c8337a4719f90e4abd58d38738b2e1fb/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L187
- Casting to the date type depends on the type of the argument: if it is `TimestampType`, the expression-wise timezone is used, otherwise `GMT`: https://github.com/apache/spark/blob/d03e0af80d7659f12821cc2442efaeaee94d3985/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L403-L410

I really think we should disable the new parser/formatter outside of the CSV/JSON datasources, because it is hard to guarantee consistent behavior in combination with other date/timestamp functions. @srowen @gatorsmile @HyukjinKwon WDYT?
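A small standalone sketch shows why mixing time zones across conversion steps matters. It uses only `java.time`, not Spark's `DateTimeUtils`; `epochDay` is a hypothetical helper. It interprets the same wall-clock string in a given zone and then counts days since the epoch in GMT, loosely mimicking one step that honors a zone and a later step that hardcodes GMT.

```scala
import java.time.{LocalDateTime, ZoneId}

object TimeZoneSketch {
  // Hypothetical helper: read `local` as wall-clock time in `zone`,
  // then count whole days since 1970-01-01 as seen from GMT.
  def epochDay(local: String, zone: String): Long = {
    val instant = LocalDateTime.parse(local).atZone(ZoneId.of(zone)).toInstant
    Math.floorDiv(instant.getEpochSecond, 86400L)
  }

  def main(args: Array[String]): Unit = {
    val text = "2018-12-04T20:00:00"
    val gmtDay = epochDay(text, "GMT")
    val laDay = epochDay(text, "America/Los_Angeles")
    // Same input string, different day numbers depending on the zone used to parse it.
    println(s"GMT: $gmtDay, America/Los_Angeles: $laDay, difference: ${laDay - gmtDay}")
  }
}
```

Because 20:00 in Los Angeles in December is already the next day in GMT, the two calls disagree by one day, which is the kind of inconsistency the list above points at.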
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23231 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5753/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23231 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23163 Build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23231 **[Test build #99707 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99707/testReport)** for PR 23231 at commit [`1716071`](https://github.com/apache/spark/commit/17160710cadc49b54f4385ae3ca9ddb0eb4034b0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23163 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5754/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23163 **[Test build #99708 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99708/testReport)** for PR 23163 at commit [`6cb993b`](https://github.com/apache/spark/commit/6cb993b26e6b6867b3315228b55624b98acf1dcb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as a...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23231#discussion_r239008438 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderEstimatorSuite.scala --- @@ -0,0 +1,423 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.spark.ml.attribute.{AttributeGroup, BinaryAttribute, NominalAttribute} +import org.apache.spark.ml.linalg.{Vector, Vectors, VectorUDT} +import org.apache.spark.ml.param.ParamsSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest, MLTest} +import org.apache.spark.sql.{Encoder, Row} +import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder +import org.apache.spark.sql.functions.col +import org.apache.spark.sql.types._ + +class OneHotEncoderEstimatorSuite extends MLTest with DefaultReadWriteTest { --- End diff -- The fitting of OneHotEncoderEstimator is actually done by OneHotEncoder. OneHotEncoderEstimator is just an alias. I'm not sure if we really need to add this test suite for it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/23163 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23231: [SPARK-26273][ML] Add OneHotEncoderEstimator as a...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/23231 [SPARK-26273][ML] Add OneHotEncoderEstimator as alias to OneHotEncoder ## What changes were proposed in this pull request? SPARK-26133 removed deprecated OneHotEncoder and renamed OneHotEncoderEstimator to OneHotEncoder. Based on ml migration doc, we need to keep OneHotEncoderEstimator as an alias to OneHotEncoder. This task is going to add it. ## How was this patch tested? Added tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 one-hot-encoder-estimator-alias Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23231.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23231 commit 17160710cadc49b54f4385ae3ca9ddb0eb4034b0 Author: Liang-Chi Hsieh Date: 2018-12-05T09:27:58Z Add OneHotEncoderEstimator as alias to OneHotEncoder. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23230 **[Test build #99706 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99706/testReport)** for PR 23230 at commit [`63b7183`](https://github.com/apache/spark/commit/63b71834b101c800973b73490640a44e507306d1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5752/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed confi...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/23213 Yes, I am wondering too: what is the difference between `spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN` and `spark.sql.codegen.wholeStage=true,spark.sql.codegen.factoryMode=NO_CODEGEN`?
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23230 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23230 **[Test build #99705 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99705/testReport)** for PR 23230 at commit [`c84886a`](https://github.com/apache/spark/commit/c84886aef9a53d0d58ca4f0f68ece57ee80f88c8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder
Github user viirya commented on the issue: https://github.com/apache/spark/pull/23230 cc @HyukjinKwon @srowen --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23230: [SPARK-26133][ML][Followup] Fix doc for OneHotEnc...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/23230 [SPARK-26133][ML][Followup] Fix doc for OneHotEncoder ## What changes were proposed in this pull request? This fixes doc of renamed OneHotEncoder in PySpark. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 remove_one_hot_encoder_followup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23230.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23230 commit c84886aef9a53d0d58ca4f0f68ece57ee80f88c8 Author: Liang-Chi Hsieh Date: 2018-12-05T10:08:01Z Fix doc for OneHotEncoder. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23213: [SPARK-26262][SQL] Runs SQLQueryTestSuite on mixed confi...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/23213
Sorry, my bad; it was longer than the current master by ~2 times. That's because the current master has already run two config set patterns (`wholeStage=true,factoryMode=CODEGEN_ONLY` and `wholeStage=true,factoryMode=NO_CODEGEN`) in `SQLQueryTestSuite`. The second test run (`wholeStage=true,factoryMode=NO_CODEGEN`) was introduced in my previous pr (#22512).
IMHO two config set patterns below could cover most code paths in Spark?
- wholeStage=true, factoryMode=CODEGEN_ONLY
- wholeStage=false, factoryMode=NO_CODEGEN
In this case, there is little change in the test time;
```
// the current master
=== Codegen/Interpreter Time Metrics ===
Total time: 358.584989321 seconds

Configs                                                                        Run Time
spark.sql.codegen.wholeStage=true,spark.sql.codegen.factoryMode=NO_CODEGEN     165961038511
spark.sql.codegen.wholeStage=true,spark.sql.codegen.factoryMode=CODEGEN_ONLY   192623950810

// with this pr
=== Codegen/Interpreter Time Metrics ===
Total time: 345.468455247 seconds

Configs                                                                        Run Time
spark.sql.codegen.wholeStage=true,spark.sql.codegen.factoryMode=CODEGEN_ONLY   196572976377
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN    148895478870
```
WDYT?
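The figures in this comment can be cross-checked with a few lines of arithmetic. The sketch below assumes that the per-config "Run Time" values are nanoseconds; that unit is not stated in the message and is inferred only from the fact that the two values in each report add up to the reported total. The class and method names are hypothetical.

```java
// Hypothetical helper for checking the reported timing totals.
final class CodegenTimingSketch {
    // Sums per-config run times, assumed to be in nanoseconds.
    static long totalNanos(long... runTimesNanos) {
        long sum = 0L;
        for (long t : runTimesNanos) {
            sum += t;
        }
        return sum;
    }

    public static void main(String[] args) {
        long master = totalNanos(165961038511L, 192623950810L);
        long withPr = totalNanos(196572976377L, 148895478870L);
        // Matches "Total time: 358.584989321 seconds" and "345.468455247 seconds".
        System.out.println("master: " + master + " ns, with this pr: " + withPr + " ns");
        // Difference between the two runs, in nanoseconds (about 13 seconds).
        System.out.println("difference: " + (master - withPr) + " ns");
    }
}
```

Under that assumption the two configurations differ by roughly 13 seconds out of about 359, which is consistent with the statement that there is little change in the test time.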
[GitHub] spark pull request #23195: [SPARK-26236][SS] Add kafka delegation token supp...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/23195#discussion_r238995809 --- Diff: docs/structured-streaming-kafka-integration.md --- @@ -624,3 +624,57 @@ For experimenting on `spark-shell`, you can also use `--packages` to add `spark- See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies. + +## Security + +Kafka 0.9.0.0 introduced several features that increases security in a cluster. For detailed +description about these possibilities, see [Kafka security docs](http://kafka.apache.org/documentation.html#security). + +It's worth noting that security is optional and turned off by default. + +Spark supports the following ways to authenticate against Kafka cluster: +- **Delegation token (introduced in Kafka broker 1.1.0)**: This way the application can be configured + via Spark parameters and may not need JAAS login configuration (Spark can use Kafka's dynamic JAAS + configuration feature). For further information about delegation tokens, see + [Kafka delegation token docs](http://kafka.apache.org/documentation/#security_delegation_token). + + The process is initiated by Spark's Kafka delegation token provider. When `spark.kafka.bootstrap.servers` + set Spark looks for authentication information in the following order and choose the first available to log in: + - **JAAS login configuration** + - **Keytab file**, such as, + +./bin/spark-submit \ +--keytab \ +--principal \ +--conf spark.kafka.bootstrap.servers= \ +... + + - **Kerberos credential cache**, such as, + +./bin/spark-submit \ +--conf spark.kafka.bootstrap.servers= \ +... + + Kafka delegation token provider can be turned off by setting `spark.security.credentials.kafka.enabled` to `false` (default: `true`). 
+ + Spark can be configured to use the following authentication protocols to obtain token: + - **SASL SSL (default)**: With `GSSAPI` mechanism Kerberos used for authentication and SSL for encryption. + - **SSL**: It's leveraging a capability from SSL called 2-way authentication. The server authenticates +clients through certificates. Please note 2-way authentication must be enabled on Kafka brokers. + - **SASL PLAINTEXT (for testing)**: With `GSSAPI` mechanism Kerberos used for authentication but +because there is no encryption it's only for testing purposes. + + After obtaining delegation token successfully, Spark distributes it across nodes and renews it accordingly. + Delegation token uses `SCRAM` login module for authentication and because of that the appropriate + `sasl.mechanism` has to be configured on source/sink. --- End diff -- It means exactly that. This was missing; I added an example.
[GitHub] spark pull request #23195: [SPARK-26236][SS] Add kafka delegation token supp...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/23195#discussion_r238995441 --- Diff: docs/structured-streaming-kafka-integration.md --- @@ -624,3 +624,57 @@ For experimenting on `spark-shell`, you can also use `--packages` to add `spark- See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies. + +## Security + +Kafka 0.9.0.0 introduced several features that increases security in a cluster. For detailed +description about these possibilities, see [Kafka security docs](http://kafka.apache.org/documentation.html#security). + +It's worth noting that security is optional and turned off by default. + +Spark supports the following ways to authenticate against Kafka cluster: +- **Delegation token (introduced in Kafka broker 1.1.0)**: This way the application can be configured + via Spark parameters and may not need JAAS login configuration (Spark can use Kafka's dynamic JAAS + configuration feature). For further information about delegation tokens, see + [Kafka delegation token docs](http://kafka.apache.org/documentation/#security_delegation_token). + + The process is initiated by Spark's Kafka delegation token provider. When `spark.kafka.bootstrap.servers` + set Spark looks for authentication information in the following order and choose the first available to log in: + - **JAAS login configuration** + - **Keytab file**, such as, + +./bin/spark-submit \ +--keytab \ +--principal \ +--conf spark.kafka.bootstrap.servers= \ +... + + - **Kerberos credential cache**, such as, + +./bin/spark-submit \ +--conf spark.kafka.bootstrap.servers= \ +... + + Kafka delegation token provider can be turned off by setting `spark.security.credentials.kafka.enabled` to `false` (default: `true`). --- End diff -- Fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23195: [SPARK-26236][SS] Add kafka delegation token supp...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/23195#discussion_r238995312 --- Diff: docs/structured-streaming-kafka-integration.md --- @@ -624,3 +624,57 @@ For experimenting on `spark-shell`, you can also use `--packages` to add `spark- See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies. + +## Security + +Kafka 0.9.0.0 introduced several features that increases security in a cluster. For detailed +description about these possibilities, see [Kafka security docs](http://kafka.apache.org/documentation.html#security). + +It's worth noting that security is optional and turned off by default. + +Spark supports the following ways to authenticate against Kafka cluster: +- **Delegation token (introduced in Kafka broker 1.1.0)**: This way the application can be configured + via Spark parameters and may not need JAAS login configuration (Spark can use Kafka's dynamic JAAS + configuration feature). For further information about delegation tokens, see + [Kafka delegation token docs](http://kafka.apache.org/documentation/#security_delegation_token). + + The process is initiated by Spark's Kafka delegation token provider. When `spark.kafka.bootstrap.servers` + set Spark looks for authentication information in the following order and choose the first available to log in: --- End diff -- Fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23195: [SPARK-26236][SS] Add kafka delegation token supp...
Github user gaborgsomogyi commented on a diff in the pull request: https://github.com/apache/spark/pull/23195#discussion_r238994314 --- Diff: docs/structured-streaming-kafka-integration.md --- @@ -624,3 +624,56 @@ For experimenting on `spark-shell`, you can also use `--packages` to add `spark- See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies. + +## Security + +Kafka 0.9.0.0 introduced several features that increases security in a cluster. For detailed +description about these possibilities, see [Kafka security docs](http://kafka.apache.org/documentation.html#security). + +It's worth noting that security is optional and turned off by default. + +Spark supports the following ways to authenticate against Kafka cluster: +- **Delegation token (introduced in Kafka broker 1.1.0)**: This way the application can be configured + via Spark parameters and may not need JAAS login configuration (Spark can use Kafka's dynamic JAAS + configuration feature). For further information about delegation tokens, see + [Kafka delegation token docs](http://kafka.apache.org/documentation/#security_delegation_token). + + The process is initiated by Spark's Kafka delegation token provider. This is enabled by default + but can be turned off with `spark.security.credentials.kafka.enabled`. When + `spark.kafka.bootstrap.servers` set Spark looks for authentication information in the following + order and choose the first available to log in: + - **JAAS login configuration** + - **Keytab file**, such as, + +./bin/spark-submit \ +--keytab \ +--principal \ +--conf spark.kafka.bootstrap.servers= \ +... + + - **Kerberos credential cache**, such as, + +./bin/spark-submit \ +--conf spark.kafka.bootstrap.servers= \ +... + + Spark supports the following authentication protocols to obtain token: --- End diff -- OK, fixed. 
[GitHub] spark issue #22952: [SPARK-20568][SS] Provide option to clean up completed f...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/22952
@HeartSaVioR @steveloughran As I see it, not only `*` and `?` are missing but `[]` as well.
* I think having a glob parser in Spark and supporting it is too heavy and brittle.
* Considering this, I would solve it with warnings plus a caveat message in the doc (mentioning the slow globbing on object stores).
As a separate off-topic question, I'm wondering how Hadoop's globbing works if the expander doesn't support all the glob elements. Maybe other operators (like `[]`) are handled in a different part of the code?
[GitHub] spark issue #23229: [MINOR][CORE] Modify some field name because it may be c...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23229 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23229: [MINOR][CORE] Modify some field name because it m...
GitHub user wangjiaochun opened a pull request: https://github.com/apache/spark/pull/23229 [MINOR][CORE] Modify some field name because it may be cause confusion ## What changes were proposed in this pull request? Different field name styles are used for tracking allocated data pages: class BytesToBytesMap uses the field name dataPages, while classes UnsafeExternalSorter and ShuffleExternalSorter use the field name allocatedPages. They are all memory consumers, so I think it is best to use a unified name. In addition, the TaskMemoryManager field allocatedPages is renamed to pagesBitSet (to indicate that it is a bitmap). ## How was this patch tested? Existing tests You can merge this pull request into a Git repository by running: $ git pull https://github.com/wangjiaochun/spark memory_consumer_name Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23229.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23229 commit 00fa455a6e145350a2bc5750df54cd0a9d1f0cdc Author: 10087686 Date: 2018-12-05T08:48:08Z modify field name in MemoryConsumer
[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23227 **[Test build #99704 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99704/testReport)** for PR 23227 at commit [`5cb416d`](https://github.com/apache/spark/commit/5cb416df5f03b0d750c83e1a8a344b8ea44b1735). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23227 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5751/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23227 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23228: [MINOR][DOC]The condition description of serialized shuf...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23228 **[Test build #99703 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99703/testReport)** for PR 23228 at commit [`d5dadbf`](https://github.com/apache/spark/commit/d5dadbf30d5429c36ec3d5c2845a71c2717fd6f3). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23228: [MINOR][DOC]The condition description of serialized shuf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23228 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5750/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23228: [MINOR][DOC]The condition description of serialized shuf...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23228 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23228: [MINOR][DOC]The condition description of serializ...
GitHub user 10110346 opened a pull request: https://github.com/apache/spark/pull/23228 [MINOR][DOC]The condition description of serialized shuffle is not very accurate ## What changes were proposed in this pull request? `1. The shuffle dependency specifies no aggregation or output ordering.` If the shuffle dependency specifies aggregation but only aggregates on the reducer side, serialized shuffle can still be used. `3. The shuffle produces fewer than 16777216 output partitions.` If the number of output partitions is exactly 16777216, serialized shuffle can still be used. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/10110346/spark SerializedShuffle_doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23228.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23228 commit d5dadbf30d5429c36ec3d5c2845a71c2717fd6f3 Author: liuxian Date: 2018-12-05T08:55:20Z fix
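The off-by-one in condition 3 can be seen from a small sketch. It rests on an assumption that is not stated in this message: that serialized shuffle stores each record's partition id in a 24-bit field, so ids run from 0 to 2^24 - 1. All names below (`SerializedShuffleBoundSketch`, `PARTITION_ID_BITS`, `partitionCountAllowed`) are hypothetical and are not Spark's own identifiers.

```java
// Hypothetical sketch of the partition-count bound; not Spark's actual code.
final class SerializedShuffleBoundSketch {
    // Assumption: the partition id is packed into 24 bits.
    static final int PARTITION_ID_BITS = 24;
    // Largest id that fits in 24 bits: 16777215.
    static final int MAX_PARTITION_ID = (1 << PARTITION_ID_BITS) - 1;
    // Ids are zero-based, so this many partitions can be addressed: 16777216.
    static final int MAX_PARTITIONS = MAX_PARTITION_ID + 1;

    // True when every partition id in [0, numPartitions) fits in the 24-bit field.
    static boolean partitionCountAllowed(int numPartitions) {
        return numPartitions <= MAX_PARTITIONS;
    }

    public static void main(String[] args) {
        System.out.println(partitionCountAllowed(16777216)); // highest id is 16777215, which fits
        System.out.println(partitionCountAllowed(16777217)); // highest id is 16777216, which does not
    }
}
```

With zero-based ids, exactly 16777216 partitions still fit, so under this assumption "fewer than 16777216" understates the limit by one, which is the correction the pull request proposes.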
[GitHub] spark issue #23227: [SPARK-26271][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23227 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23227: [SPARK-16958][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23227 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5749/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23227: [SPARK-16958][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23227 **[Test build #99702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99702/testReport)** for PR 23227 at commit [`5cb416d`](https://github.com/apache/spark/commit/5cb416df5f03b0d750c83e1a8a344b8ea44b1735). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23227: [SPARK-16958][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23227 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23226: [MINOR][TEST] Add MAXIMUM_PAGE_SIZE_BYTES Excepti...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23226#discussion_r238977650 --- Diff: core/src/test/java/org/apache/spark/unsafe/map/AbstractBytesToBytesMapSuite.java --- @@ -622,6 +622,17 @@ public void initialCapacityBoundsChecking() { } catch (IllegalArgumentException e) { // expected exception } + +try { + new BytesToBytesMap( + taskMemoryManager, --- End diff -- Let's keep the indentation consistent --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23227: [SPARK-16958][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/23227 cc @cloud-fan, @gatorsmile, @hvanhovell ,@davies --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23225 Also, it needs a JIRA. It's not a minor one.
[GitHub] spark issue #23227: [SPARK-16958][FOLLOW-UP][SQL] remove unuse object SparkP...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23227 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23225: [MINOR][CORE]Don't need to create an empty spill file wh...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23225 How do the existing tests cover whether the empty file is created or not?
[GitHub] spark pull request #23227: [SPARK-16958][FOLLOW-UP][SQL] remove unuse object...
GitHub user heary-cao opened a pull request: https://github.com/apache/spark/pull/23227 [SPARK-16958][FOLLOW-UP][SQL] remove unuse object SparkPlan ## What changes were proposed in this pull request? This code comes from PR: https://github.com/apache/spark/pull/11190, but it has been unused since PR: https://github.com/apache/spark/pull/14548. Let's continue to fix it. Thanks. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/heary-cao/spark unuseSparkPlan Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23227.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23227 commit 5cb416df5f03b0d750c83e1a8a344b8ea44b1735 Author: caoxuewen Date: 2018-12-05T08:52:23Z [SPARK-16958][FOLLOW-UP][SQL] remove unuse object SparkPlan
[GitHub] spark issue #23222: [SPARK-20636] Add the rule TransposeWindow to the optimi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23222 Merged build finished. Test PASSed.
[GitHub] spark issue #23222: [SPARK-20636] Add the rule TransposeWindow to the optimi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23222 **[Test build #99701 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99701/testReport)** for PR 23222 at commit [`1270e89`](https://github.com/apache/spark/commit/1270e89026d80c862137c03edbeee53e56f3ed6d).
[GitHub] spark issue #23222: [SPARK-20636] Add the rule TransposeWindow to the optimi...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23222 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5748/ Test PASSed.
[GitHub] spark issue #23222: [SPARK-20636] Add the rule TransposeWindow to the optimi...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/23222 Retest this please.
[GitHub] spark issue #22683: [SPARK-25696] The storage memory displayed on spark Appl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22683 **[Test build #99700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99700/testReport)** for PR 22683 at commit [`8cc05a5`](https://github.com/apache/spark/commit/8cc05a57e8ecaa3e2a2f67d125b12645bb4eb3a2).
[GitHub] spark issue #23226: [MINOR][TEST] Add MAXIMUM_PAGE_SIZE_BYTES Exception test
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23226 Can one of the admins verify this patch?