[GitHub] spark pull request #18029: [SPARK-20168] [DStream] Add changes to use kinesi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/18029
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20081 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85392/
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20081 Merged build finished. Test PASSed.
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20081

**[Test build #85392 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85392/testReport)** for PR 20081 at commit [`10a80b2`](https://github.com/apache/spark/commit/10a80b272e898043e250c2b24a792c9474cf0d10).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #18029: [SPARK-20168] [DStream] Add changes to use kinesis fetch...
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/18029 Merged to master. Thanks!
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20081 FYI, there is a JIRA for a doc about `spark.sql.parquet.writeLegacyFormat` - https://issues-test.apache.org/jira/plugins/servlet/mobile#issue/SPARK-20937
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813

Which one is more common? A chain of arithmetic expressions? Or a deeply nested expression? I don't see strong evidence from the discussion that supports statement output. The only possibility for now is reducing code size, and that is also for performance, not stability. On the contrary, isn't using local variables more stable? Don't forget we would need to introduce another mechanism to fix the problems of statement output, such as the re-evaluation issue I pointed out above. I'm not saying it is bad to support statement output, but for now the reason to support it is very vague.
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158675633

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/init-container/Dockerfile ---
@@ -0,0 +1,24 @@
+# [Apache license header]
+
+FROM spark-base
+
+# If this docker file is being used in the context of building your images from a Spark distribution, the docker build
+# command should be invoked from the top level directory of the Spark distribution. E.g.:
+# docker build -t spark-init:latest -f dockerfiles/init-container/Dockerfile .

--- End diff --

`kubernetes/dockerfiles/..` instead of `dockerfiles/..`

Btw, only nits, but it seems like the paths in the driver/executor `Dockerfile`s are wrong as well: shouldn't they be `kubernetes/dockerfiles/driver/Dockerfile` and `kubernetes/dockerfiles/executor/Dockerfile`, respectively?
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20081

> spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, hive external table won't be able to access parquet data.

Well, that's really an undocumented feature... Can you submit a PR to update the description of `SQLConf.PARQUET_WRITE_LEGACY_FORMAT` and add a test?

> repartition and coalesce is most common use case in Industry to control N Number of files under directory while doing partitioning data.

Yea I know, but that's not accurate: it assumes each task outputs one file, which is not true if `spark.sql.files.maxRecordsPerFile` is set to a small number. Anyway, this is not a Hive feature; we should probably put it in the `SQL Programming Guide`.
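As an aside, a minimal sketch of how this legacy-format flag is typically used (the path and column names here are illustrative, not from the PR):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: spark.sql.parquet.writeLegacyFormat makes Spark write
// Parquet decimals and nested types in the old layout that Hive/Impala readers expect.
val spark = SparkSession.builder().appName("legacy-parquet-sketch").getOrCreate()
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

spark.range(10)
  .selectExpr("CAST(id AS DECIMAL(10, 2)) AS amount")
  .write.mode("overwrite")
  .parquet("/tmp/legacy_parquet")  // a Hive external table over this path can now read it
```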
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user gengliangwang commented on the issue: https://github.com/apache/spark/pull/20020

All these insertion commands come from `postHocResolutionRules`, while there are other batches after it; skipping the batches after `postHocResolutionRules` would cause analysis errors. I decided not to add `AnalysisBarrier`, for correctness and robustness.
[GitHub] spark issue #20020: [SPARK-22834][SQL] Make insertion commands have real chi...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20020 **[Test build #85396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85396/testReport)** for PR 20020 at commit [`cd2bbf8`](https://github.com/apache/spark/commit/cd2bbf8434a6b142f89f427db8654aeef36cec11).
[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20080 cc @srowen
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19929 Could you change the JIRA number to https://issues.apache.org/jira/browse/SPARK-22901 ?
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20076 **[Test build #85395 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85395/testReport)** for PR 20076 at commit [`9229e6f`](https://github.com/apache/spark/commit/9229e6f1fa8f9fe58d279c6ab14cb1d20068a277).
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20076 Merged build finished. Test FAILed.
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20076 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85394/
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20076

**[Test build #85394 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85394/testReport)** for PR 20076 at commit [`e510b48`](https://github.com/apache/spark/commit/e510b486ab1cea2f2f4f855747c86cd8af73728c).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class CompressionCodecPrecedenceSuite extends SQLTestUtils with SharedSQLContext`
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20076 **[Test build #85394 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85394/testReport)** for PR 20076 at commit [`e510b48`](https://github.com/apache/spark/commit/e510b486ab1cea2f2f4f855747c86cd8af73728c).
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user ep1804 commented on the issue: https://github.com/apache/spark/pull/20004

Revisions applied:
- comment on the default values
- apply charToEscapeQuoteEscaping using the Option type
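For illustration, one way the Option-based handling could look (a sketch with assumed names, not the actual patch):

```scala
import com.univocity.parsers.csv.CsvWriterSettings

// Hypothetical sketch: read the CSV option only if the user set it, so the
// univocity parser's own default survives when the option is absent.
def applyCharToEscapeQuoteEscaping(
    parameters: Map[String, String],
    settings: CsvWriterSettings): Unit = {
  val charToEscapeQuoteEscaping: Option[Char] =
    parameters.get("charToEscapeQuoteEscaping").map { v =>
      require(v.length == 1, "charToEscapeQuoteEscaping must be a single character")
      v.charAt(0)
    }
  // override the parser default only when the option was provided
  charToEscapeQuoteEscaping.foreach(settings.getFormat.setCharToEscapeQuoteEscaping)
}
```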
[GitHub] spark issue #20004: [Spark-22818][SQL] csv escape of quote escape
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20004 **[Test build #85393 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85393/testReport)** for PR 20004 at commit [`c2f877d`](https://github.com/apache/spark/commit/c2f877d9d29668114b8672ec8481636a95c53987).
[GitHub] spark issue #11994: [SPARK-14151] Expose metrics Source and Sink interface
Github user CodingCat commented on the issue: https://github.com/apache/spark/pull/11994

If I understand correctly, the only issue here is that we exposed Codahale's MetricRegistry in the Sink base class: https://github.com/apache/spark/pull/11994/files#diff-9ffc4de02d8a9b4961815f89557ca472R39. Fortunately, we only use this registry for registering a reporter, so how about providing an abstract method for creating the reporter in the Sink class?
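To make that suggestion concrete, here is one possible shape for it: a sketch assuming Codahale's `ScheduledReporter`, not an actual Spark API.

```scala
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{MetricRegistry, ScheduledReporter}

// Hypothetical sketch: the registry stays internal to the base class, and
// subclasses implement only the reporter factory method.
abstract class Sink(registry: MetricRegistry, pollPeriodSeconds: Long) {
  protected def createReporter(registry: MetricRegistry): ScheduledReporter

  private lazy val reporter: ScheduledReporter = createReporter(registry)

  def start(): Unit = reporter.start(pollPeriodSeconds, TimeUnit.SECONDS)
  def stop(): Unit = reporter.stop()
  def report(): Unit = reporter.report()
}
```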
[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85390/
[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20080 Merged build finished. Test PASSed.
[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20080

**[Test build #85390 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85390/testReport)** for PR 20080 at commit [`1dcec41`](https://github.com/apache/spark/commit/1dcec41a3c1e2c001b0f9fed92aa6f03b6c47f3a).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20081 @cloud-fan `spark.sql.files.maxRecordsPerFile` didn't work out when I was working with my 30 TB Spark Hive workload, whereas repartition and coalesce made sense.
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user chetkhatri commented on the issue: https://github.com/apache/spark/pull/20081

@cloud-fan Thanks for the PR.

4. spark.sql.parquet.writeLegacyFormat - if you don't use this configuration, the hive external table won't be able to access the parquet data.
5. repartition and coalesce are the most common way in industry to control the number of files under a directory when partitioning data. I.e., if the data volume is very huge, every partition would have many small files, which can harm downstream query performance due to file I/O, bandwidth I/O, network I/O, and disk I/O.

Otherwise I am good with your approach.
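For context on item 5, a minimal sketch of the repartition pattern being described (assumes a `spark` session; table and path names are made up):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical sketch: hash-partition by the partition columns before writing,
// so each (year, month) directory is written by a single task and gets one
// file, unless spark.sql.files.maxRecordsPerFile splits it further
// (cloud-fan's caveat above).
val df = spark.table("events")  // assumed source table
df.repartition(8, col("year"), col("month"))
  .write
  .mode("overwrite")
  .partitionBy("year", "month")
  .parquet("/tmp/events_partitioned")  // illustrative output path
```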
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813

I think all arithmetic, predicate and bitwise expressions can benefit from it, and they are very common expressions in SQL. More importantly, allowing expressions to output statements may have other benefits that we haven't discovered yet; I don't think we should sacrifice it just to support splitting code in whole-stage codegen, which is only for performance, not stability. For now I think we can fix the 64KB compile error caused by the whole-stage codegen framework, not expressions. I remember @maropu has a PR to fix that, and I'd prefer to prioritize reviewing that PR.
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20081 **[Test build #85392 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85392/testReport)** for PR 20081 at commit [`10a80b2`](https://github.com/apache/spark/commit/10a80b272e898043e250c2b24a792c9474cf0d10).
[GitHub] spark issue #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examp...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/20081 @chetkhatri @srowen @gatorsmile
[GitHub] spark pull request #20081: [SPARK-22833][EXAMPLE] Improvement SparkHive Scal...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20081

[SPARK-22833][EXAMPLE] Improvement SparkHive Scala Examples

## What changes were proposed in this pull request?

Some improvements:
1. Point out that we are using both Spark SQL native syntax and HQL syntax in the example.
2. Avoid using the same table name as the temp view, to not confuse users.
3. Create the external hive table with a directory that already has data, which is a more common use case (a sketch follows this message).
4. Remove the usage of `spark.sql.parquet.writeLegacyFormat`. This config was introduced by https://github.com/apache/spark/pull/8566 and has nothing to do with Hive.
5. Remove the `repartition` and `coalesce` example. These 2 are not Hive specific; we should put them in a different example file. BTW they can't accurately control the number of output files, since `spark.sql.files.maxRecordsPerFile` also controls it.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark minor

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20081.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20081

commit 10a80b272e898043e250c2b24a792c9474cf0d10
Author: Wenchen Fan
Date: 2017-12-26T04:30:10Z

    clean up
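A minimal sketch of improvement 3, the external-table-over-existing-data pattern (identifiers and paths are illustrative; the actual example file differs):

```scala
// Hypothetical sketch: produce Parquet data first, then create an external
// Hive table on top of the already-populated directory.
val dataDir = "/tmp/parquet_data"  // illustrative path
spark.range(10).write.mode("overwrite").parquet(dataDir)

spark.sql(
  s"""CREATE EXTERNAL TABLE hive_ints(id BIGINT)
     |STORED AS PARQUET LOCATION '$dataDir'""".stripMargin)

// The Hive table sees the pre-existing files immediately.
spark.sql("SELECT * FROM hive_ints").show()
```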
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user fjh100456 commented on the issue: https://github.com/apache/spark/pull/20076 Well, I'll revert the renaming. Any comments? @gatorsmile
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20076 Merged build finished. Test PASSed.
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20076 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85388/
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20076

**[Test build #85388 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85388/testReport)** for PR 20076 at commit [`2ab2d29`](https://github.com/apache/spark/commit/2ab2d293a0548b66070e840372e589eb2949a0ff).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/20076 Sure, let's revert the rename then.
[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20067 Merged build finished. Test PASSed.
[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20067 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85387/
[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20067

**[Test build #85387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85387/testReport)** for PR 20067 at commit [`ae998ec`](https://github.com/apache/spark/commit/ae998ec2b5548b7028d741da4813473dde1ad81e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19813

This is only valid when, by coincidence, all the expressions involved can use a statement as output. As I looked at the codebase, I think only a few expressions can output statements. This may not apply generally enough to reduce code size.
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19813

I did a search but can't find one in the current codebase. Still, I do think this is a valid idea; e.g. a simple example would be `a + b + ... + z`. If expressions can output statements, then we just generate code like
```
int result = a + b ... + z;
boolean isNull = false;
```
instead of
```
int result1 = a + b;
boolean isNull1 = false;
int result2 = result1 + c;
boolean isNull2 = false;
...
```
This can apply to both whole-stage codegen and normal codegen, reduce the code size dramatically, and make whole-stage codegen less likely to hit the 64KB compile error.

Another thing I'm working on: do not create global variables if `ctx.splitExpressions` doesn't split. That optimization should be much more useful if combined with this one.
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/20076 Thanks for the PR. Why are we complicating the PR by doing the rename? Does this actually gain anything other than minor cosmetic changes? It makes the simple PR pretty long ...
[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20076#discussion_r158663731

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala ---
@@ -0,0 +1,61 @@
+/* [Apache license header] */
+
+package org.apache.spark.sql.hive

--- End diff --

Move it to sql/core.
[GitHub] spark pull request #20076: [SPARK-21786][SQL] When acquiring 'compressionCod...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20076#discussion_r158663721

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/CompressionCodecSuite.scala ---
@@ -0,0 +1,61 @@
+/* [Apache license header] */
+
+package org.apache.spark.sql.hive
+
+import org.apache.parquet.hadoop.ParquetOutputFormat
+
+import org.apache.spark.sql.execution.datasources.parquet.ParquetOptions
+import org.apache.spark.sql.hive.test.TestHiveSingleton
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SQLTestUtils
+
+class CompressionCodecSuite extends TestHiveSingleton with SQLTestUtils {

--- End diff --

This suite does not need `TestHiveSingleton`.
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Merged build finished. Test PASSed.
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85391/
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527

**[Test build #85391 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85391/testReport)** for PR 19527 at commit [`587ad42`](https://github.com/apache/spark/commit/587ad427a6682e98e1fefe592ecf278c674767f3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19527 Unit tests are reformatted too.
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85389/
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527

**[Test build #85389 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85389/testReport)** for PR 19527 at commit [`144f07d`](https://github.com/apache/spark/commit/144f07d5e92bf5cbc10cb2dc990fc32f15405977).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19527 Merged build finished. Test PASSed.
[GitHub] spark issue #20080: [SPARK-22870][CORE] Dynamic allocation should allow 0 id...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20080 **[Test build #85390 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85390/testReport)** for PR 20080 at commit [`1dcec41`](https://github.com/apache/spark/commit/1dcec41a3c1e2c001b0f9fed92aa6f03b6c47f3a).
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527 **[Test build #85391 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85391/testReport)** for PR 19527 at commit [`587ad42`](https://github.com/apache/spark/commit/587ad427a6682e98e1fefe592ecf278c674767f3).
[GitHub] spark pull request #20080: [SPARK-22870][CORE] Dynamic allocation should all...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/20080

[SPARK-22870][CORE] Dynamic allocation should allow 0 idle time

## What changes were proposed in this pull request?

This PR makes `0` a valid value for `spark.dynamicAllocation.executorIdleTimeout`. For details, see the jira description: https://issues.apache.org/jira/browse/SPARK-22870.

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-22870

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20080.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20080

commit 1dcec41a3c1e2c001b0f9fed92aa6f03b6c47f3a
Author: Yuming Wang
Date: 2017-12-26T01:58:49Z

    Dynamic allocation should allow 0 idle time
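To illustrate the setting in question, a hedged sketch assuming this change is in place (before it, `0` was rejected as an invalid timeout):

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch: with executorIdleTimeout = 0s, an idle executor is
// released immediately instead of after a positive minimum timeout.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")  // required by dynamic allocation
  .set("spark.dynamicAllocation.executorIdleTimeout", "0s")  // valid after SPARK-22870
```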
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r158659399

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
[Quoted diff of the new `OneHotEncoderEstimator.scala` file (license header, `handleInvalid`/`dropLast` params, and `validateAndTransformSchema`); the quote is truncated in the archive and the inline review comment is not preserved.]
[GitHub] spark issue #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator for OneH...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19527 **[Test build #85389 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85389/testReport)** for PR 19527 at commit [`144f07d`](https://github.com/apache/spark/commit/144f07d5e92bf5cbc10cb2dc990fc32f15405977).
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r158659167

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
[Quoted diff of the new `OneHotEncoderEstimator.scala` file, near-identical to the quote above; the quote is truncated in the archive and the inline review comment is not preserved.]
[GitHub] spark pull request #19527: [SPARK-13030][ML] Create OneHotEncoderEstimator f...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19527#discussion_r158659174

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoderEstimator.scala ---
[Quoted diff of the new `OneHotEncoderEstimator.scala` file, near-identical to the quote above; the quote is truncated in the archive and the inline review comment is not preserved.]
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20076 **[Test build #85388 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85388/testReport)** for PR 20076 at commit [`2ab2d29`](https://github.com/apache/spark/commit/2ab2d293a0548b66070e840372e589eb2949a0ff).
[GitHub] spark issue #20076: [SPARK-21786][SQL] When acquiring 'compressionCodecClass...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20076 Retest this please
[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20067 **[Test build #85387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85387/testReport)** for PR 20067 at commit [`ae998ec`](https://github.com/apache/spark/commit/ae998ec2b5548b7028d741da4813473dde1ad81e).
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20079 Thank you, @gatorsmile and @wangyum
[GitHub] spark issue #20067: [SPARK-22894][SQL] DateTimeOperations should accept SQL ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20067 retest this please
[GitHub] spark pull request #20067: [SPARK-22894][SQL] DateTimeOperations should acce...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20067#discussion_r158657480

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala ---
@@ -2760,6 +2760,17 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }

+  test("SPARK-22894: DateTimeOperations should accept SQL like string type") {
+    val date = "2017-12-24"
+    val str = sql(s"SELECT CAST('$date' as STRING) + interval 2 months 2 seconds")

--- End diff --

I saw the original PR: https://github.com/apache/spark/pull/7754/files#r35821191. Maybe the SQL API should support it, since we do support it in the DataFrame APIs.
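For comparison, a sketch of the DataFrame-side usage the comment says is already supported (spark-shell style; illustrative, not the test in the diff):

```scala
import org.apache.spark.sql.functions.expr
import spark.implicits._  // assumes a spark-shell session where `spark` exists

// Hypothetical sketch: a string date column plus a calendar interval, which
// per the comment above is accepted through the DataFrame API.
val df = Seq("2017-12-24").toDF("d")
df.select($"d" + expr("INTERVAL 2 MONTHS 2 SECONDS")).show()
```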
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20023

The problem found by this PR is just one of the cases where Spark SQL returns NULL when it is unable to process the input. Below is another example; I believe we can find more.
```sql
SELECT CAST('a' AS TIMESTAMP)
```
Before deciding how to fix these issues (one by one, or as a whole), we need to do more investigation and identify all of them. We also need to clearly document our current behaviors so that our users know what results they can expect.

Yeah! Please go ahead and create a new PR for adding more tests. Thanks!
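A quick hedged illustration of that silent-NULL behavior (spark-shell; the output shape is an assumption about the behavior being discussed):

```scala
// Hypothetical illustration: the problematic cast discussed above currently
// returns NULL instead of raising an error.
spark.sql("SELECT CAST('a' AS TIMESTAMP)").show()
// expected: a single row containing null, with no error raised
```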
[GitHub] spark pull request #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20079
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20079 Thanks! Merged to master.
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20079 LGTM, thanks @dongjoon-hyun
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85385/
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20059 Merged build finished. Test PASSed.
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059

**[Test build #85385 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85385/testReport)** for PR 20059 at commit [`818abaf`](https://github.com/apache/spark/commit/818abaf46d8cb4d92f9940e2b59ad6cf27e5da44).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19954 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85384/
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19954 Merged build finished. Test PASSed.
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19954

**[Test build #85384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85384/testReport)** for PR 19954 at commit [`c51bc56`](https://github.com/apache/spark/commit/c51bc560bb2ae0d5ea8d914e84d7485d333f497e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20079 Merged build finished. Test PASSed.
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20079 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85386/
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20079

**[Test build #85386 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85386/testReport)** for PR 20079 at commit [`cb2868d`](https://github.com/apache/spark/commit/cb2868d9de9c2d1a89bcb410a314bbc29f1003f1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20072 Merged build finished. Test PASSed.
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20072 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85383/
[GitHub] spark issue #20072: [SPARK-22790][SQL] add a configurable factor to describe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20072 **[Test build #85383 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85383/testReport)** for PR 20072 at commit [`ec275a8`](https://github.com/apache/spark/commit/ec275a841a7bb4c23b277f915debeed54e6cf7ea). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19929: [SPARK-22629][PYTHON] Add deterministic flag to pyspark ...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19929 @gatorsmile, yes, the reason the seed doesn't work lies in the way Python UDFs are executed: a new Python process is created for each partition to evaluate the Python UDF. The seed is therefore set only on the driver, not in the worker processes where the UDF actually runs. This is easy to confirm:
```
>>> from pyspark.sql.functions import udf
>>> import os
>>> pid_udf = udf(lambda: str(os.getpid()))
>>> spark.range(2).select(pid_udf()).show()
+----------+
|<lambda>()|
+----------+
|      4132|
|      4130|
+----------+
>>> os.getpid()
4070
```
So there is no easy way to set the seed. If I set it inside the UDF, the UDF becomes deterministic. I therefore think the current test is the best option. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
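To illustrate the last point, a minimal hypothetical sketch (not part of the PR): reseeding inside the lambda runs in every worker process, so every row collapses to the same value and the UDF stops exercising randomness at all.
```
>>> import random
>>> from pyspark.sql.functions import udf
>>> # random.seed(42) returns None, so `or` falls through to random.random();
>>> # because the generator is reseeded on every call, all rows come out identical.
>>> seeded_udf = udf(lambda: str(random.seed(42) or random.random()))
>>> spark.range(2).select(seeded_udf()).show()
```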
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20079 Thank you, @gatorsmile ! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20079 **[Test build #85386 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85386/testReport)** for PR 20079 at commit [`cb2868d`](https://github.com/apache/spark/commit/cb2868d9de9c2d1a89bcb410a314bbc29f1003f1). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of Versio...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20079 cc @gatorsmile --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20064: [SPARK-22893][SQL] Unified the data type mismatch messag...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20064 Hi, @gatorsmile and @wangyum. This PR seems to break the Jenkins tests. Please see my hotfix. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20079: [SPARK-22893][SQL][HOTFIX] Fix a error message of...
GitHub user dongjoon-hyun opened a pull request: https://github.com/apache/spark/pull/20079

[SPARK-22893][SQL][HOTFIX] Fix a error message of VersionsSuite

## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/20064 breaks the Jenkins tests because it failed to update one error message for Hive 0.12 and Hive 0.13. This PR fixes that.

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/3924/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/3977/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4226/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4260/

## How was this patch tested?

Pass the Jenkins tests without failure.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-22893

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20079.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20079

commit cb2868d9de9c2d1a89bcb410a314bbc29f1003f1
Author: Dongjoon Hyun
Date: 2017-12-25T19:08:14Z

[SPARK-22893][SQL][HOTFIX] Fix a error message of VersionsSuite

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/20023 Thanks @gatorsmile. Should I then create a follow-up PR for #20008 to cover cases 2 and 3 before moving forward with this PR, or can we proceed with this PR and the test cases it adds? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20059: [SPARK-22648][K8s] Add documentation covering init conta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20059 **[Test build #85385 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85385/testReport)** for PR 20059 at commit [`818abaf`](https://github.com/apache/spark/commit/818abaf46d8cb4d92f9940e2b59ad6cf27e5da44). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19954: [SPARK-22757][Kubernetes] Enable use of remote dependenc...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19954 **[Test build #85384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85384/testReport)** for PR 19954 at commit [`c51bc56`](https://github.com/apache/spark/commit/c51bc560bb2ae0d5ea8d914e84d7485d333f497e). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651951 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala --- @@ -45,6 +45,59 @@ private[spark] class KubernetesClusterManager extends ExternalClusterManager wit masterURL: String, scheduler: TaskScheduler): SchedulerBackend = { val sparkConf = sc.getConf +val initContainerConfigMap = sparkConf.get(INIT_CONTAINER_CONFIG_MAP_NAME) +val initContainerConfigMapKey = sparkConf.get(INIT_CONTAINER_CONFIG_MAP_KEY_CONF) + +if (initContainerConfigMap.isEmpty) { + logWarning("The executor's init-container config map was not specified. Executors will " + --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651945 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/rest/k8s/SparkPodInitContainer.scala --- @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.rest.k8s + +import java.io.File +import java.util.concurrent.TimeUnit + +import scala.concurrent.{ExecutionContext, Future} + +import org.apache.spark.{SecurityManager => SparkSecurityManager, SparkConf} +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.internal.Logging +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * Process that fetches files from a resource staging server and/or arbitrary remote locations. + * + * The init-container can handle fetching files from any of those sources, but not all of the + * sources need to be specified. This allows for composing multiple instances of this container + * with different configurations for different download sources, or using the same container to + * download everything at once. + */ +private[spark] class SparkPodInitContainer( +sparkConf: SparkConf, +fileFetcher: FileFetcher) extends Logging { + + private val maxThreadPoolSize = sparkConf.get(INIT_CONTAINER_MAX_THREAD_POOL_SIZE) + private implicit val downloadExecutor = ExecutionContext.fromExecutorService( +ThreadUtils.newDaemonCachedThreadPool("download-executor", maxThreadPoolSize)) + + private val jarsDownloadDir = new File(sparkConf.get(JARS_DOWNLOAD_LOCATION)) + private val filesDownloadDir = new File(sparkConf.get(FILES_DOWNLOAD_LOCATION)) + + private val remoteJars = sparkConf.get(INIT_CONTAINER_REMOTE_JARS) + private val remoteFiles = sparkConf.get(INIT_CONTAINER_REMOTE_FILES) + + private val downloadTimeoutMinutes = sparkConf.get(INIT_CONTAINER_MOUNT_TIMEOUT) + + def run(): Unit = { +logInfo(s"Downloading remote jars: $remoteJars") +downloadFiles( + remoteJars, + jarsDownloadDir, + s"Remote jars download directory specified at $jarsDownloadDir does not exist " + +"or is not a directory.") + +logInfo(s"Downloading remote files: $remoteFiles") +downloadFiles( + remoteFiles, + filesDownloadDir, + s"Remote files download directory specified at $filesDownloadDir does not exist " + +"or is not a directory.") + +downloadExecutor.shutdown() +downloadExecutor.awaitTermination(downloadTimeoutMinutes, TimeUnit.MINUTES) + } + + private def downloadFiles( + filesCommaSeparated: Option[String], + downloadDir: File, + errMessageOnDestinationNotADirectory: String): Unit = { --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
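For readers following the diff, the download logic boils down to the following Python sketch; the helper names and the fetch body are assumptions for illustration, not the PR's code. It submits one task per remote URI to a bounded pool and then waits out an overall timeout, mirroring `maxThreadPoolSize` and `downloadTimeoutMinutes` above.
```
from concurrent.futures import ThreadPoolExecutor, wait

def download_all(uris, dest_dir, max_threads=5, timeout_minutes=5):
    # One task per remote URI on a bounded pool, like the cached daemon
    # thread pool capped at maxThreadPoolSize in the Scala code above.
    def fetch(uri):
        # Placeholder body; the real init-container delegates to a FileFetcher.
        print(f"downloading {uri} -> {dest_dir}")

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        pending = [pool.submit(fetch, u) for u in uris]
        # Counterpart of awaitTermination(downloadTimeoutMinutes, TimeUnit.MINUTES).
        wait(pending, timeout=timeout_minutes * 60)

download_all(["https://example.com/dep.jar"], "/var/spark-data/spark-jars")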
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651953 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala --- @@ -45,6 +45,59 @@ private[spark] class KubernetesClusterManager extends ExternalClusterManager wit masterURL: String, scheduler: TaskScheduler): SchedulerBackend = { val sparkConf = sc.getConf +val initContainerConfigMap = sparkConf.get(INIT_CONTAINER_CONFIG_MAP_NAME) +val initContainerConfigMapKey = sparkConf.get(INIT_CONTAINER_CONFIG_MAP_KEY_CONF) + +if (initContainerConfigMap.isEmpty) { + logWarning("The executor's init-container config map was not specified. Executors will " + +"therefore not attempt to fetch remote or submitted dependencies.") +} + +if (initContainerConfigMapKey.isEmpty) { + logWarning("The executor's init-container config map key was not specified. Executors will " + --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19683 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85382/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651933 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -132,30 +131,84 @@ private[spark] object Config extends Logging { val JARS_DOWNLOAD_LOCATION = ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir") - .doc("Location to download jars to in the driver and executors. When using" + -" spark-submit, this directory must be empty and will be mounted as an empty directory" + -" volume on the driver and executor pod.") + .doc("Location to download jars to in the driver and executors. When using " + +"spark-submit, this directory must be empty and will be mounted as an empty directory " + +"volume on the driver and executor pod.") .stringConf .createWithDefault("/var/spark-data/spark-jars") val FILES_DOWNLOAD_LOCATION = ConfigBuilder("spark.kubernetes.mountDependencies.filesDownloadDir") - .doc("Location to download files to in the driver and executors. When using" + -" spark-submit, this directory must be empty and will be mounted as an empty directory" + -" volume on the driver and executor pods.") + .doc("Location to download files to in the driver and executors. When using " + +"spark-submit, this directory must be empty and will be mounted as an empty directory " + +"volume on the driver and executor pods.") .stringConf .createWithDefault("/var/spark-data/spark-files") + val INIT_CONTAINER_IMAGE = +ConfigBuilder("spark.kubernetes.initContainer.image") + .doc("Image for the driver and executor's init-container for downloading dependencies.") + .stringConf + .createOptional + + val INIT_CONTAINER_MOUNT_TIMEOUT = +ConfigBuilder("spark.kubernetes.mountDependencies.timeout") + .doc("Timeout before aborting the attempt to download and unpack dependencies from remote " + +"locations into the driver and executor pods.") + .timeConf(TimeUnit.MINUTES) --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19683 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651930 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/DriverConfigOrchestrator.scala --- @@ -98,28 +109,62 @@ private[spark] class DriverConfigurationStepsOrchestrator( None } -val sparkJars = submissionSparkConf.getOption("spark.jars") +val sparkJars = sparkConf.getOption("spark.jars") .map(_.split(",")) .getOrElse(Array.empty[String]) ++ additionalMainAppJar.toSeq -val sparkFiles = submissionSparkConf.getOption("spark.files") +val sparkFiles = sparkConf.getOption("spark.files") .map(_.split(",")) .getOrElse(Array.empty[String]) -val maybeDependencyResolutionStep = if (sparkJars.nonEmpty || sparkFiles.nonEmpty) { - Some(new DependencyResolutionStep( +val dependencyResolutionStep = if (sparkJars.nonEmpty || sparkFiles.nonEmpty) { + Seq(new DependencyResolutionStep( sparkJars, sparkFiles, jarsDownloadPath, filesDownloadPath)) } else { - None + Nil +} + +val initContainerBootstrapStep = if (areAnyFilesNonContainerLocal(sparkJars ++ sparkFiles)) { + val orchestrator = new InitContainerConfigOrchestrator( +sparkJars, +sparkFiles, +jarsDownloadPath, +filesDownloadPath, +imagePullPolicy, +initContainerConfigMapName, +INIT_CONTAINER_PROPERTIES_FILE_NAME, +sparkConf) + val bootstrapStep = new DriverInitContainerBootstrapStep( +orchestrator.getAllConfigurationSteps, +initContainerConfigMapName, +INIT_CONTAINER_PROPERTIES_FILE_NAME) + + Seq(bootstrapStep) +} else { + Nil +} + +val mountSecretsStep = if (secretNamesToMountPaths.nonEmpty) { + Seq(new DriverMountSecretsStep(new MountSecretsBootstrap(secretNamesToMountPaths))) +} else { + Nil } Seq( initialSubmissionStep, - driverAddressStep, + serviceBootstrapStep, kubernetesCredentialsStep) ++ - maybeDependencyResolutionStep.toSeq + dependencyResolutionStep ++ + initContainerBootstrapStep ++ + mountSecretsStep + } + + private def areAnyFilesNonContainerLocal(files: Seq[String]): Boolean = { --- End diff -- Done. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
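The pattern in this diff is that each optional feature contributes either a one-element list or an empty one, and the final plan is plain concatenation. A small hypothetical Python sketch of the same idea (step names follow the diff's variables but are placeholders):
```
# Optional features contribute an empty list when disabled, so the final
# plan is built by concatenation -- the Seq ++ Nil pattern in the diff above.
def build_steps(spark_jars, spark_files, has_remote_deps, secret_mounts):
    dependency_steps = ["DependencyResolutionStep"] if (spark_jars or spark_files) else []
    init_container_steps = ["DriverInitContainerBootstrapStep"] if has_remote_deps else []
    secret_steps = ["DriverMountSecretsStep"] if secret_mounts else []
    return (["InitialSubmissionStep", "ServiceBootstrapStep", "KubernetesCredentialsStep"]
            + dependency_steps + init_container_steps + secret_steps)

print(build_steps(["app.jar"], [], True, {"db-creds": "/etc/secrets"}))
```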
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651915 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -132,30 +131,84 @@ private[spark] object Config extends Logging { val JARS_DOWNLOAD_LOCATION = ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir") - .doc("Location to download jars to in the driver and executors. When using" + -" spark-submit, this directory must be empty and will be mounted as an empty directory" + -" volume on the driver and executor pod.") + .doc("Location to download jars to in the driver and executors. When using " + +"spark-submit, this directory must be empty and will be mounted as an empty directory " + +"volume on the driver and executor pod.") .stringConf .createWithDefault("/var/spark-data/spark-jars") val FILES_DOWNLOAD_LOCATION = ConfigBuilder("spark.kubernetes.mountDependencies.filesDownloadDir") - .doc("Location to download files to in the driver and executors. When using" + -" spark-submit, this directory must be empty and will be mounted as an empty directory" + -" volume on the driver and executor pods.") + .doc("Location to download files to in the driver and executors. When using " + +"spark-submit, this directory must be empty and will be mounted as an empty directory " + +"volume on the driver and executor pods.") .stringConf .createWithDefault("/var/spark-data/spark-files") + val INIT_CONTAINER_IMAGE = +ConfigBuilder("spark.kubernetes.initContainer.image") + .doc("Image for the driver and executor's init-container for downloading dependencies.") + .stringConf + .createOptional + + val INIT_CONTAINER_MOUNT_TIMEOUT = +ConfigBuilder("spark.kubernetes.mountDependencies.timeout") --- End diff -- Please see the response regarding `spark.kubernetes.mountDependencies.maxSimultaneousDownloads`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19683: [SPARK-21657][SQL] optimize explode quadratic memory con...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19683 **[Test build #85382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85382/testReport)** for PR 19683 at commit [`6caa0d5`](https://github.com/apache/spark/commit/6caa0d5f75336d954808109eddd207c56262ad04). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651909 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala --- @@ -132,30 +131,84 @@ private[spark] object Config extends Logging { val JARS_DOWNLOAD_LOCATION = ConfigBuilder("spark.kubernetes.mountDependencies.jarsDownloadDir") - .doc("Location to download jars to in the driver and executors. When using" + -" spark-submit, this directory must be empty and will be mounted as an empty directory" + -" volume on the driver and executor pod.") + .doc("Location to download jars to in the driver and executors. When using " + +"spark-submit, this directory must be empty and will be mounted as an empty directory " + +"volume on the driver and executor pod.") .stringConf .createWithDefault("/var/spark-data/spark-jars") val FILES_DOWNLOAD_LOCATION = ConfigBuilder("spark.kubernetes.mountDependencies.filesDownloadDir") - .doc("Location to download files to in the driver and executors. When using" + -" spark-submit, this directory must be empty and will be mounted as an empty directory" + -" volume on the driver and executor pods.") + .doc("Location to download files to in the driver and executors. When using " + +"spark-submit, this directory must be empty and will be mounted as an empty directory " + +"volume on the driver and executor pods.") .stringConf .createWithDefault("/var/spark-data/spark-files") + val INIT_CONTAINER_IMAGE = +ConfigBuilder("spark.kubernetes.initContainer.image") + .doc("Image for the driver and executor's init-container for downloading dependencies.") + .stringConf + .createOptional + + val INIT_CONTAINER_MOUNT_TIMEOUT = +ConfigBuilder("spark.kubernetes.mountDependencies.timeout") + .doc("Timeout before aborting the attempt to download and unpack dependencies from remote " + +"locations into the driver and executor pods.") + .timeConf(TimeUnit.MINUTES) + .createWithDefault(5) + + val INIT_CONTAINER_MAX_THREAD_POOL_SIZE = + ConfigBuilder("spark.kubernetes.mountDependencies.maxSimultaneousDownloads") --- End diff -- I think the current name is already pretty long. Adding `initContainer` makes it even longer without much added value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
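For context, a hedged sketch of how an end user would set the options debated in this thread; the key names come from the `Config.scala` diff above, while the values are illustrative only.
```
from pyspark import SparkConf

conf = (
    SparkConf()
    # Key names are from the diff; the values below are examples only.
    .set("spark.kubernetes.initContainer.image", "spark-init:latest")
    .set("spark.kubernetes.mountDependencies.timeout", "5")  # timeConf, in minutes
    .set("spark.kubernetes.mountDependencies.maxSimultaneousDownloads", "5")
)
```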
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651386 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/rest/k8s/SparkPodInitContainer.scala --- @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.deploy.rest.k8s + +import java.io.File +import java.util.concurrent.TimeUnit + +import scala.concurrent.{ExecutionContext, Future} + +import org.apache.spark.{SecurityManager => SparkSecurityManager, SparkConf} +import org.apache.spark.deploy.SparkHadoopUtil +import org.apache.spark.deploy.k8s.Config._ +import org.apache.spark.internal.Logging +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * Process that fetches files from a resource staging server and/or arbitrary remote locations. + * + * The init-container can handle fetching files from any of those sources, but not all of the + * sources need to be specified. This allows for composing multiple instances of this container + * with different configurations for different download sources, or using the same container to + * download everything at once. + */ +private[spark] class SparkPodInitContainer( +sparkConf: SparkConf, +fileFetcher: FileFetcher) extends Logging { + + private val maxThreadPoolSize = sparkConf.get(INIT_CONTAINER_MAX_THREAD_POOL_SIZE) + private implicit val downloadExecutor = ExecutionContext.fromExecutorService( +ThreadUtils.newDaemonCachedThreadPool("download-executor", maxThreadPoolSize)) + + private val jarsDownloadDir = new File(sparkConf.get(JARS_DOWNLOAD_LOCATION)) + private val filesDownloadDir = new File(sparkConf.get(FILES_DOWNLOAD_LOCATION)) + + private val remoteJars = sparkConf.get(INIT_CONTAINER_REMOTE_JARS) + private val remoteFiles = sparkConf.get(INIT_CONTAINER_REMOTE_FILES) + + private val downloadTimeoutMinutes = sparkConf.get(INIT_CONTAINER_MOUNT_TIMEOUT) + + def run(): Unit = { +logInfo(s"Downloading remote jars: $remoteJars") +downloadFiles( + remoteJars, + jarsDownloadDir, + s"Remote jars download directory specified at $jarsDownloadDir does not exist " + +"or is not a directory.") + +logInfo(s"Downloading remote files: $remoteFiles") +downloadFiles( + remoteFiles, + filesDownloadDir, + s"Remote files download directory specified at $filesDownloadDir does not exist " + +"or is not a directory.") + +downloadExecutor.shutdown() +downloadExecutor.awaitTermination(downloadTimeoutMinutes, TimeUnit.MINUTES) + } + + private def downloadFiles( + filesCommaSeparated: Option[String], + downloadDir: File, + errMessageOnDestinationNotADirectory: String): Unit = { --- End diff -- nit: `errMessageOnDestinationNotADirectory` -> `errMessage`? 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19954: [SPARK-22757][Kubernetes] Enable use of remote de...
Github user jiangxb1987 commented on a diff in the pull request: https://github.com/apache/spark/pull/19954#discussion_r158651523 --- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala --- @@ -45,6 +45,59 @@ private[spark] class KubernetesClusterManager extends ExternalClusterManager wit masterURL: String, scheduler: TaskScheduler): SchedulerBackend = { val sparkConf = sc.getConf +val initContainerConfigMap = sparkConf.get(INIT_CONTAINER_CONFIG_MAP_NAME) +val initContainerConfigMapKey = sparkConf.get(INIT_CONTAINER_CONFIG_MAP_KEY_CONF) + +if (initContainerConfigMap.isEmpty) { + logWarning("The executor's init-container config map was not specified. Executors will " + +"therefore not attempt to fetch remote or submitted dependencies.") +} + +if (initContainerConfigMapKey.isEmpty) { + logWarning("The executor's init-container config map key was not specified. Executors will " + --- End diff -- nit: `was not` -> `is not` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org