[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 @mridulm A shuffled RDD will never be deterministic unless the shuffle key is the entire record and key ordering is specified. The reduce task fetches multiple remote shuffle blocks at the same time, so the order is always random. In addition, Spark SQL never specifies key ordering. Checkpointing will cut down the RDD lineage, and change the RDD dependency to a `OneToOneDependency` of `CheckpointRDD`, so we don't need to care about it. @tgravescs I forgot to mention that it's a temporary workaround to fail with result task. I looked into it and we need to change the semantics of `FileCommitProtocol` to fix it. Maybe it's better to do it in Spark 3.0? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
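The ordering argument above can be illustrated without Spark. The following is a minimal plain-Scala sketch, not Spark code: `fetch`, `blocks`, and `arrival` are hypothetical names that model a reduce task concatenating shuffle blocks in whatever order they happen to arrive. It suggests why sorting by key alone does not restore a deterministic sequence, while sorting by the entire record does.

```scala
// Plain-Scala model of shuffle fetch order; every name here is hypothetical.
object ShuffleOrderSketch {
  type Record = (String, Int)

  // A reduce task sees its blocks concatenated in arrival order.
  def fetch(blocks: Seq[Seq[Record]], arrival: Seq[Int]): Seq[Record] =
    arrival.flatMap(i => blocks(i))

  def main(args: Array[String]): Unit = {
    val blocks = Seq(Seq("a" -> 1, "b" -> 2), Seq("a" -> 3), Seq("c" -> 4, "b" -> 5))

    val first = fetch(blocks, Seq(0, 1, 2)) // first attempt
    val retry = fetch(blocks, Seq(2, 0, 1)) // retry: same blocks, other arrival order

    // Same records, different sequence.
    println(first == retry)
    // Sorting by key only is stable, so ties keep their arrival order.
    println(first.sortBy(_._1) == retry.sortBy(_._1))
    // Sorting by the entire record gives one canonical sequence.
    println(first.sorted == retry.sorted)
  }
}
```

In this model only the last comparison holds, which matches the condition stated in the comment: the key must be the entire record and an ordering must be specified.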
[GitHub] spark pull request #22112: [SPARK-23243][Core] Fix RDD.repartition() data co...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22112#discussion_r211065925

    --- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
    @@ -1864,6 +1877,22 @@ abstract class RDD[T: ClassTag](
       // From performance concern, cache the value to avoid repeatedly compute `isBarrier()` on a long
       // RDD chain.
       @transient protected lazy val isBarrier_ : Boolean = dependencies.exists(_.rdd.isBarrier())
    +
    +  /**
    +   * Whether the RDD's computing function is idempotent. Idempotent means the computing function
    +   * not only satisfies the requirement, but also produce the same output sequence(the output order
    +   * can't vary) given the same input sequence. Spark assumes all the RDDs are idempotent, except
    +   * for the shuffle RDD and RDDs derived from non-idempotent RDD.
    +   */
    --- End diff --

    Yes, that is expected, unless the computing function sorts the input data. In that case, we can override `isIdempotent`.
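To make the propagation rule in the doc comment concrete, here is a small sketch under stated assumptions. `Node`, `ShuffledNode`, and `SortedNode` are hypothetical stand-ins, not Spark's RDD classes, and the sorting node is assumed to impose a total order over entire records so that its output no longer depends on input order.

```scala
// Hypothetical model of how an `isIdempotent` flag might propagate along a lineage.
object IdempotencySketch {
  class Node(val name: String, val parents: Seq[Node] = Nil) {
    // Default rule: idempotent only if every parent is idempotent.
    def isIdempotent: Boolean = parents.forall(_.isIdempotent)
  }

  // Shuffle output order can vary between attempts.
  class ShuffledNode(name: String, parents: Seq[Node]) extends Node(name, parents) {
    override def isIdempotent: Boolean = false
  }

  // Assumed to sort by the entire record, so input order no longer matters.
  class SortedNode(name: String, parents: Seq[Node]) extends Node(name, parents) {
    override def isIdempotent: Boolean = true
  }

  def main(args: Array[String]): Unit = {
    val source = new Node("source")
    val shuffled = new ShuffledNode("shuffled", Seq(source))
    val mapped = new Node("mapped", Seq(shuffled))
    val sorted = new SortedNode("sorted", Seq(mapped))
    val downstream = new Node("downstream", Seq(sorted))

    Seq(source, shuffled, mapped, sorted, downstream)
      .foreach(n => println(s"${n.name}: ${n.isIdempotent}"))
  }
}
```

The override on the sorting node is what the reply refers to: without it, everything downstream of a shuffle would inherit the non-idempotent flag.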
[GitHub] spark issue #20725: [SPARK-23555][PYTHON] Add BinaryType support for Arrow i...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20725 merged to master, thanks @shaneknapp and @HyukjinKwon !
[GitHub] spark pull request #20725: [SPARK-23555][PYTHON] Add BinaryType support for ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20725
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22138 Merged build finished. Test PASSed.
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22138 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94913/ Test PASSed.
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22138 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94914/ Test PASSed.
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22138 Merged build finished. Test PASSed.
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22138 **[Test build #94913 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94913/testReport)** for PR 22138 at commit [`94231fe`](https://github.com/apache/spark/commit/94231fef1f2f59cea1625fd1f71bd99372a8e800). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22138 **[Test build #94914 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94914/testReport)** for PR 22138 at commit [`fd728ef`](https://github.com/apache/spark/commit/fd728ef8c99ebb33d6dba5466e6a8dba8984248d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/20637 cc @ueshin @cloud-fan
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Merged build finished. Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94915/ Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21320 **[Test build #94915 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94915/testReport)** for PR 21320 at commit [`1573ae8`](https://github.com/apache/spark/commit/1573ae888d651a51e0d60683117714fba7c55fb0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22137: [MINOR][DOC][SQL] use one line for annotation arg value
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22137 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94911/ Test PASSed.
[GitHub] spark issue #22137: [MINOR][DOC][SQL] use one line for annotation arg value
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22137 Merged build finished. Test PASSed.
[GitHub] spark issue #22137: [MINOR][DOC][SQL] use one line for annotation arg value
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22137 **[Test build #94911 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94911/testReport)** for PR 22137 at commit [`e9a9376`](https://github.com/apache/spark/commit/e9a93762aeeb219cf9ab600da248a0d1f295d09f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22131: [SPARK-25141][SQL][TEST] Modify tests for higher-order f...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22131 **[Test build #94919 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94919/testReport)** for PR 22131 at commit [`6f9660d`](https://github.com/apache/spark/commit/6f9660d79e2ae8b7c64dbfea850c514ad3404f37).
[GitHub] spark issue #22131: [SPARK-25141][SQL][TEST] Modify tests for higher-order f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22131 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2295/ Test PASSed.
[GitHub] spark issue #22131: [SPARK-25141][SQL][TEST] Modify tests for higher-order f...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22131 Merged build finished. Test PASSed.
[GitHub] spark issue #22131: [SPARK-25141][SQL][TEST] Modify tests for higher-order f...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22131 @mgaido91 @mn-mikke On second thought, how about this? If you don't like it, I'll revert it soon.
[GitHub] spark issue #22130: [SPARK-25137][Spark Shell] NumberFormatException` when s...
Github user vinodkc commented on the issue: https://github.com/apache/spark/pull/22130 @dongjoon-hyun, thanks for taking a look at this PR. I've added the Mac OS version to the PR description. IMO, an update of ncurses is causing this issue. Reference: https://github.com/jline/jline2/issues/281
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21909 Merged build finished. Test PASSed.
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21909 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94909/ Test PASSed.
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21909 **[Test build #94909 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94909/testReport)** for PR 21909 at commit [`96a94cc`](https://github.com/apache/spark/commit/96a94ccaed1f68fa7eaf3fc286540e531d9a9506). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20226: [SPARK-23034][SQL] Override `nodeName` for all *ScanExec...
Github user maropu commented on the issue: https://github.com/apache/spark/pull/20226 sure, will do, too.
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21909 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94908/ Test FAILed.
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21909 Merged build finished. Test FAILed.
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21909 **[Test build #94908 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94908/testReport)** for PR 21909 at commit [`2d8e754`](https://github.com/apache/spark/commit/2d8e754e699076c8a5915e7faf971e4bd2a5c1fd). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Merged build finished. Test FAILed.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21306 **[Test build #94918 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94918/testReport)** for PR 21306 at commit [`dca4bf8`](https://github.com/apache/spark/commit/dca4bf8176eaa92de295de54488c3398256e0f7a). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class V1TableCatalog(sessionState: SessionState) extends TableCatalog `
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94918/ Test FAILed.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21306 **[Test build #94918 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94918/testReport)** for PR 21306 at commit [`dca4bf8`](https://github.com/apache/spark/commit/dca4bf8176eaa92de295de54488c3398256e0f7a).
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Merged build finished. Test FAILed.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21306 **[Test build #94917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94917/testReport)** for PR 21306 at commit [`fa0edeb`](https://github.com/apache/spark/commit/fa0edeb1570485cc7d6cd0f848caaaf20480f384). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class V1TableCatalog(sessionState: SessionState) extends TableCatalog `
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94917/ Test FAILed.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Merged build finished. Test PASSed.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2294/ Test PASSed.
[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add catalog registration and t...
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21306#discussion_r211057651

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/catalog/v2/V1MetadataTable.scala ---
    @@ -0,0 +1,118 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.catalog.v2
    +
    +import java.util
    +
    +import scala.collection.JavaConverters._
    +
    +import org.apache.spark.sql.SaveMode
    +import org.apache.spark.sql.catalog.v2.PartitionTransforms.{bucket, identity}
    +import org.apache.spark.sql.catalyst.catalog.CatalogTable
    +import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport, WriteSupport}
    +import org.apache.spark.sql.sources.v2.reader.DataSourceReader
    +import org.apache.spark.sql.sources.v2.writer.DataSourceWriter
    +import org.apache.spark.sql.types.StructType
    +
    +/**
    + * An implementation of catalog v2 [[Table]] to expose v1 table metadata.
    + */
    +private[sql] class V1MetadataTable(
    --- End diff --

    @cloud-fan, I updated this PR that adds the `TableCatalog` API to include an implementation that uses the existing `SessionCatalog`. This `Table` class demonstrates how `Table` would implement `ReadSupport` and `WriteSupport`.

    The catalog returns these tables, which have `ReadSupport` and `WriteSupport` mixed in depending on whether the underlying `DataSourceV2` also supports them. In your updated API, it would use the `ReadSupportProvider` instead of the `DataSourceV2` directly, but the difference isn't very large.

    The follow-up PR for CTAS and RTAS, #21877, demonstrates how this would be used in the new logical plans.
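The idea of mixing capabilities into a table depending on what the underlying source supports can be sketched in plain Scala. This is only an illustration: `CanRead`, `CanWrite`, `Source`, `Table`, and `tableFor` are hypothetical, simplified names and are not the `ReadSupport`, `WriteSupport`, or `DataSourceV2` interfaces from the PR.

```scala
// Hypothetical capability mix-ins; not the real DataSourceV2 interfaces.
object TableMixinSketch {
  trait CanRead { def read(): Seq[String] }
  trait CanWrite { def write(rows: Seq[String]): Unit }

  // Stand-in for an underlying data source.
  trait Source { def name: String }

  // Stand-in for the table abstraction returned by a catalog.
  class Table(val name: String)

  // Return a table whose mixed-in traits mirror the source's capabilities.
  def tableFor(source: Source): Table = source match {
    case s: CanRead with CanWrite =>
      new Table(source.name) with CanRead with CanWrite {
        def read(): Seq[String] = s.read()
        def write(rows: Seq[String]): Unit = s.write(rows)
      }
    case s: CanRead =>
      new Table(source.name) with CanRead { def read(): Seq[String] = s.read() }
    case s: CanWrite =>
      new Table(source.name) with CanWrite {
        def write(rows: Seq[String]): Unit = s.write(rows)
      }
    case _ => new Table(source.name)
  }

  def main(args: Array[String]): Unit = {
    val readOnly = new Source with CanRead {
      val name = "logs"
      def read(): Seq[String] = Seq("row1")
    }
    val table = tableFor(readOnly)
    println(table.isInstanceOf[CanRead])
    println(table.isInstanceOf[CanWrite])
  }
}
```

A planner could then pattern match on the returned table to decide whether a read or write plan is valid, which is roughly the role the comment describes for the mixed-in traits.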
[GitHub] spark issue #22139: [SPARK-25149][GraphX] Update Parallel Personalized Page ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22139 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94916/ Test PASSed.
[GitHub] spark issue #22139: [SPARK-25149][GraphX] Update Parallel Personalized Page ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22139 **[Test build #94916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94916/testReport)** for PR 22139 at commit [`25dc63a`](https://github.com/apache/spark/commit/25dc63a0ac09ef900770c31e817f230ec98f658f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22139: [SPARK-25149][GraphX] Update Parallel Personalized Page ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22139 Merged build finished. Test PASSed.
[GitHub] spark issue #20226: [SPARK-23034][SQL] Override `nodeName` for all *ScanExec...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20226 @maropu Could you take this over?
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21306 **[Test build #94917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94917/testReport)** for PR 21306 at commit [`fa0edeb`](https://github.com/apache/spark/commit/fa0edeb1570485cc7d6cd0f848caaaf20480f384).
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Merged build finished. Test PASSed.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2293/ Test PASSed.
[GitHub] spark issue #22139: [SPARK-25149][GraphX] Update Parallel Personalized Page ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22139 **[Test build #94916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94916/testReport)** for PR 22139 at commit [`25dc63a`](https://github.com/apache/spark/commit/25dc63a0ac09ef900770c31e817f230ec98f658f).
[GitHub] spark issue #22139: [SPARK-25149][GraphX] Update Parallel Personalized Page ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22139 Merged build finished. Test PASSed.
[GitHub] spark issue #22139: [SPARK-25149][GraphX] Update Parallel Personalized Page ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22139 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2292/ Test PASSed.
[GitHub] spark pull request #22139: [SPARK-25149][GraphX] Update Parallel Personalize...
GitHub user MrBago opened a pull request:

    https://github.com/apache/spark/pull/22139

    [SPARK-25149][GraphX] Update Parallel Personalized Page Rank to test with large vertexIds

    ## What changes were proposed in this pull request?

    runParallelPersonalizedPageRank in graphx checks that `sources` are <= Int.MaxValue.toLong, but this is not actually required. This check seems to have been added because we use sparse vectors in the implementation and sparse vectors cannot be indexed by values > MAX_INT. However, we never index the sparse vector by the source vertexIds, so this isn't an issue. I've added a test with large vertexIds to confirm this works as expected.

    ## How was this patch tested?

    Unit tests.

    Please review http://spark.apache.org/contributing.html before opening a pull request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MrBago/spark remove-veretexId-check-pppr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22139.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22139

commit e720eab9a435a738be9f08ccaefba2f4eb7dc867
Author: Bago Amirbekian
Date: 2018-08-17T23:43:25Z

    Update Parallel Personalized Page Rank to test with large vertexIds
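The reasoning in the PR description can be shown with a small model. This is a sketch, not the GraphX implementation: `SparseVec` and `initialRanks` are hypothetical, and the sketch assumes each per-vertex vector has one slot per source, indexed by the source's position in the `sources` array rather than by its vertexId.

```scala
// Hypothetical model: sparse vectors sized by the number of sources,
// indexed by position in `sources`, never by the vertexId itself.
object SourceIndexSketch {
  type VertexId = Long

  // Minimal sparse vector: Int index -> value, with a fixed logical size.
  final case class SparseVec(size: Int, entries: Map[Int, Double])

  // Each source starts with weight 1.0 in the slot for its own position.
  def initialRanks(sources: Array[VertexId]): Map[VertexId, SparseVec] =
    sources.zipWithIndex.map { case (vid, i) =>
      vid -> SparseVec(sources.length, Map(i -> 1.0))
    }.toMap

  def main(args: Array[String]): Unit = {
    // Both ids exceed Int.MaxValue, yet every vector index stays below sources.length.
    val sources: Array[VertexId] = Array(Int.MaxValue.toLong + 1L, Long.MaxValue)
    val ranks = initialRanks(sources)
    ranks.foreach { case (vid, vec) => println(s"$vid -> $vec") }
  }
}
```

Under that assumption the only Int-bounded quantity is the number of sources, so the magnitude of individual vertexIds does not constrain the vectors.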
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Merged build finished. Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21320 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2291/ Test PASSed.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21320 **[Test build #94915 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94915/testReport)** for PR 21320 at commit [`1573ae8`](https://github.com/apache/spark/commit/1573ae888d651a51e0d60683117714fba7c55fb0).
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22138 **[Test build #94914 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94914/testReport)** for PR 22138 at commit [`fd728ef`](https://github.com/apache/spark/commit/fd728ef8c99ebb33d6dba5466e6a8dba8984248d).
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22138 **[Test build #94913 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94913/testReport)** for PR 22138 at commit [`94231fe`](https://github.com/apache/spark/commit/94231fef1f2f59cea1625fd1f71bd99372a8e800).
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/22138 cc. @tdas @zsxwing @koeninger @arunmahadevan
[GitHub] spark pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to Ka...
Github user HeartSaVioR commented on a diff in the pull request: https://github.com/apache/spark/pull/22138#discussion_r211053868
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala ---
@@ -425,70 +381,36 @@ private[kafka010] object KafkaDataConsumer extends Logging {
   def acquire(
       topicPartition: TopicPartition,
       kafkaParams: ju.Map[String, Object],
-      useCache: Boolean): KafkaDataConsumer = synchronized {
-    val key = new CacheKey(topicPartition, kafkaParams)
-    val existingInternalConsumer = cache.get(key)
+      useCache: Boolean): KafkaDataConsumer = {
-    lazy val newInternalConsumer = new InternalKafkaConsumer(topicPartition, kafkaParams)
+    if (!useCache) {
+      return NonCachedKafkaDataConsumer(new InternalKafkaConsumer(topicPartition, kafkaParams))
+    }
-    if (TaskContext.get != null && TaskContext.get.attemptNumber >= 1) {
-      // If this is reattempt at running the task, then invalidate cached consumer if any and
-      // start with a new one.
-      if (existingInternalConsumer != null) {
-        // Consumer exists in cache. If its in use, mark it for closing later, or close it now.
-        if (existingInternalConsumer.inUse) {
-          existingInternalConsumer.markedForClose = true
-        } else {
-          existingInternalConsumer.close()
-        }
-      }
-      cache.remove(key) // Invalidate the cache in any case
-      NonCachedKafkaDataConsumer(newInternalConsumer)
+    val key = new CacheKey(topicPartition, kafkaParams)
-    } else if (!useCache) {
-      // If planner asks to not reuse consumers, then do not use it, return a new consumer
-      NonCachedKafkaDataConsumer(newInternalConsumer)
+    if (TaskContext.get != null && TaskContext.get.attemptNumber >= 1) {
+      // If this is reattempt at running the task, then invalidate cached consumer if any.
-    } else if (existingInternalConsumer == null) {
-      // If consumer is not already cached, then put a new in the cache and return it
-      cache.put(key, newInternalConsumer)
-      newInternalConsumer.inUse = true
-      CachedKafkaDataConsumer(newInternalConsumer)
+      // invalidate all idle consumers for the key
+      pool.invalidateKey(key)
-    } else if (existingInternalConsumer.inUse) {
-      // If consumer is already cached but is currently in use, then return a new consumer
-      NonCachedKafkaDataConsumer(newInternalConsumer)
+      // borrow a consumer from pool even in this case
+    }
-    } else {
-      // If consumer is already cached and is currently not in use, then return that consumer
-      existingInternalConsumer.inUse = true
-      CachedKafkaDataConsumer(existingInternalConsumer)
+    try {
+      CachedKafkaDataConsumer(pool.borrowObject(key, kafkaParams))
+    } catch { case _: NoSuchElementException =>
+      // There's neither idle object to clean up nor available space in pool:
+      // fail back to create non-cached consumer
--- End diff --
This approach introduces a behavior change. Although `cache` had a capacity, it acted as a soft limit: an item could still be added to the cache when there was neither an idle object nor free space. With the new behavior, KafkaDataConsumer creates every consumer as non-cached whenever the pool is exhausted and there is no idle object to free up. I think this is not a big deal as long as "spark.sql.kafkaConsumerCache.capacity" is configured properly, and a hard capacity makes it easier to determine what's going on. However, we can still mimic the current behavior by using an infinite capacity, so we can go back to the current behavior if we feel it makes more sense.
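The hard-capacity behavior described in the comment above can be illustrated with a small, self-contained sketch. It does not use Apache Commons Pool or any Spark class: `Consumer`, `HardCapacityPool`, `Cached`, `NonCached` and `PoolSketch.acquire` are hypothetical names made up for this illustration, and the eviction rule is a simplification of what a real keyed pool does. It only shows the control flow under discussion: bypass the pool when caching is off, invalidate idle objects for the key on a task reattempt, borrow from the pool, and fall back to a non-cached consumer when the pool is exhausted.

```scala
import scala.collection.mutable

// Hypothetical stand-in for InternalKafkaConsumer (not Spark's class).
final class Consumer(val key: String)

sealed trait Acquired { def consumer: Consumer }
final case class Cached(consumer: Consumer) extends Acquired
final case class NonCached(consumer: Consumer) extends Acquired

// Toy keyed pool with a hard cap on the number of objects it tracks.
// Assumption: this only imitates the behavior described in the comment;
// it is not how Apache Commons Pool is implemented.
final class HardCapacityPool(capacity: Int) {
  private val idle = mutable.Map.empty[String, mutable.Queue[Consumer]]
  private var total = 0 // idle + borrowed objects tracked by the pool

  def borrow(key: String): Consumer = synchronized {
    idle.get(key).filter(_.nonEmpty) match {
      case Some(queue) =>
        queue.dequeue() // reuse an idle object for the same key
      case None if total < capacity =>
        total += 1 // free space: create and track a new object
        new Consumer(key)
      case None =>
        idle.values.find(_.nonEmpty) match {
          case Some(queue) =>
            queue.dequeue() // evict an idle object of another key; total unchanged
            new Consumer(key)
          case None =>
            // Neither an idle object to clean up nor free space.
            throw new NoSuchElementException(s"pool exhausted for $key")
        }
    }
  }

  def giveBack(c: Consumer): Unit = synchronized {
    idle.getOrElseUpdate(c.key, mutable.Queue.empty[Consumer]).enqueue(c)
  }

  // Drop all idle objects for the key.
  def invalidateKey(key: String): Unit = synchronized {
    idle.remove(key).foreach(queue => total -= queue.size)
  }
}

object PoolSketch {
  def acquire(pool: HardCapacityPool, key: String, useCache: Boolean, attemptNumber: Int): Acquired =
    if (!useCache) {
      NonCached(new Consumer(key))
    } else {
      // On a task reattempt, invalidate idle consumers for the key, then still borrow.
      if (attemptNumber >= 1) pool.invalidateKey(key)
      try Cached(pool.borrow(key))
      catch { case _: NoSuchElementException => NonCached(new Consumer(key)) }
    }
}
```

With a capacity of 1, a second concurrent `acquire` for the same key returns a `NonCached` consumer instead of growing the pool, which is the difference from the old soft-capacity cache. Passing a very large capacity approximates the old behavior.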
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21899 Merged build finished. Test FAILed.
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21899 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94910/ Test FAILed.
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21899 **[Test build #94910 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94910/testReport)** for PR 21899 at commit [`829a333`](https://github.com/apache/spark/commit/829a333ad3dc152b90e5257cf67e2134c31e839e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22138 Merged build finished. Test FAILed.
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22138 **[Test build #94912 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94912/testReport)** for PR 22138 at commit [`c82f306`](https://github.com/apache/spark/commit/c82f3064fa8744f91b5c8a92645588dc9d53ba35). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class PooledObjectInvalidated(key: CacheKey, lastInvalidatedTimestamp: Long,` * ` class PoolConfig extends GenericKeyedObjectPoolConfig[InternalKafkaConsumer] ` * ` case class CacheKey(groupId: String, topicPartition: TopicPartition) `
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22138 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94912/ Test FAILed.
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22138 **[Test build #94912 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94912/testReport)** for PR 22138 at commit [`c82f306`](https://github.com/apache/spark/commit/c82f3064fa8744f91b5c8a92645588dc9d53ba35).
[GitHub] spark issue #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaData...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22138 Can one of the admins verify this patch?
[GitHub] spark pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to Ka...
GitHub user HeartSaVioR opened a pull request: https://github.com/apache/spark/pull/22138
[SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
## What changes were proposed in this pull request?
KafkaDataConsumer contains its own logic for caching InternalKafkaConsumer, which looks like it can be simplified by applying Apache Commons Pool. The benefits of applying Apache Commons Pool are as follows:
* We can get rid of the synchronization on the KafkaDataConsumer object while acquiring and returning InternalKafkaConsumer.
* We can extract the object pool feature out of the class, so that the behaviors of the pool can be tested easily.
* We can get various statistics for the object pool, and can also enable JMX for the pool.
This patch brings an additional dependency, Apache Commons Pool 2.6.0, into the `spark-sql-kafka-0-10` module.
## How was this patch tested?
Existing unit tests as well as new tests for the object pool.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HeartSaVioR/spark SPARK-25151
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22138.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22138
commit c82f3064fa8744f91b5c8a92645588dc9d53ba35
Author: Jungtaek Lim
Date: 2018-08-17T09:56:31Z
[SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Merged build finished. Test PASSed.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21306 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94906/ Test PASSed.
[GitHub] spark issue #21306: [SPARK-24252][SQL] Add catalog registration and table ca...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21306 **[Test build #94906 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94906/testReport)** for PR 21306 at commit [`622180a`](https://github.com/apache/spark/commit/622180a50e05b4d968380824f5dbbe5f89e42422). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `public class Transforms ` * ` public static final class Identity extends SingleColumnTransform ` * ` public static final class Bucket extends SingleColumnTransform ` * ` public static final class Year extends SingleColumnTransform ` * ` public static final class Month extends SingleColumnTransform ` * ` public static final class Date extends SingleColumnTransform ` * ` public static final class DateAndHour extends SingleColumnTransform ` * ` public static final class Apply extends SingleColumnTransform `
[GitHub] spark issue #22134: [SPARK-25143][SQL] Support data source name mapping conf...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/22134 I got it. I'll close this approach.
[GitHub] spark pull request #22134: [SPARK-25143][SQL] Support data source name mappi...
Github user dongjoon-hyun closed the pull request at: https://github.com/apache/spark/pull/22134
[GitHub] spark issue #19041: [SPARK-21097][CORE] Add option to recover cached data
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19041 Can one of the admins verify this patch?
[GitHub] spark issue #21584: [SPARK-24433][K8S] Initial R Bindings for SparkR on K8s
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21584 I filed https://issues.apache.org/jira/browse/SPARK-25152 for the integration tests.
[GitHub] spark pull request #21584: [SPARK-24433][K8S] Initial R Bindings for SparkR ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21584
[GitHub] spark issue #21584: [SPARK-24433][K8S] Initial R Bindings for SparkR on K8s
Github user mccheah commented on the issue: https://github.com/apache/spark/pull/21584 Ok I am merging this to master now. Thanks for the work on this!
[GitHub] spark issue #22137: [MINOR][DOC][SQL] use one line for annotation arg value
Github user mengxr commented on the issue: https://github.com/apache/spark/pull/22137 cc: @gatorsmile
[GitHub] spark issue #22137: [MINOR][DOC][SQL] use one line for annotation arg value
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22137 **[Test build #94911 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94911/testReport)** for PR 22137 at commit [`e9a9376`](https://github.com/apache/spark/commit/e9a93762aeeb219cf9ab600da248a0d1f295d09f).
[GitHub] spark issue #22137: [MINOR][DOC][SQL] use one line for annotation arg value
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22137 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2290/ Test PASSed.
[GitHub] spark issue #22137: [MINOR][DOC][SQL] use one line for annotation arg value
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22137 Merged build finished. Test PASSed.
[GitHub] spark pull request #22137: [MINOR][DOC][SQL] use one line for annotation arg...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/22137
[MINOR][DOC][SQL] use one line for annotation arg value
## What changes were proposed in this pull request?
Put annotation args in one line, or API doc generation will fail.
~~~
[error] /Users/meng/src/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:1559: annotation argument needs to be a constant; found: "_FUNC_(expr) - Returns the character length of string data or number of bytes of ".+("binary data. The length of string data includes the trailing spaces. The length of binary ").+("data includes binary zeros.")
[error] "binary data. The length of string data includes the trailing spaces. The length of binary " +
[error] ^
[info] No documentation generated with unsuccessful compiler run
[error] one error found
[error] (catalyst/compile:doc) Scaladoc generation failed
[error] Total time: 27 s, completed Aug 17, 2018 3:20:08 PM
~~~
## How was this patch tested?
sbt catalyst/compile:doc passed
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mengxr/spark minor-doc-fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22137.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22137
commit e9a93762aeeb219cf9ab600da248a0d1f295d09f
Author: Xiangrui Meng
Date: 2018-08-17T22:47:04Z
fix a minor issue to generate API docs
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user cclauss commented on the issue: https://github.com/apache/spark/pull/20838 It was reverted because [__slice_test()__](#22128) was causing the build to fail.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20637 Merged build finished. Test PASSed.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20637 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94904/ Test PASSed.
[GitHub] spark issue #20637: [SPARK-23466][SQL] Remove redundant null checks in gener...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20637 **[Test build #94904 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94904/testReport)** for PR 20637 at commit [`84961b4`](https://github.com/apache/spark/commit/84961b44d0f846e241c322f0f80d8dc032f6008d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21899: [SPARK-24912][SQL] Don't obscure source of OOM du...
Github user bersprockets commented on a diff in the pull request: https://github.com/apache/spark/pull/21899#discussion_r211047556
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala ---
@@ -118,12 +119,20 @@ case class BroadcastExchangeExec(
           // SparkFatalException, which is a subclass of Exception. ThreadUtils.awaitResult
           // will catch this exception and re-throw the wrapped fatal throwable.
           case oe: OutOfMemoryError =>
-            throw new SparkFatalException(
+            val sizeMessage = if (dataSize != -1) {
+              s"${SparkLauncher.DRIVER_MEMORY} by at least the estimated size of the " +
--- End diff --
@hvanhovell That's what was being obscured :). In testing this, I've seen it come from various places. Here are the three cases I have seen first hand:
java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value.
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.grow(HashedRelation.scala:628)
    at org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.append(HashedRelation.scala:570)
    at org.apache.spark.sql.execution.joins.LongHashedRelation$.apply(HashedRelation.scala:865)
At that line is an allocation: val newPage = new Array[Long](newNumWords.toInt)
2nd case:
java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase spark.driver.memory by at least the estimated size of the relation (96468992 bytes).
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:286)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$3.apply(TorrentBroadcast.scala:286)
3rd case:
java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting \ spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value.
    at org.apache.spark.unsafe.memory.MemoryBlock.allocateFromObject(MemoryBlock.java:118)
    at org.apache.spark.sql.catalyst.expressions.UnsafeRow.getUTF8String(UnsafeRow.java:420)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
    at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:311)
At that line is also an allocation: mb = new ByteArrayMemoryBlock(array, offset, length);
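As a rough illustration of how the two message variants quoted above could be assembled, here is a minimal sketch. It is reconstructed only from the diff fragment (`sizeMessage`, `dataSize != -1`) and the quoted error texts, so it may differ from the code in the patch. `OomMessageSketch`, `DriverMemoryKey` and `message` are hypothetical names, and the literal "spark.driver.memory" stands in for `SparkLauncher.DRIVER_MEMORY` so the sketch has no Spark dependency.

```scala
object OomMessageSketch {
  // Stand-in for SparkLauncher.DRIVER_MEMORY, which names this config key.
  val DriverMemoryKey = "spark.driver.memory"

  // dataSize == -1 is taken to mean "size of the relation not known yet",
  // following the `dataSize != -1` check in the diff.
  def message(dataSize: Long): String = {
    val sizeMessage = if (dataSize != -1) {
      s"$DriverMemoryKey by at least the estimated size of the relation ($dataSize bytes)"
    } else {
      s"the spark driver memory by setting $DriverMemoryKey to a higher value"
    }
    "Not enough memory to build and broadcast the table to all worker nodes. " +
      "As a workaround, you can either disable broadcast by setting " +
      s"spark.sql.autoBroadcastJoinThreshold to -1 or increase $sizeMessage."
  }
}
```

Calling `OomMessageSketch.message(96468992L)` yields the wording of the 2nd case, and `OomMessageSketch.message(-1)` yields the wording of the 1st case.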
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r211045699
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala ---
@@ -223,7 +224,8 @@ object MultiLineJsonDataSource extends JsonDataSource {
       input => parser.parse[InputStream](input, streamParser, partitionedFileString),
       parser.options.parseMode,
       schema,
-      parser.options.columnNameOfCorruptRecord)
+      parser.options.columnNameOfCorruptRecord,
+      optimizeEmptySchema = false)
--- End diff --
Could we rename `optimizeEmptySchema` to `isMultiLine`?
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r211045061
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1492,6 +1492,15 @@ object SQLConf {
         "This usually speeds up commands that need to list many directories.")
       .booleanConf
       .createWithDefault(true)
+
+  val BYPASS_PARSER_FOR_EMPTY_SCHEMA =
+    buildConf("spark.sql.legacy.bypassParserForEmptySchema")
--- End diff --
If there is no behavior change, do we still need this conf?
[GitHub] spark pull request #21899: [SPARK-24912][SQL] Don't obscure source of OOM du...
Github user hvanhovell commented on a diff in the pull request: https://github.com/apache/spark/pull/21899#discussion_r211044133
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala ---
@@ -118,12 +119,20 @@ case class BroadcastExchangeExec(
           // SparkFatalException, which is a subclass of Exception. ThreadUtils.awaitResult
           // will catch this exception and re-throw the wrapped fatal throwable.
           case oe: OutOfMemoryError =>
-            throw new SparkFatalException(
+            val sizeMessage = if (dataSize != -1) {
+              s"${SparkLauncher.DRIVER_MEMORY} by at least the estimated size of the " +
--- End diff --
Forgive me for asking a dumb question, but where will this exception come from? The block manager?
[GitHub] spark issue #22085: [WIP][SPARK-25095][PySpark] Python support for BarrierTa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22085 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94901/ Test FAILed.
[GitHub] spark issue #22085: [WIP][SPARK-25095][PySpark] Python support for BarrierTa...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22085 Merged build finished. Test FAILed.
[GitHub] spark issue #22085: [WIP][SPARK-25095][PySpark] Python support for BarrierTa...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22085 **[Test build #94901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94901/testReport)** for PR 22085 at commit [`e234a0a`](https://github.com/apache/spark/commit/e234a0a3d4e740d757fe086b0971a10f621d518b). * This patch **fails from timeout after a configured wait of \`340m\`**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21899 **[Test build #94910 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94910/testReport)** for PR 21899 at commit [`829a333`](https://github.com/apache/spark/commit/829a333ad3dc152b90e5257cf67e2134c31e839e).
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/21899 retest this please
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/21899 @MaxGekk In the updated message, I left out "hash" from the term "hash relation" only because it seems the relation could also be an Array.
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21909 **[Test build #94909 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94909/testReport)** for PR 21909 at commit [`96a94cc`](https://github.com/apache/spark/commit/96a94ccaed1f68fa7eaf3fc286540e531d9a9506).
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21909 **[Test build #94908 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94908/testReport)** for PR 21909 at commit [`2d8e754`](https://github.com/apache/spark/commit/2d8e754e699076c8a5915e7faf971e4bd2a5c1fd).
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21899 Merged build finished. Test FAILed.
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21899 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94907/ Test FAILed.
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21899 **[Test build #94907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94907/testReport)** for PR 21899 at commit [`829a333`](https://github.com/apache/spark/commit/829a333ad3dc152b90e5257cf67e2134c31e839e). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.