[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21859 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21859 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94931/ Test PASSed.
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21859 **[Test build #94931 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94931/testReport)** for PR 21859 at commit [`46bab16`](https://github.com/apache/spark/commit/46bab165af68c1ef2dd1fc57e7f27f5d27c72015). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22135: [SPARK-25093][SQL] Avoid recompiling regexp for c...
Github user igreenfield commented on a diff in the pull request: https://github.com/apache/spark/pull/22135#discussion_r211091975
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeFormatter.scala ---
@@ -91,10 +94,7 @@ object CodeFormatter {
   }

   def stripExtraNewLinesAndComments(input: String): String = {
-    val commentReg =
-      ("""([ |\t]*?\/\*[\s|\S]*?\*\/[ |\t]*?)|""" +    // strip /*comment*/
-       """([ |\t]*?\/\/[\s\S]*?\n)""").r               // strip //comment
-    val codeWithoutComment = commentReg.replaceAllIn(input, "")
+    val codeWithoutComment = commentRegexp.replaceAllIn(input, "")
     codeWithoutComment.replaceAll("""\n\s*\n""", "\n") // strip ExtraNewLines
--- End diff --
This line also compiles a regex and could be replaced!
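A note on the comment above: on the JVM, `String.replaceAll` compiles its pattern argument on every call, so hoisting only the comment regex still leaves one compilation per invocation. The sketch below is a hypothetical Python port of `stripExtraNewLinesAndComments` that compiles both patterns once at module level. The names `COMMENT_RE`, `EXTRA_NEWLINES_RE` and `strip_extra_newlines_and_comments` are illustrative assumptions, not identifiers from the PR, and this is not the change the PR makes.

```python
import re

# Both patterns are compiled once, at import time, instead of on every call.
# COMMENT_RE mirrors the regex in the diff: /* ... */ comments or // ... comments.
COMMENT_RE = re.compile(r"([ |\t]*?/\*[\s|\S]*?\*/[ |\t]*?)|([ |\t]*?//[\s\S]*?\n)")
# EXTRA_NEWLINES_RE mirrors the second pattern, which the reviewer points at.
EXTRA_NEWLINES_RE = re.compile(r"\n\s*\n")


def strip_extra_newlines_and_comments(code: str) -> str:
    """Hypothetical port: drop comments, then collapse blank lines."""
    without_comments = COMMENT_RE.sub("", code)
    return EXTRA_NEWLINES_RE.sub("\n", without_comments)


sample = "int a = 1; // set a\n\n\nint b = 2; /* set b */\n"
cleaned = strip_extra_newlines_and_comments(sample)
print(repr(cleaned))
```

In Scala the equivalent move would be to keep the second pattern in a `val` next to `commentRegexp` and call `replaceAllIn` on it, which is what the comment appears to suggest.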
[GitHub] spark issue #22140: [SPARK-25072][PySpark] Forbid extra value for custom Row
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/22140 cc @HyukjinKwon
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22141 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2304/ Test PASSed.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22141 Merged build finished. Test PASSed.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22141 **[Test build #94932 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94932/testReport)** for PR 22141 at commit [`926c7fc`](https://github.com/apache/spark/commit/926c7fc7a766b1e310798a87ae7a485f731ade99).
[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21931#discussion_r211089742
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala ---
@@ -366,6 +366,43 @@ class AggregateBenchmark extends BenchmarkBase {
     */
   }

+  ignore("capacity for fast hash aggregate") {
+    val N = 20 << 20
+    val M = 1 << 19
+
+    val benchmark = new Benchmark("Aggregate w multiple keys", N)
--- End diff --
`Benchmark("Capacity for fast hash aggregate")`
[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user viirya commented on the issue: https://github.com/apache/spark/pull/21931 Minor comments. LGTM.
[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21931#discussion_r211089705
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala ---
@@ -366,6 +366,43 @@ class AggregateBenchmark extends BenchmarkBase {
     */
   }

+  ignore("capacity for fast hash aggregate") {
+    val N = 20 << 20
+    val M = 1 << 19
+
+    val benchmark = new Benchmark("Aggregate w multiple keys", N)
+    sparkSession.range(N)
+      .selectExpr(
+        "id",
+        s"(id % $M) as k1",
+        s"cast(id % $M as int) as k2",
+        s"cast(id % $M as double) as k3",
+        s"cast(id % $M as float) as k4") .createOrReplaceTempView("test")
+
+    def f(): Unit = sparkSession.sql("select k1, k2, k3, k4, sum(k1), sum(k2), sum(k3), " +
+      "sum(k4) from test group by k1, k2, k3, k4").collect()
+
+    benchmark.addCase(s"fasthash = default") { iter =>
+      sparkSession.conf.set("spark.sql.codegen.aggregate.fastHashMap.capacityBit", "16")
+      f()
+    }
+
+    benchmark.addCase(s"fasthash = config") { iter =>
--- End diff --
"fasthash = 20"?
[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21931#discussion_r211089695
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1437,6 +1437,16 @@ object SQLConf {
     .intConf
     .createWithDefault(20)

+  val FAST_HASH_AGGREGATE_MAX_ROWS_CAPACITY_BIT =
+    buildConf("spark.sql.codegen.aggregate.fastHashMap.capacityBit")
+      .internal()
+      .doc("Capacity for the max number of rows to be held in memory by the fast hash aggregate " +
+        "product operator. the bit not for actual value, but the actual numBuckets is determined " +
--- End diff --
`the bit` -> `The bit is`.
[GitHub] spark pull request #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggrega...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21931#discussion_r211089716
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala ---
@@ -366,6 +366,43 @@ class AggregateBenchmark extends BenchmarkBase {
     */
   }

+  ignore("capacity for fast hash aggregate") {
+    val N = 20 << 20
+    val M = 1 << 19
+
+    val benchmark = new Benchmark("Aggregate w multiple keys", N)
+    sparkSession.range(N)
+      .selectExpr(
+        "id",
+        s"(id % $M) as k1",
+        s"cast(id % $M as int) as k2",
+        s"cast(id % $M as double) as k3",
+        s"cast(id % $M as float) as k4") .createOrReplaceTempView("test")
+
+    def f(): Unit = sparkSession.sql("select k1, k2, k3, k4, sum(k1), sum(k2), sum(k3), " +
+      "sum(k4) from test group by k1, k2, k3, k4").collect()
+
+    benchmark.addCase(s"fasthash = default") { iter =>
--- End diff --
"fasthash = 16"?
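To make the 16-versus-20 comparison in this thread concrete, here is a small arithmetic sketch. It assumes the capacity is simply `2 ** capacityBit`; the doc string quoted in the diff is cut off before it says how `numBuckets` is actually derived, so treat that as an assumption. `fast_map_capacity` is a hypothetical helper, not a Spark API.

```python
def fast_map_capacity(capacity_bit: int) -> int:
    """Assumed relationship: capacity = 2 ** capacity_bit (not confirmed by the diff)."""
    return 1 << capacity_bit


N = 20 << 20  # rows generated by the benchmark: 20,971,520
M = 1 << 19   # distinct values of id % M, so roughly M distinct groups: 524,288

for bit in (16, 20):
    capacity = fast_map_capacity(bit)
    print(f"capacityBit={bit}: capacity={capacity}, groups={M}, all groups fit={capacity >= M}")
```

Under that assumption, a capacity bit of 16 gives 65,536 slots, fewer than the benchmark's group count, while 20 gives 1,048,576, which is presumably why the two benchmark cases differ.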
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user cclauss commented on the issue: https://github.com/apache/spark/pull/20838
This is not working at all... I am wasting way too much time: 5+ months and 80+ comments for 12 lines of code. I do not have the skills to solve the following undefined name 'long' in a satisfactory manner:
```
./python/pyspark/streaming/dstream.py:405:35: F821 undefined name 'long'
        return self._sc._jvm.Time(long(timestamp * 1000))
                                  ^
```
If someone with more skills would be willing to take that one undefined name off my plate and solve it with a test in a separate PR, then I would be grateful. I will study that PR carefully and can then proceed with the others that are in this PR. My recommended fix is at https://github.com/apache/spark/pull/20838/files#diff-6c576c52abc0624ccb6a2f45828dc6a7 and my proposed test (it is failing!) is immediately following.
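For context on the F821 report above: Python 3 has no `long` builtin, because `int` is already arbitrary precision. A common compatibility idiom is to alias the name on Python 3. The sketch below shows that idiom only; it is not necessarily the fix proposed in the PR, and `to_millis` is a hypothetical stand-in for the flagged expression.

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3 removed `long`; alias it to `int`, which has unlimited precision.
    long = int


def to_millis(timestamp):
    """Hypothetical stand-in for `long(timestamp * 1000)` in dstream.py."""
    return long(timestamp * 1000)


print(to_millis(1.5))
```

With the alias in place the name resolves on both major versions, and a linter no longer sees an undefined name in that module.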
[GitHub] spark issue #21931: [SPARK-24978][SQL]Add spark.sql.fast.hash.aggregate.row....
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21931 LGTM, cc @cloud-fan @hvanhovell
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21859 **[Test build #94931 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94931/testReport)** for PR 21859 at commit [`46bab16`](https://github.com/apache/spark/commit/46bab165af68c1ef2dd1fc57e7f27f5d27c72015).
[GitHub] spark issue #21859: [SPARK-24900][SQL]Speed up sort when the dataset is smal...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21859 retest this please.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22141 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94930/ Test FAILed.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22141 Merged build finished. Test FAILed.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22141 **[Test build #94930 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94930/testReport)** for PR 22141 at commit [`473bfb5`](https://github.com/apache/spark/commit/473bfb500b07626ff42a9e5ddc167970299bde21). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #22131: [SPARK-25141][SQL][TEST] Modify tests for higher-...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22131
[GitHub] spark issue #22131: [SPARK-25141][SQL][TEST] Modify tests for higher-order f...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/22131 Thanks! I'd use this one. Merging to master.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user dmateusp commented on the issue: https://github.com/apache/spark/pull/22141
I reproduced the issue with the following code (I was a bit surprised by the behavior).
The tables:
```scala
scala> spark.sql("SELECT * FROM users").show
+---+-------+
| id|country|
+---+-------+
|  0|     10|
|  1|     20|
+---+-------+

scala> spark.sql("SELECT * FROM countries").show
+---+--------+
| id|    name|
+---+--------+
| 10|Portugal|
+---+--------+
```
Without the OR:
```scala
scala> spark.sql("SELECT * FROM users u WHERE u.country NOT IN (SELECT id from countries)").show
+---+-------+
| id|country|
+---+-------+
|  1|     20|
+---+-------+
```
With an OR and IN:
    scala> spark.sql("SELECT * FROM users u WHERE 1=0 OR u.country IN (SELECT id from countries)").show
    +---+-------+
    | id|country|
    +---+-------+
    |  0|     10|
    +---+-------+
With an OR and NOT IN:
```scala
scala> spark.sql("SELECT * FROM users u WHERE 1=0 OR u.country NOT IN (SELECT id from countries)").show
org.apache.spark.sql.AnalysisException: Null-aware predicate sub-queries cannot be used in nested conditions: ((1 = 0) || NOT country#9 IN (list#62 []));;
```
+1 to get that fixed
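As a cross-check of what the failing query would be expected to return once it is supported, the same tables and predicates can be run against SQLite from Python's standard library. SQLite stands in here as a reference for ordinary SQL semantics only; it says nothing about Spark's planner, and the `run` helper is a hypothetical convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, country INTEGER);
    CREATE TABLE countries (id INTEGER, name TEXT);
    INSERT INTO users VALUES (0, 10), (1, 20);
    INSERT INTO countries VALUES (10, 'Portugal');
""")


def run(where_clause):
    """Hypothetical helper: filter `users` with the given predicate."""
    sql = "SELECT id, country FROM users u WHERE " + where_clause + " ORDER BY id"
    return conn.execute(sql).fetchall()


plain_not_in = run("u.country NOT IN (SELECT id FROM countries)")
or_in = run("1=0 OR u.country IN (SELECT id FROM countries)")
or_not_in = run("1=0 OR u.country NOT IN (SELECT id FROM countries)")

print(plain_not_in, or_in, or_not_in)
```

Because `1=0` is always false, the OR variant should return the same row as the plain NOT IN query, which is the result SQLite gives.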
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21899 Merged build finished. Test PASSed.
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21899 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94929/ Test PASSed.
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21899 **[Test build #94929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94929/testReport)** for PR 21899 at commit [`829a333`](https://github.com/apache/spark/commit/829a333ad3dc152b90e5257cf67e2134c31e839e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22141 **[Test build #94930 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94930/testReport)** for PR 22141 at commit [`473bfb5`](https://github.com/apache/spark/commit/473bfb500b07626ff42a9e5ddc167970299bde21).
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22141 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2303/ Test PASSed.
[GitHub] spark issue #22141: [SPARK-25154][SQL] Support NOT IN sub-queries inside nes...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22141 Merged build finished. Test PASSed.
[GitHub] spark pull request #22141: [SPARK-25154] Support NOT IN sub-queries inside n...
GitHub user dilipbiswal opened a pull request: https://github.com/apache/spark/pull/22141

[SPARK-25154] Support NOT IN sub-queries inside nested OR conditions.

## What changes were proposed in this pull request?
Currently, NOT IN subqueries (null-aware predicate subqueries) are not allowed inside OR expressions. We currently catch this condition in checkAnalysis and throw an error. This PR enhances the subquery rewrite to support this type of query.

Query
```SQL
SELECT * FROM s1 WHERE a > 5 or b NOT IN (SELECT c FROM s2);
```
Optimized Plan
```SQL
a: int, b: int
Project [a#16, b#17]
+- Filter ((a#16 > 5) || NOT b#17 IN (list#13 []))
   :  +- Project [c#18]
   :     +- SubqueryAlias `default`.`s2`
   :        +- HiveTableRelation `default`.`s2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c#18, d#19]
   +- SubqueryAlias `default`.`s1`
      +- HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#16, b#17]
```
## How was this patch tested?
Added new tests in SQLQueryTestSuite, RewriteSubquerySuite and SubquerySuite.

You can merge this pull request into a Git repository by running:
    $ git pull https://github.com/dilipbiswal/spark SPARK-25154
Alternatively you can review and apply these changes as the patch at:
    https://github.com/apache/spark/pull/22141.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
    This closes #22141

commit 473bfb500b07626ff42a9e5ddc167970299bde21
Author: Dilip Biswal
Date: 2018-08-18T21:22:37Z
    [SPARK-25154] Support NOT IN sub-queries inside nested OR conditions.
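Why NOT IN counts as "null-aware" can be seen with the query from the description. The sketch below uses SQLite from Python's standard library as a stand-in for standard three-valued SQL logic; the row values are made up for illustration, and nothing here reflects how Spark rewrites the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s1 (a INTEGER, b INTEGER);
    CREATE TABLE s2 (c INTEGER, d INTEGER);
    INSERT INTO s1 VALUES (1, 10), (2, 20), (6, 10);  -- illustrative rows
    INSERT INTO s2 VALUES (10, 0);
""")

QUERY = "SELECT a, b FROM s1 WHERE a > 5 OR b NOT IN (SELECT c FROM s2) ORDER BY a"

# No NULL in s2.c: b = 20 is not in {10}, so row (2, 20) passes via NOT IN.
without_null = conn.execute(QUERY).fetchall()

# One NULL in s2.c: an unmatched NOT IN now evaluates to NULL (unknown), not TRUE,
# so only rows that satisfy a > 5 survive the filter.
conn.execute("INSERT INTO s2 VALUES (NULL, 0)")
with_null = conn.execute(QUERY).fetchall()

print(without_null)
print(with_null)
```

A single NULL on the subquery side changes the outcome for every unmatched row, so a rewrite of NOT IN inside an OR has to preserve that unknown result rather than treat "no match" as true.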
[GitHub] spark issue #21669: [SPARK-23257][K8S][WIP] Kerberos Support for Spark on K8...
Github user witten commented on the issue: https://github.com/apache/spark/pull/21669 I see that this branch currently has merge conflicts, but any idea on when this might land? This is the last feature we're waiting for in order to switch from the abandoned [apache-spark-on-k8s fork](https://github.com/apache-spark-on-k8s/spark) to Spark actual! Thanks.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94923/ Test FAILed.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test FAILed.
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94923 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94923/testReport)** for PR 22112 at commit [`739f210`](https://github.com/apache/spark/commit/739f210eb8f70499b56fb75fe573099fcad63541). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20838 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94927/ Test FAILed.
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20838 Merged build finished. Test FAILed.
[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22135 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94926/ Test PASSed.
[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22135 Merged build finished. Test PASSed.
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20838 **[Test build #94927 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94927/testReport)** for PR 20838 at commit [`73a4fd2`](https://github.com/apache/spark/commit/73a4fd26eed19256da80d27b07b2e5f4d85eb9f6). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22135 **[Test build #94926 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94926/testReport)** for PR 22135 at commit [`5731825`](https://github.com/apache/spark/commit/5731825c9171cfe20591a0e7a34d927402881470). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22078 Merged build finished. Test PASSed.
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22078 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94925/ Test PASSed.
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22078 **[Test build #94925 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94925/testReport)** for PR 22078 at commit [`f1574d5`](https://github.com/apache/spark/commit/f1574d5a3c4bfbd1a202154e69cff0dc81283e35). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22078 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94924/ Test PASSed.
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22078 Merged build finished. Test PASSed.
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22078 **[Test build #94924 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94924/testReport)** for PR 22078 at commit [`bef1d17`](https://github.com/apache/spark/commit/bef1d17ced8865c3e95eaa451424e153b4b7214a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21899 **[Test build #94929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94929/testReport)** for PR 21899 at commit [`829a333`](https://github.com/apache/spark/commit/829a333ad3dc152b90e5257cf67e2134c31e839e).
[GitHub] spark issue #21899: [SPARK-24912][SQL] Don't obscure source of OOM during br...
Github user bersprockets commented on the issue: https://github.com/apache/spark/pull/21899 retest this please
[GitHub] spark pull request #22123: [SPARK-25134][SQL] Csv column pruning with checki...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/22123#discussion_r211081732
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala ---
@@ -1603,6 +1603,25 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils with Te
       .exists(msg => msg.getRenderedMessage.contains("CSV header does not conform to the schema")))
   }

+  test("SPARK-25134: check header on parsing of dataset with projection and column pruning") {
+    withSQLConf(SQLConf.CSV_PARSER_COLUMN_PRUNING.key -> "true") {
+      withTempPath { path =>
+        val dir = path.getAbsolutePath
+        Seq(("a", "b")).toDF("columnA", "columnB").write
+          .format("csv")
+          .option("header", true)
+          .save(dir)
+        checkAnswer(spark.read
+          .format("csv")
+          .option("header", true)
+          .option("enforceSchema", false)
+          .load(dir)
+          .select("columnA"),
--- End diff --
Could you check the corner case when the required schema is empty? For example, `.option("enforceSchema", false)` + `count()`.
[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/22123 May I ask you to also check the `multiLine` mode, since we use different methods of the uniVocity parser? When `multiLine` is disabled, the `parseLine` method is used, but in `multiLine` mode a different code path is taken: https://github.com/apache/spark/blob/a8a1ac01c4732f8a738b973c8486514cd88bf99b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/UnivocityParser.scala#L303-L307
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Merged build finished. Test PASSed.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94928/ Test PASSed.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22121 **[Test build #94928 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94928/testReport)** for PR 22121 at commit [`8b191bd`](https://github.com/apache/spark/commit/8b191bd37af24ff27b6416ee6af4d885f1c94852). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21909
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Merged build finished. Test PASSed.
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22121 **[Test build #94928 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94928/testReport)** for PR 22121 at commit [`8b191bd`](https://github.com/apache/spark/commit/8b191bd37af24ff27b6416ee6af4d885f1c94852).
[GitHub] spark issue #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22121 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2302/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22121: [SPARK-25133][SQL][Doc]Avro data source guide
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/22121#discussion_r211081684
--- Diff: docs/avro-data-source-guide.md ---
@@ -0,0 +1,260 @@
+---
+layout: global
+title: Apache Avro Data Source Guide
+---
+
+* This will become a table of contents (this text will be scraped).
+{:toc}
+
+Since Spark 2.4 release, [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) provides built-in support for reading and writing Apache Avro data.
+
+## Deploying
+The `spark-avro` module is external and not included in `spark-submit` or `spark-shell` by default.
+
+As with any Spark applications, `spark-submit` is used to launch your application. `spark-avro_{{site.SCALA_BINARY_VERSION}}`
+and its dependencies can be directly added to `spark-submit` using `--packages`, such as,
+
+    ./bin/spark-submit --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
+
+For experimenting on `spark-shell`, you can also use `--packages` to add `org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly,
+
+    ./bin/spark-shell --packages org.apache.spark:spark-avro_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
+
+See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies.
+
+## Load/Save Functions
+
+Since `spark-avro` module is external, there is not such API as `.avro` in
+`DataFrameReader` or `DataFrameWriter`.
+To load/save data in Avro format, you need to specify the data source option `format` as short name `avro` or full name `org.apache.spark.sql.avro`.
+
+
+{% highlight scala %}
+
+val usersDF = spark.read.format("avro").load("examples/src/main/resources/users.avro")
+usersDF.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight java %}
+
+Dataset<Row> usersDF = spark.read().format("avro").load("examples/src/main/resources/users.avro");
+usersDF.select("name", "favorite_color").write().format("avro").save("namesAndFavColors.avro");
+
+{% endhighlight %}
+
+
+{% highlight python %}
+
+df = spark.read.format("avro").load("examples/src/main/resources/users.avro")
+df.select("name", "favorite_color").write.format("avro").save("namesAndFavColors.avro")
+
+{% endhighlight %}
+
+
+{% highlight r %}
+
+df <- read.df("examples/src/main/resources/users.avro", "avro")
+write.df(select(df, "name", "favorite_color"), "namesAndFavColors.avro", "avro")
+
+{% endhighlight %}
+
+
+
+## Data Source Options
+
+Data source options of Avro can be set using the `.option` method on `DataFrameReader` or `DataFrameWriter`.
+
+| Property Name | Default | Meaning | Scope |
+|---|---|---|---|
+| avroSchema | None | Optional Avro schema provided by an user in JSON format. | read and write |
+| recordName | topLevelRecord | Top level record name in write result, which is required in Avro spec. | write |
+| recordNamespace | "" | Record namespace in write result. | write |
+| ignoreExtension | true | The option controls ignoring of files without .avro extensions in read. If the option is enabled, all files (with and without .avro extension) are loaded. | read |
+| compression | snappy | The compression option allows to specify a compression codec used in write. Currently supported codecs are uncompressed, snappy, deflate, bzip2 and xz. If the option is not set, the configuration spark.sql.avro.compression.codec config is taken into account. | write |
+
+## Supported types for Avro -> Spark SQL conversion
+Currently Spark supports reading all [primitive types](https://avro.apache.org/docs/1.8.2/spec.html#schema_primitive) and [complex types](https://avro.apache.org/docs/1.8.2/spec.html#schema_complex) of Avro.
+
+| Avro type | Spark SQL type |
+|---|---|
+| boolean | BooleanType |
+| int | IntegerType |
+| long | LongType |
+| float | FloatType |
+| double | DoubleType |
+| string | StringType |
+| enum | StringType |
+| fixed | BinaryType |
+| bytes | BinaryType |
+| record | StructType |
+| array | ArrayType |
+| map | MapType |
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21909 LGTM. Thanks for patiently addressing all the comments! Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21909 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21909 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94922/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21909 **[Test build #94922 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94922/testReport)** for PR 21909 at commit [`050c8ce`](https://github.com/apache/spark/commit/050c8ce73f35791c4adb1a4d11f120288865cae8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22123: [SPARK-25134][SQL] Csv column pruning with checking of h...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/22123 cc @MaxGekk --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21087: [SPARK-23997][SQL] Configurable maximum number of...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21087#discussion_r211080067 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala --- @@ -164,9 +165,12 @@ case class BucketSpec( numBuckets: Int, bucketColumnNames: Seq[String], sortColumnNames: Seq[String]) { - if (numBuckets <= 0 || numBuckets >= 10) { + def conf: SQLConf = SQLConf.get + + if (numBuckets <= 0 || numBuckets > conf.bucketingMaxBuckets) { --- End diff -- Since the condition is changed from `>=` to `>`, there is an inconsistency between the condition and the error message. If this condition is true, the message should be like `... but less than or equal to bucketing.maxBuckets ...`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
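The point of the review comment above, that a bound check and its error message should describe the same range, can be sketched in a few lines. This is an illustrative Python sketch, not Spark's `BucketSpec` code: the names `validate_num_buckets` and `BucketSpecError` are hypothetical, and the message wording is an assumption modelled on the fragment quoted in the comment.

```python
class BucketSpecError(ValueError):
    """Raised when the bucket count falls outside the allowed range (hypothetical name)."""


def validate_num_buckets(num_buckets: int, max_buckets: int) -> int:
    """Accept 0 < num_buckets <= max_buckets; otherwise raise with a matching message."""
    # The rejection test uses `>`, so the upper bound itself is valid and the
    # message must say "less than or equal to", not "less than".
    if num_buckets <= 0 or num_buckets > max_buckets:
        raise BucketSpecError(
            "Number of buckets should be greater than 0 but less than or equal to "
            f"{max_buckets}. Got `{num_buckets}`"
        )
    return num_buckets
```

With this check, a count equal to the configured maximum is accepted, while zero or maximum-plus-one is rejected with a message that states the inclusive bound.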
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/21860 cc @hvanhovell --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20838: [SPARK-23698] Resolve undefined names in Python 3
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20838 **[Test build #94927 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94927/testReport)** for PR 20838 at commit [`73a4fd2`](https://github.com/apache/spark/commit/73a4fd26eed19256da80d27b07b2e5f4d85eb9f6). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22135 **[Test build #94926 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94926/testReport)** for PR 22135 at commit [`5731825`](https://github.com/apache/spark/commit/5731825c9171cfe20591a0e7a34d927402881470). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22135 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22135 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2301/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22124 Comes from: [[SPARK-22834][SQL] Make insertion commands have real children to fix UI issues](https://github.com/apache/spark/pull/20020). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22078 **[Test build #94925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94925/testReport)** for PR 22078 at commit [`f1574d5`](https://github.com/apache/spark/commit/f1574d5a3c4bfbd1a202154e69cff0dc81283e35). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22078 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2300/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22078 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22124 > But it is inconsistent now. Can you point out in the codebase where the inconsistency comes from? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 I've removed the concept of "order sensitive partitioner" and come up with a better abstraction. Please take a look at the updated PR description, thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22078 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22078 **[Test build #94924 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94924/testReport)** for PR 22078 at commit [`bef1d17`](https://github.com/apache/spark/commit/bef1d17ced8865c3e95eaa451424e153b4b7214a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22078: [SPARK-25085][SQL] Insert overwrite a non-partitioned ta...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22078 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2299/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/22112 @tgravescs The `FileCommitProtocol` is an internal API, and our current implementation does store task-level data temporarily in a staging directory (see `HadoopMapReduceCommitProtocol`). That said, we can fix the `FileCommitProtocol` to be able to roll back a committed task, as long as the job is not committed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
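The idea in the comment above (task output sits in a staging directory, so a committed task can still be undone until the job itself commits) can be illustrated with a toy model. This is a minimal Python sketch under stated assumptions, not Spark's `FileCommitProtocol` or `HadoopMapReduceCommitProtocol`: the class `StagingCommitter`, its method names, and the `_staging` directory name are all hypothetical.

```python
import os
import shutil
import tempfile


class StagingCommitter:
    """Toy model: task output stays under a staging directory until the job commits."""

    def __init__(self, output_dir: str) -> None:
        self.output_dir = output_dir
        self.staging_dir = os.path.join(output_dir, "_staging")  # assumed layout
        os.makedirs(self.staging_dir, exist_ok=True)
        self.job_committed = False

    def commit_task(self, task_id: str, filename: str, data: str) -> None:
        # A committed task is only visible inside its own staging subdirectory.
        task_dir = os.path.join(self.staging_dir, task_id)
        os.makedirs(task_dir, exist_ok=True)
        with open(os.path.join(task_dir, filename), "w") as f:
            f.write(data)

    def rollback_task(self, task_id: str) -> None:
        # Rolling back is just deleting staged files, but only before job commit.
        if self.job_committed:
            raise RuntimeError("job already committed; task output is final")
        shutil.rmtree(os.path.join(self.staging_dir, task_id), ignore_errors=True)

    def commit_job(self) -> None:
        # Promote every surviving task's files to the final output directory.
        for task_id in sorted(os.listdir(self.staging_dir)):
            task_dir = os.path.join(self.staging_dir, task_id)
            for name in os.listdir(task_dir):
                shutil.move(os.path.join(task_dir, name),
                            os.path.join(self.output_dir, name))
        shutil.rmtree(self.staging_dir)
        self.job_committed = True


out = tempfile.mkdtemp()
committer = StagingCommitter(out)
committer.commit_task("task-0", "part-0", "a")
committer.commit_task("task-1", "part-1", "b")
committer.rollback_task("task-0")  # allowed: the job has not committed yet
committer.commit_job()
print(sorted(os.listdir(out)))
```

The sketch leaves out everything that makes real committers hard (retries, speculative attempts, non-atomic renames on object stores); it only shows why rollback is cheap while output is still staged.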
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2298/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22112 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22112 **[Test build #94923 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94923/testReport)** for PR 22112 at commit [`739f210`](https://github.com/apache/spark/commit/739f210eb8f70499b56fb75fe573099fcad63541). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22124 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/94921/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22124 **[Test build #94921 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94921/testReport)** for PR 22124 at commit [`9b16ff0`](https://github.com/apache/spark/commit/9b16ff0f0581366f587db735658e3110237ceef0). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21909: [SPARK-24959][SQL] Speed up count() for JSON and CSV
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21909 **[Test build #94922 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94922/testReport)** for PR 21909 at commit [`050c8ce`](https://github.com/apache/spark/commit/050c8ce73f35791c4adb1a4d11f120288865cae8). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r211075385 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonDataSource.scala --- @@ -223,7 +224,8 @@ object MultiLineJsonDataSource extends JsonDataSource { input => parser.parse[InputStream](input, streamParser, partitionedFileString), parser.options.parseMode, schema, - parser.options.columnNameOfCorruptRecord) + parser.options.columnNameOfCorruptRecord, + optimizeEmptySchema = false) --- End diff -- renamed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21909: [SPARK-24959][SQL] Speed up count() for JSON and ...
Github user MaxGekk commented on a diff in the pull request: https://github.com/apache/spark/pull/21909#discussion_r211075384 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1492,6 +1492,15 @@ object SQLConf { "This usually speeds up commands that need to list many directories.") .booleanConf .createWithDefault(true) + + val BYPASS_PARSER_FOR_EMPTY_SCHEMA = +buildConf("spark.sql.legacy.bypassParserForEmptySchema") --- End diff -- It seems we don't need it anymore --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22112: [SPARK-23243][Core] Fix RDD.repartition() data correctne...
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112 I can't envision how that would work. You can't change how output committers work. You would have to either not store anything until all pass or store it temporarily; in my opinion, neither is good. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22135: [SPARK-25093][SQL] Avoid recompiling regexp for comments...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/22135 Thanks for the comment, @kiszk. I am doing it! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22124

The root project should be consistent with the schema of the target table. But now it is inconsistent.

**Before this PR**:
[dataColumns](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L84): `col1#8L,col2#9L`
[plan](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L67):
```
*(1) Project [col1#8L, col2#9L]
+- *(1) Filter (isnotnull(col1#8L) && (col1#8L > -20))
   +- *(1) FileScan parquet default.table1[col1#8L,col2#9L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], ReadSchema: struct
```

**After this PR**:
[dataColumns](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L84): `COL1#14L,COL2#15L`
[plan](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L67):
```
*(1) Project [col1#8L AS COL1#14L, col2#9L AS COL2#15L]
+- *(1) Filter (isnotnull(col1#8L) && (col1#8L > -20))
   +- *(1) FileScan parquet default.table1[col1#8L,col2#9L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], ReadSchema: struct
```

Before [SPARK-22834](https://issues.apache.org/jira/browse/SPARK-22834)
[dataColumns](https://github.com/apache/spark/blob/ec122209fb35a65637df42eded64b0203e105aae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L124): `COL1#19L,COL2#20L`
[queryExecution](https://github.com/apache/spark/blob/ec122209fb35a65637df42eded64b0203e105aae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L104):
```
== Parsed Logical Plan ==
Project [COL1#19L, COL2#20L]
+- SubqueryAlias view1
   +- View (`default`.`view1`, [col1#19L,col2#20L])
      +- Project [col1#15L, col2#16L]
         +- Filter (col1#15L > cast(-20 as bigint))
            +- SubqueryAlias table1
               +- Relation[col1#15L,col2#16L] parquet

== Analyzed Logical Plan ==
COL1: bigint, COL2: bigint
Project [COL1#19L, COL2#20L]
+- SubqueryAlias view1
   +- View (`default`.`view1`, [col1#19L,col2#20L])
      +- Project [cast(col1#15L as bigint) AS col1#19L, cast(col2#16L as bigint) AS col2#20L]
         +- Project [col1#15L, col2#16L]
            +- Filter (col1#15L > cast(-20 as bigint))
               +- SubqueryAlias table1
                  +- Relation[col1#15L,col2#16L] parquet

== Optimized Logical Plan ==
Filter (isnotnull(col1#15L) && (col1#15L > -20))
+- Relation[col1#15L,col2#16L] parquet

== Physical Plan ==
*Project [col1#15L, col2#16L]
+- *Filter (isnotnull(col1#15L) && (col1#15L > -20))
   +- *FileScan parquet default.table1[col1#15L,col2#16L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], ReadSchema: struct
```
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
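To make the mismatch described in the comment above concrete, here is a minimal Python sketch (not Spark code) that compares the column names a query produces with the names in the target table's schema, position by position. The function name `find_name_mismatches` and the plain-string representation of columns are assumptions made for illustration only.

```python
from typing import List, Tuple


def find_name_mismatches(
    output_names: List[str], table_names: List[str]
) -> List[Tuple[int, str, str]]:
    """Return (position, output_name, table_name) for every column whose name
    differs from the table schema under a case-sensitive comparison."""
    if len(output_names) != len(table_names):
        raise ValueError("query output and table schema have different widths")
    return [
        (i, out, tbl)
        for i, (out, tbl) in enumerate(zip(output_names, table_names))
        if out != tbl
    ]


# Names taken from the quoted plans: the projection exposes COL1/COL2,
# while the Parquet relation has col1/col2.
print(find_name_mismatches(["COL1", "COL2"], ["col1", "col2"]))
print(find_name_mismatches(["col1", "col2"], ["col1", "col2"]))
```

A non-empty result corresponds to the "inconsistent" case in the comment; an empty list corresponds to the projection matching the table schema exactly.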
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/22124 **[Test build #94921 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/94921/testReport)** for PR 22124 at commit [`9b16ff0`](https://github.com/apache/spark/commit/9b16ff0f0581366f587db735658e3110237ceef0). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22124 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/22124 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2297/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...
Github user heary-cao commented on the issue: https://github.com/apache/spark/pull/21860 cc @cloud-fan @maropu --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21819: [SPARK-24863][SS] Report Kafka offset lag as a cu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21819 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22098: [SPARK-24886][INFRA] Fix the testing script to in...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22098 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21819: [SPARK-24863][SS] Report Kafka offset lag as a custom me...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21819 Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22098: [SPARK-24886][INFRA] Fix the testing script to increase ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/22098 Let me just push this in. Merged to master. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22132: [SPARK-25142][PYSPARK] Add error messages when Py...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22132 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org