[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649269967 Let me see if I can get a more generalized solution out today. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] venkata91 commented on pull request #28287: [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due to spark's bl
venkata91 commented on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-649263104 No worries @tgravescs, thanks for taking a look. For some reason these checks keep failing, but the failures don't look related to my changes. Some cache issue, probably?
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445331787

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

## @@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment: Do you have a sample query in mind that produces `Project->Filter->Sample`? I've been trying to come up with a query that generates this plan.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins removed a comment on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649245548
[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649245548
[GitHub] [spark] SparkQA commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
SparkQA commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649244202

**[Test build #124508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124508/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
SparkQA removed a comment on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649159672 **[Test build #124508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124508/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).
[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
karuppayya commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649241132 @maropu I have fixed the tests with the flag enabled. Please take a look. Should I go ahead and change the default value of the config back to `false`?
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
viirya commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445321841

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

## @@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment: I see, it is due to the predicate pushdown rule. I think we need a general solution, as @maropu said.
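The `Project->Filter->Window` shape under discussion arises when a query filters on the output of a window function: the predicate references the window column, so it cannot be pushed below the Window node. As an illustration only (SQLite here rather than Spark, with made-up table and column names), a minimal sketch of such a query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, ts INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("b", 5)])

# The filter on rn depends on the window function's result, so it can only
# be evaluated above the window operator -- hence Project->Filter->Window.
rows = conn.execute("""
    SELECT user, ts FROM (
        SELECT user, ts,
               row_number() OVER (PARTITION BY user ORDER BY ts DESC) AS rn
        FROM events
    ) WHERE rn = 1
""").fetchall()
print(sorted(rows))  # one row per user: the latest event
```

Schema pruning for this plan shape means the nested columns read from the source can still be narrowed to those the outer `SELECT` and the filter actually use.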
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
AmplabJenkins removed a comment on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649234400
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
AmplabJenkins commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649234400
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
SparkQA removed a comment on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649153764 **[Test build #124507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124507/testReport)** for PR 28898 at commit [`b1cad9a`](https://github.com/apache/spark/commit/b1cad9ad759c3e1d2ef9efd0f9390c6924e412df).
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
SparkQA commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649233655

**[Test build #124507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124507/testReport)** for PR 28898 at commit [`b1cad9a`](https://github.com/apache/spark/commit/b1cad9ad759c3e1d2ef9efd0f9390c6924e412df).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] maropu commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445319707

## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt

## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m
 UnsafeExternalSorter                      11    11    1    14.7    68.0    1.0X
 ExternalAppendOnlyUnsafeRowArray           9    10    1    17.1    58.5    1.2X
-
+OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic
+Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
+Spilling SpillReader with 16000 rows:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+
+UnsafeSorterSpillReader_bufferSize1024    411    426    13    0.6    1607.2    1.0X

Review comment: oh, looks nice.
[GitHub] [spark] TJX2014 edited a comment on pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
TJX2014 edited a comment on pull request #28926: URL: https://github.com/apache/spark/pull/28926#issuecomment-649231790 Follow: https://github.com/apache/spark/pull/28856 cc @cloud-fan @MaxGekk
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
AmplabJenkins removed a comment on pull request #28926: URL: https://github.com/apache/spark/pull/28926#issuecomment-649230613 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
AmplabJenkins commented on pull request #28926: URL: https://github.com/apache/spark/pull/28926#issuecomment-649230871 Can one of the admins verify this patch?
[GitHub] [spark] TJX2014 commented on a change in pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
TJX2014 commented on a change in pull request #28926: URL: https://github.com/apache/spark/pull/28926#discussion_r445317708

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

## @@ -2612,6 +2614,9 @@ object Sequence {
     val stepDays = step.days
     val stepMicros = step.microseconds

+    require(scale != MICROS_PER_DAY || stepMonths != 0 || stepDays != 0,

Review comment: Add the scale constraint here for `DateType`.
[GitHub] [spark] TJX2014 opened a new pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
TJX2014 opened a new pull request #28926: URL: https://github.com/apache/spark/pull/28926

### What changes were proposed in this pull request?
Add a unit test. Fix a bug in `org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl`. Add a `sequence step must be a day interval if start and end values are dates` constraint. Follow-up of https://github.com/apache/spark/pull/28856.

### Why are the changes needed?
Spark's `sequence` doesn't handle date increments that cross DST.

### Does this PR introduce _any_ user-facing change?
Before the PR, users could get an incorrect result. With `spark.sql.session.timeZone` set to `Asia/Shanghai`, `America/Chicago`, and `GMT` respectively, `sql("select sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 month)").show(false)` returned `[2011-03-01, 2011-04-01, 2011-05-01]`, **`[2011-03-01, 2011-03-28, 2011-04-28]`**, and `[2011-03-01, 2011-04-01, 2011-05-01]`. After the PR, the sequence date conversion is corrected: all three time zones yield `[2011-03-01, 2011-04-01, 2011-05-01]`.

### How was this patch tested?
Unit test.
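The drift described above can be reproduced outside Spark. A minimal Python sketch (not Spark's implementation; it assumes stdlib `zoneinfo` tz data is available) contrasting fixed-duration stepping with calendar-month stepping across the 2011-03-13 US DST transition:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

chicago = ZoneInfo("America/Chicago")
utc = ZoneInfo("UTC")
start = datetime(2011, 3, 1, tzinfo=chicago)

# Fixed-duration stepping: approximate "one month" as 30 days of absolute
# time. DST starts on 2011-03-13 in America/Chicago, so an hour vanishes
# and the result drifts off the calendar grid.
fixed = (start.astimezone(utc) + timedelta(days=30)).astimezone(chicago)

# Calendar stepping: increment the month field, keeping the local date.
calendar = start.replace(month=4)

print(fixed.date())     # 2011-03-31, not a calendar month later
print(calendar.date())  # 2011-04-01
```

This is why a date-typed `sequence` needs calendar arithmetic (or the day-interval constraint above) rather than stepping by a fixed number of microseconds.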
[GitHub] [spark] AmplabJenkins commented on pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
AmplabJenkins commented on pull request #28926: URL: https://github.com/apache/spark/pull/28926#issuecomment-649230613 Can one of the admins verify this patch?
[GitHub] [spark] gatorsmile commented on pull request #28902: [SPARK-31801][API][SHUFFLE][TESTS] Tests for registering map output metadata
gatorsmile commented on pull request #28902: URL: https://github.com/apache/spark/pull/28902#issuecomment-649226867 cc @Ngone51 @jiangxb1987
[GitHub] [spark] gatorsmile commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
gatorsmile commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649226614 retest this please
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649224393

> btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

Just updated the PR.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
AmplabJenkins removed a comment on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649221849
[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
AmplabJenkins commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649221849
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445309526

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

## @@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment: That won't work because it seems to cause an infinite loop in the optimizer; it gives me errors about running out of max iterations.
[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
SparkQA commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649221158

**[Test build #124505 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124505/testReport)** for PR 28804 at commit [`afc2903`](https://github.com/apache/spark/commit/afc2903e4a327d6caef518e6d3f0dc431424ac7c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
SparkQA removed a comment on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649138531 **[Test build #124505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124505/testReport)** for PR 28804 at commit [`afc2903`](https://github.com/apache/spark/commit/afc2903e4a327d6caef518e6d3f0dc431424ac7c).
[GitHub] [spark] siknezevic commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445307223

## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt

## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m
 UnsafeExternalSorter                      11    11    1    14.7    68.0    1.0X
 ExternalAppendOnlyUnsafeRowArray           9    10    1    17.1    58.5    1.2X
-
+OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic
+Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
+Spilling SpillReader with 16000 rows:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+
+UnsafeSorterSpillReader_bufferSize1024    411    426    13    0.6    1607.2    1.0X

Review comment: There are three spill files, which results in three I/Os for each created iterator, because it has to read the number of rows from each spilled file. That is why we see a more than 2000X performance difference.
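The buffer-size effect this benchmark measures can be sketched generically: a larger read buffer amortizes raw I/O calls when a reader consumes many small records. A minimal Python illustration (not the Spark code; `CountingRaw` is a made-up helper standing in for a spill file):

```python
import io

class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts how many raw read calls are issued."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.reads = 0
    def readable(self):
        return True
    def readinto(self, b):
        self.reads += 1
        return self._buf.readinto(b)

data = bytes(1024 * 1024)  # 1 MiB standing in for spilled rows

def raw_read_calls(buffer_size):
    raw = CountingRaw(data)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    while reader.read(64):  # consume the stream in 64-byte "records"
        pass
    return raw.reads

small = raw_read_calls(1024)         # roughly one raw read per KiB consumed
large = raw_read_calls(1024 * 1024)  # the whole stream in about one raw read
print(small, large)
```

With a 1 KiB buffer the raw layer is hit about a thousand times; with a 1 MiB buffer only a couple of times, which is the same trade-off a spill reader's buffer size controls.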
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
AmplabJenkins removed a comment on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-649217406
[GitHub] [spark] siknezevic commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445306120

## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt

## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m
 UnsafeExternalSorter                      11    11    1    14.7    68.0    1.0X
 ExternalAppendOnlyUnsafeRowArray           9    10    1    17.1    58.5    1.2X
-
+OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic
+Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
+Spilling SpillReader with 16000 rows:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+
+UnsafeSorterSpillReader_bufferSize1024    411    426    13    0.6    1607.2    1.0X

Review comment: Here it is:

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic
Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
Spilling SpillReader with 16000 rows:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
UnsafeSorterSpillReader    932683    943898    NaN    0.0    3643292.4    1.0X
[GitHub] [spark] AmplabJenkins commented on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
AmplabJenkins commented on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-649217406
[GitHub] [spark] SparkQA removed a comment on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
SparkQA removed a comment on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-649140417 **[Test build #124506 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124506/testReport)** for PR 28761 at commit [`bd691ed`](https://github.com/apache/spark/commit/bd691ed16eade2e63c0fdd8d2bbd88282f6c4662).
[GitHub] [spark] SparkQA commented on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
SparkQA commented on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-649216688 **[Test build #124506 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124506/testReport)** for PR 28761 at commit [`bd691ed`](https://github.com/apache/spark/commit/bd691ed16eade2e63c0fdd8d2bbd88282f6c4662). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
AmplabJenkins commented on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649215908
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
AmplabJenkins removed a comment on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649215908
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
HyukjinKwon commented on a change in pull request #28896: URL: https://github.com/apache/spark/pull/28896#discussion_r444833442 ## File path: sql/core/src/test/resources/test-data/mixed-types1.csv ## @@ -0,0 +1,4 @@ +col_mixed_types +2012 +1997 +True Review comment: Let's not bother creating a new data file here. You can write the test code as below instead.
```scala
withTempPath { path =>
  Seq("col_mixed_types", "2012", "1997", "True").toDS.write.text(path.getCanonicalPath)
  val df = spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(path.getCanonicalPath)
  assert(df.schema.last == StructField("col_mixed_types", StringType, true))
}
```
[GitHub] [spark] SparkQA commented on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
SparkQA commented on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649215558 **[Test build #124510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124510/testReport)** for PR 28896 at commit [`4629bb5`](https://github.com/apache/spark/commit/4629bb5556564b2ebcedca88157f45aa95982f00).
[GitHub] [spark] siknezevic commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445304792 ## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt ## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m UnsafeExternalSorter 11 11 1 14.7 68.0 1.0X ExternalAppendOnlyUnsafeRowArray 9 10 1 17.1 58.5 1.2X - +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic Review comment: Yes, I did a comparison of the existing micro-benchmark results with the patch applied and without it. I do not see a regression in performance for the existing micro-benchmarks. I will push micro-benchmark results for JDK 11 soon. Thank you for the help 😊
[GitHub] [spark] HyukjinKwon removed a comment on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
HyukjinKwon removed a comment on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649214971 retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
AmplabJenkins removed a comment on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-647758404 Can one of the admins verify this patch?
[GitHub] [spark] HyukjinKwon commented on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
HyukjinKwon commented on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649215341 ok to test
[GitHub] [spark] HyukjinKwon commented on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
HyukjinKwon commented on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649214971 retest this please
[GitHub] [spark] HyukjinKwon commented on pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization
HyukjinKwon commented on pull request #28919: URL: https://github.com/apache/spark/pull/28919#issuecomment-649213796 @cloud-fan, shall we add a short comment https://github.com/apache/spark/pull/28919#discussion_r445274348?
[GitHub] [spark] siknezevic commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445299048 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArrayBenchmark.scala ## @@ -182,6 +182,47 @@ object ExternalAppendOnlyUnsafeRowArrayBenchmark extends BenchmarkBase { } } + def testAgainstUnsafeSorterSpillReader( + numSpillThreshold: Int, + numRows: Int, + numIterators: Int, + iterations: Int): Unit = { +val rows = testRows(numRows) +val benchmark = new Benchmark(s"Spilling SpillReader with $numRows rows", iterations * numRows, + output = output) + +benchmark.addCase("UnsafeSorterSpillReader_bufferSize1024") { _: Int => + val array = UnsafeExternalSorter.create( +TaskContext.get().taskMemoryManager(), +SparkEnv.get.blockManager, +SparkEnv.get.serializerManager, +TaskContext.get(), +null, +null, +1024, +SparkEnv.get.memoryManager.pageSizeBytes, +numSpillThreshold, +false) + + rows.foreach(x => +array.insertRecord( + x.getBaseObject, + x.getBaseOffset, + x.getSizeInBytes, + 0, + false)) + + for (_ <- 0L until numIterators) { Review comment: During execution of a sort-merge join (Left Semi Join), for each left-side row the “right matches” are found and stored in an ExternalAppendOnlyUnsafeRowArray object. That object is created when the first row on the left side of the join is processed, and then reused if the next left-side row is the same as the previous one. In the case of Queries 14a/14b there are millions of “right matches” rows, which cannot fit into memory. To run this query spilling is enabled, and the “right matches” data is moved from ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter and then spilled onto disk. To perform the join operation on a left-side row, you have to create an iterator on top of the “right matches” rows.
Creating an iterator on top of the “right matches” is repeated for every processed row on the left side of the join. When millions of left-side rows are processed, an iterator over the spilled “right matches” rows is created each time, i.e. millions of times. The current Spark implementation creates the iterator eagerly and performs I/O, because it reads the number of rows stored in the spill files, even though the iterator is never actually advanced during the join: it is created, never used, and then discarded with each processed join row. This results in millions of I/Os, and one I/O takes 2 or 3 milliseconds. Hence this PR, which creates a lazy iterator (a "lazy" constructor for UnsafeSorterSpillReader), so no I/O is done until iteration starts. My micro-benchmark likewise simulates the creation of iterators on top of spill files that contain “right matches”. Sorry for the long explanation; I'm not sure I can make it simpler, but I hope it clarifies why I created the micro-benchmark this way.
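The lazy-iterator idea described above can be sketched in plain Scala (hypothetical names only; this is not the actual `UnsafeSorterSpillReader` code): wrapping the expensive reader construction in a `lazy val` means an iterator that is created and then discarded without being consumed never touches the disk.

```scala
// Sketch: defer the expensive spill-file open until the iterator is
// actually consumed, so create-and-discard costs no I/O.
class LazySpillIterator[T](openReader: () => Iterator[T]) extends Iterator[T] {
  // The underlying reader (and its file I/O) is created on first access only.
  private lazy val underlying: Iterator[T] = openReader()
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = underlying.next()
}

var opens = 0
// Stands in for opening a spill file and reading its row count.
def open(): Iterator[Int] = { opens += 1; Iterator(1, 2, 3) }

val discarded = new LazySpillIterator(open) // created but never consumed: no "I/O"
val used = new LazySpillIterator(open)
val sum = used.sum                          // first access triggers the open
// opens == 1, sum == 6
```

The same shape applies per left-side join row: eagerly constructed readers pay the open cost every time, while the lazy wrapper pays it only for iterators that are actually advanced.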
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
viirya commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445294440 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -113,6 +113,11 @@ object NestedColumnAliasing { case _: Sample => true case _: RepartitionByExpression => true case _: Join => true +case x: Filter => x.child match { + case _: Window => true Review comment: Then just add `case _: Filter => true`, if you want to let Project be pushed through `Filter`?
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
maropu commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445291555 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -113,6 +113,11 @@ object NestedColumnAliasing { case _: Sample => true case _: RepartitionByExpression => true case _: Join => true +case x: Filter => x.child match { + case _: Window => true Review comment: But it seems `Project->Filter->Sample` has the same issue. It looks like we need a more general solution for that.
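The generalization under discussion can be sketched with a toy operator type (hypothetical; Spark's real `LogicalPlan` and `NestedColumnAliasing` differ): instead of special-casing a `Filter` whose child is a `Window`, mark `Filter` itself as push-through-able, which then covers chains like `Project->Filter->Sample` as well.

```scala
// Toy operator tree standing in for Catalyst's LogicalPlan (hypothetical).
sealed trait Plan
case object Scan extends Plan
case class Window(child: Plan) extends Plan
case class Sample(child: Plan) extends Plan
case class Filter(child: Plan) extends Plan

// Special-cased rule from the diff: only Filter-over-Window qualifies.
def specialCased(p: Plan): Boolean = p match {
  case Filter(_: Window) => true
  case _ => false
}

// Generalized rule: Filter qualifies regardless of its child.
def generalized(p: Plan): Boolean = p match {
  case _: Window | _: Sample | _: Filter => true
  case _ => false
}

val overSample = Filter(Sample(Scan))
val special = specialCased(overSample) // not covered by the narrow rule
val general = generalized(overSample)  // covered by the general rule
```

This is only an illustration of the two matching strategies being compared in the thread, not the eventual fix.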
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
maropu commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649199639 btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
maropu commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445291555 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -113,6 +113,11 @@ object NestedColumnAliasing { case _: Sample => true case _: RepartitionByExpression => true case _: Join => true +case x: Filter => x.child match { + case _: Window => true Review comment: But, `Project->Filter->Sample` has the same issue. So, it seems to be a different issue.
[GitHub] [spark] Hellsen83 commented on pull request #23877: [SPARK-26449][PYTHON] Add transform method to DataFrame API
Hellsen83 commented on pull request #23877: URL: https://github.com/apache/spark/pull/23877#issuecomment-649198577 Hello @MrPowers , you are right, this is in fact motivated by your excellent blog post - thank you so much for that! From my experience - i.e. bringing this style of writing PySpark transformations into a heterogeneous group of roughly 15 devs/data scientists - the following was used most frequently, and people new to the game were able to pick it up quickly:
```
def my_logical_name(arg1: type1, arg2: type2):
    """My Docstring Style goes here

    :arg1: does something
    :arg2: does something_else
    :returns: a dataframe that was first somethinged and then something_elsed
    """
    def _(df: DataFrame):
        return df.do_something(arg1).do_something_else(arg2)
    return _

def test_my_logical_name_returns_none_if_args_are_equal():
    ..
    result_df: DataFrame = df.transform(my_logical_name(arg1, arg1))
    ..
```
So I am right with ya, but propose "\_" as the inner function name and the above-mentioned docstring placement. Main reasons for "_": 1) the amount of visual noise should be lowered as much as possible (when writing many transformations and things do get more complicated, this pays off) 2) if you name the inner function, people _will_ give custom names to it. This goes against 1) and uniform code, as names will start to vary.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins removed a comment on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649188105 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124509/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins removed a comment on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649188097 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
SparkQA commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649187983 **[Test build #124509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124509/testReport)** for PR 28912 at commit [`49f4335`](https://github.com/apache/spark/commit/49f4335f3a13f04bff05b0974ffa4135731d615d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
SparkQA removed a comment on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649179138 **[Test build #124509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124509/testReport)** for PR 28912 at commit [`49f4335`](https://github.com/apache/spark/commit/49f4335f3a13f04bff05b0974ffa4135731d615d).
[GitHub] [spark] AmplabJenkins commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649188097
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins removed a comment on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649179488
[GitHub] [spark] AmplabJenkins commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649179488
[GitHub] [spark] SparkQA commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
SparkQA commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649179138 **[Test build #124509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124509/testReport)** for PR 28912 at commit [`49f4335`](https://github.com/apache/spark/commit/49f4335f3a13f04bff05b0974ffa4135731d615d).
[GitHub] [spark] HyukjinKwon commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
HyukjinKwon commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649178851 retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28864: [SPARK-32004][ALL] Drop references to slave
AmplabJenkins removed a comment on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-649177945
[GitHub] [spark] AmplabJenkins commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave
AmplabJenkins commented on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-649177945
[GitHub] [spark] SparkQA removed a comment on pull request #28864: [SPARK-32004][ALL] Drop references to slave
SparkQA removed a comment on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-649120767 **[Test build #124501 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124501/testReport)** for PR 28864 at commit [`eaf29d8`](https://github.com/apache/spark/commit/eaf29d8f8671b2796bf1fc5a82ee4b292511b3fc).
[GitHub] [spark] SparkQA commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave
SparkQA commented on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-649177297 **[Test build #124501 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124501/testReport)** for PR 28864 at commit [`eaf29d8`](https://github.com/apache/spark/commit/eaf29d8f8671b2796bf1fc5a82ee4b292511b3fc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization
HyukjinKwon commented on a change in pull request #28919: URL: https://github.com/apache/spark/pull/28919#discussion_r445274348 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ## @@ -460,7 +460,12 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { // because `distinctExpressions` is not extracted during logical phase. NormalizeFloatingNumbers.normalize(e) match { case ne: NamedExpression => ne -case other => Alias(other, other.toString)() Review comment: I see, thanks. Can we at least leave a short comment why we're doing it here alone?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on a change in pull request #27331: URL: https://github.com/apache/spark/pull/27331#discussion_r445270934 ## File path: python/pyspark/sql/readwriter.py ## @@ -1048,6 +1048,128 @@ def jdbc(self, url, table, mode=None, properties=None): self.mode(mode)._jwrite.jdbc(url, table, jprop) +class DataFrameWriterV2(object): +""" +Interface used to write a class:`pyspark.sql.dataframe.DataFrame` +to external storage using the v2 API. + +.. versionadded:: 3.1.0 +""" + +def __init__(self, df, table): +self._df = df +self._spark = df.sql_ctx +self._jwriter = df._jdf.writeTo(table) + +@since(3.1) +def using(self, provider): +""" +Specifies a provider for the underlying output data source. +Spark's default catalog supports "parquet", "json", etc. +""" +self._jwriter.using(provider) +return self + +@since(3.1) +def option(self, key, value): +""" +Add a write option. +""" +self._jwriter.option(key, to_str(value)) +return self + +@since(3.1) +def options(self, **options): +""" +Add write options. +""" +options = {k: to_str(v) for k, v in options.items()} +self._jwriter.options(options) +return self + +@since(3.1) +def partitionedBy(self, col, *cols): Review comment: @rdblue, @brkyvz, @cloud-fan, Should we maybe at least use a different class for these partition column expressions such as `PartitionedColumn` like we do for `TypedColumn`?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on a change in pull request #27331: URL: https://github.com/apache/spark/pull/27331#discussion_r445268946 ## File path: python/pyspark/sql/readwriter.py ## @@ -1048,6 +1048,128 @@ def jdbc(self, url, table, mode=None, properties=None): self.mode(mode)._jwrite.jdbc(url, table, jprop) +class DataFrameWriterV2(object): +""" +Interface used to write a class:`pyspark.sql.dataframe.DataFrame` +to external storage using the v2 API. + +.. versionadded:: 3.1.0 +""" + +def __init__(self, df, table): +self._df = df +self._spark = df.sql_ctx +self._jwriter = df._jdf.writeTo(table) + +@since(3.1) +def using(self, provider): +""" +Specifies a provider for the underlying output data source. +Spark's default catalog supports "parquet", "json", etc. +""" +self._jwriter.using(provider) +return self + +@since(3.1) +def option(self, key, value): +""" +Add a write option. +""" +self._jwriter.option(key, to_str(value)) +return self + +@since(3.1) +def options(self, **options): +""" +Add write options. +""" +options = {k: to_str(v) for k, v in options.items()} +self._jwriter.options(options) +return self + +@since(3.1) +def partitionedBy(self, col, *cols): Review comment: Maybe it's important to describe what is expected for `col`: only column names and the partition transform functions are allowed, not regular Spark Columns. I still don't like that we made this API look like it takes regular Spark Columns; this was one of the reasons why Pandas UDFs were redesigned and split into two separate groups. Let's at least clarify it.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on a change in pull request #27331: URL: https://github.com/apache/spark/pull/27331#discussion_r445269268 ## File path: python/pyspark/sql/functions.py ## @@ -3300,6 +3300,88 @@ def map_zip_with(col1, col2, f): return _invoke_higher_order_function("MapZipWith", [col1, col2], [f]) +# -- Partition transform functions + +@since(3.1) +def years(col): +""" +Partition transform function: A transform for timestamps and dates Review comment: Let's also clarify that this expression should only be used with `partitionedBy` in DSv2 APIs.
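For readers following along: the transforms under review (`years`, `months`, `days`, `hours`, `bucket`) only build Catalyst expressions and are valid solely inside `partitionedBy`. The sketch below is a hypothetical pure-Python illustration of the epoch-relative values such transforms conceptually compute, not Spark's implementation (which, for example, uses Murmur3 hashing for `bucket`):

```python
from datetime import datetime, date

def years(ts):
    # Years since 1970.
    return ts.year - 1970

def months(ts):
    # Months since 1970-01.
    return (ts.year - 1970) * 12 + (ts.month - 1)

def days(ts):
    # Days since 1970-01-01.
    return (ts.date() - date(1970, 1, 1)).days

def bucket(num_buckets, value):
    # Hash modulo bucket count; Spark uses Murmur3, this sketch just
    # uses Python's built-in hash for illustration.
    return hash(value) % num_buckets

ts = datetime(2020, 6, 25, 7, 30)
print(years(ts), months(ts), days(ts))  # 50 605 18438
```

Each transform maps a timestamp to a coarse partition value, which is why the reviewers want it documented that these are partitioning expressions rather than general-purpose column functions.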
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on a change in pull request #27331: URL: https://github.com/apache/spark/pull/27331#discussion_r445267780 ## File path: python/pyspark/sql/tests/test_readwriter.py ## @@ -163,6 +163,43 @@ def test_insert_into(self): self.assertEqual(6, self.spark.sql("select * from test_table").count()) +class ReadwriterV2Tests(ReusedSQLTestCase): +def test_api(self): +from pyspark.sql.readwriter import DataFrameWriterV2 +from pyspark.sql.functions import col Review comment: I think we can just import these on the top ## File path: python/pyspark/sql/dataframe.py ## @@ -2220,6 +2220,22 @@ def semanticHash(self): sinceversion=1.4, doc=":func:`drop_duplicates` is an alias for :func:`dropDuplicates`.") +@since(3.1) +def writeTo(self, table): +""" +Create a write configuration builder for v2 sources. + +This builder is used to configure and execute write operations. + +For example, to append or create or replace existing tables. + +>>> df.writeTo("catalog.db.table").append() # doctest: +SKIP +>>> df.writeTo( # doctest: +SKIP +... "catalog.db.table" +... 
).partitionedBy($"col").createOrReplace() Review comment: I guess it shouldn't be `$"col"` ## File path: python/pyspark/sql/tests/test_readwriter.py ## @@ -163,6 +163,43 @@ def test_insert_into(self): self.assertEqual(6, self.spark.sql("select * from test_table").count()) +class ReadwriterV2Tests(ReusedSQLTestCase): +def test_api(self): +from pyspark.sql.readwriter import DataFrameWriterV2 +from pyspark.sql.functions import col + +df = self.df +writer = df.writeTo("testcat.t") +self.assertIsInstance(writer, DataFrameWriterV2) +self.assertIsInstance(writer.option("property", "value"), DataFrameWriterV2) +self.assertIsInstance(writer.options(property="value"), DataFrameWriterV2) +self.assertIsInstance(writer.using("source"), DataFrameWriterV2) +self.assertIsInstance(writer.partitionedBy("id"), DataFrameWriterV2) +self.assertIsInstance(writer.partitionedBy(col("id")), DataFrameWriterV2) + +def test_partitioning_functions(self): +import datetime +from pyspark.sql.readwriter import DataFrameWriterV2 +from pyspark.sql.functions import col, years, months, days, hours, bucket + +df = self.spark.createDataFrame( +[(1, datetime.datetime.now(), "foo")], Review comment: I would avoid the indeterministic value in the test unless it's necessary. ## File path: python/pyspark/sql/readwriter.py ## @@ -1048,6 +1048,128 @@ def jdbc(self, url, table, mode=None, properties=None): self.mode(mode)._jwrite.jdbc(url, table, jprop) +class DataFrameWriterV2(object): +""" +Interface used to write a class:`pyspark.sql.dataframe.DataFrame` +to external storage using the v2 API. + +.. versionadded:: 3.1.0 +""" + +def __init__(self, df, table): +self._df = df +self._spark = df.sql_ctx +self._jwriter = df._jdf.writeTo(table) + +@since(3.1) +def using(self, provider): +""" +Specifies a provider for the underlying output data source. +Spark's default catalog supports "parquet", "json", etc. 
+""" +self._jwriter.using(provider) +return self + +@since(3.1) +def option(self, key, value): +""" +Add a write option. +""" +self._jwriter.option(key, to_str(value)) +return self + +@since(3.1) +def options(self, **options): +""" +Add write options. +""" +options = {k: to_str(v) for k, v in options.items()} +self._jwriter.options(options) +return self + +@since(3.1) +def partitionedBy(self, col, *cols): Review comment: Maybe it's important to describe what is expected for `col`: only column names and the partition transform functions are allowed, not regular Spark Columns. I still don't like that we made this API look like it takes regular Spark Columns; this was one of the reasons why Pandas UDFs were split into two separate groups. Let's at least clarify it.
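To make the chaining in the diff above concrete, here is a minimal, hypothetical pure-Python mock of the fluent-builder shape `DataFrameWriterV2` follows: each method returns `self` so calls chain, and values are coerced to strings the way the PR does via `to_str`. The mock records calls instead of delegating to a JVM writer, so none of it is Spark's actual implementation:

```python
class MockWriterV2:
    """Toy stand-in for a fluent write-configuration builder."""

    def __init__(self, table):
        self.table = table
        self.calls = []  # recorded instead of forwarded to a JVM writer

    def using(self, provider):
        self.calls.append(("using", provider))
        return self  # returning self is what makes chaining work

    def option(self, key, value):
        # Coerce to string, mirroring the to_str() conversion in the PR.
        self.calls.append(("option", key, str(value)))
        return self

    def partitionedBy(self, col, *cols):
        # In the real API only column names or partition transform
        # expressions are allowed here, not arbitrary Columns.
        self.calls.append(("partitionedBy", (col,) + cols))
        return self

writer = MockWriterV2("catalog.db.table").using("parquet").option("retries", 3)
print(writer.calls)  # [('using', 'parquet'), ('option', 'retries', '3')]
```

The design choice the reviewers are debating is exactly what `partitionedBy` should accept; the builder shape itself is uncontroversial.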
[GitHub] [spark] HyukjinKwon commented on pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on pull request #27331: URL: https://github.com/apache/spark/pull/27331#issuecomment-649168859 > I don't think that maintenance is a huge issue here. Just saying... That's probably because you're used to the Python side. For people who don't know Python, even reading it is extra overhead. I will merge a bit later, after waiting and monitoring the changes in the DSv2 APIs, since it looks like nobody knows the answer about their stability.
[GitHub] [spark] HyukjinKwon commented on pull request #27849: [SPARK-31081][UI][SQL] Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI
HyukjinKwon commented on pull request #27849: URL: https://github.com/apache/spark/pull/27849#issuecomment-649167755 Closing in favour of #27927
[GitHub] [spark] HyukjinKwon closed pull request #27849: [SPARK-31081][UI][SQL] Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI
HyukjinKwon closed pull request #27849: URL: https://github.com/apache/spark/pull/27849
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649163956 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124504/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649163945 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins commented on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649163945
[GitHub] [spark] SparkQA removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
SparkQA removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649125587 **[Test build #124504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124504/testReport)** for PR 28852 at commit [`28da5cf`](https://github.com/apache/spark/commit/28da5cfd39dba6a8319dd1cdfe39e51ed5cbdea5).
[GitHub] [spark] SparkQA commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
SparkQA commented on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649163590 **[Test build #124504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124504/testReport)** for PR 28852 at commit [`28da5cf`](https://github.com/apache/spark/commit/28da5cfd39dba6a8319dd1cdfe39e51ed5cbdea5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649162543 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124502/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649162539 Build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
SparkQA removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649123085 **[Test build #124502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124502/testReport)** for PR 28852 at commit [`d9c5bf7`](https://github.com/apache/spark/commit/d9c5bf7b3481e6ff4025c534bef952717b1275b7).
[GitHub] [spark] AmplabJenkins commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins commented on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649162539
[GitHub] [spark] SparkQA commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
SparkQA commented on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649162330 **[Test build #124502 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124502/testReport)** for PR 28852 at commit [`d9c5bf7`](https://github.com/apache/spark/commit/d9c5bf7b3481e6ff4025c534bef952717b1275b7). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins removed a comment on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649160040
[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649160040
[GitHub] [spark] SparkQA commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
SparkQA commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649159672 **[Test build #124508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124508/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).
[GitHub] [spark] maropu commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
maropu commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649158798 retest this please
[GitHub] [spark] maropu commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
maropu commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649158887 @HyukjinKwon @viirya no more comments?
[GitHub] [spark] maropu commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445255528 ## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt ## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m UnsafeExternalSorter 11 11 1 14.7 68.0 1.0X ExternalAppendOnlyUnsafeRowArray 9 10 1 17.1 58.5 1.2X - +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic Review comment: > My understanding was to add new benchmark results to the file. I didn't change other results in the files. Do you want me to update all results? We need to check that there's no perf regression on JDK 8/JDK 11 (sometimes we hit perf regressions unexpectedly). But, yeah, we might not need the update if you've already checked locally.
[GitHub] [spark] maropu commented on pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on pull request #27246: URL: https://github.com/apache/spark/pull/27246#issuecomment-649158069 Looks okay to me. Could anyone check this? @cloud-fan @dongjoon-hyun @JoshRosen @jiangxb1987
[GitHub] [spark] maropu commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445256544 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArrayBenchmark.scala ## @@ -182,6 +182,47 @@ object ExternalAppendOnlyUnsafeRowArrayBenchmark extends BenchmarkBase { } } + def testAgainstUnsafeSorterSpillReader( + numSpillThreshold: Int, + numRows: Int, + numIterators: Int, + iterations: Int): Unit = { +val rows = testRows(numRows) +val benchmark = new Benchmark(s"Spilling SpillReader with $numRows rows", iterations * numRows, + output = output) + +benchmark.addCase("UnsafeSorterSpillReader_bufferSize1024") { _: Int => + val array = UnsafeExternalSorter.create( +TaskContext.get().taskMemoryManager(), +SparkEnv.get.blockManager, +SparkEnv.get.serializerManager, +TaskContext.get(), +null, +null, +1024, +SparkEnv.get.memoryManager.pageSizeBytes, +numSpillThreshold, +false) + + rows.foreach(x => +array.insertRecord( + x.getBaseObject, + x.getBaseOffset, + x.getSizeInBytes, + 0, + false)) + + for (_ <- 0L until numIterators) { Review comment: Could you leave some comments about that here?
[GitHub] [spark] maropu commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445256446 ## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt ## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m UnsafeExternalSorter 11 11 1 14.7 68.0 1.0X ExternalAppendOnlyUnsafeRowArray 9 10 1 17.1 58.5 1.2X - +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic +Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz +Spilling SpillReader with 16000 rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative + +UnsafeSorterSpillReader_bufferSize1024 411 426 13 0.6 1607.2 1.0X Review comment: What's the number without this PR?
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445255141 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -113,6 +113,11 @@ object NestedColumnAliasing { case _: Sample => true case _: RepartitionByExpression => true case _: Join => true +case x: Filter => x.child match { + case _: Window => true Review comment: Looks like the plan is a `Project -> Filter -> Window`. If we only do `case _: Window => true`, the projection aliasing won't be available at the `Window` stage, and can't be passed on to the later stages described in the ticket.
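As background for the pruning discussion above, here is a minimal sketch of the idea behind nested-column (schema) pruning: keep only the leaf fields the plan actually references, so the reader can skip the rest of a deeply nested struct. Spark's `NestedColumnAliasing` operates on Catalyst expressions; this toy version assumes a dict-based schema and dotted field paths purely for illustration:

```python
def prune_schema(schema, accessed_paths):
    """Keep only the fields named by dotted paths like 'a.b.c'."""
    pruned = {}
    for path in accessed_paths:
        parts = path.split(".")
        src, dst = schema, pruned
        # Walk down the nested structs, creating matching levels in the
        # pruned schema as we go.
        for p in parts[:-1]:
            src = src[p]
            dst = dst.setdefault(p, {})
        dst[parts[-1]] = src[parts[-1]]
    return pruned

schema = {
    "name": {"first": "string", "last": "string"},
    "address": {"city": "string", "zip": "string"},
}
print(prune_schema(schema, ["name.first"]))  # {'name': {'first': 'string'}}
```

The PR's question is about *where* such pruning may safely apply: a `Filter` over a `Window` must pass the aliases through, or the pruned projection never reaches the scan.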
[GitHub] [spark] maropu commented on pull request #28923: [SPARK-32090][SQL] UserDefinedType.equal() should be symmetrical
maropu commented on pull request #28923: URL: https://github.com/apache/spark/pull/28923#issuecomment-649155603 LGTM except for @cloud-fan's comment.
[GitHub] [spark] maropu commented on a change in pull request #28923: [SPARK-32090][SQL] UserDefinedType.equal() should be symmetrical
maropu commented on a change in pull request #28923: URL: https://github.com/apache/spark/pull/28923#discussion_r445254621 ## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala ## @@ -134,6 +134,17 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque MyLabeledPoint(1.0, new TestUDT.MyDenseVector(Array(0.1, 1.0))), MyLabeledPoint(0.0, new TestUDT.MyDenseVector(Array(0.3, 3.0.toDF() + + test("equal") { +val udt1 = new ExampleBaseTypeUDT +val udt2 = new ExampleSubTypeUDT +val udt3 = new ExampleSubTypeUDT +assert(!(udt1 === udt2)) Review comment: nit: `!==`?