[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649269967 Let me see if I can get a more generalized solution out today. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] venkata91 commented on pull request #28287: [SPARK-31418][SCHEDULER] Request more executors in case of dynamic allocation is enabled and a task becomes unschedulable due to spark's bl
venkata91 commented on pull request #28287: URL: https://github.com/apache/spark/pull/28287#issuecomment-649263104 No worries @tgravescs, thanks for taking a look. For some reason these checks keep failing, but the failures don't look related to my changes. Some cache issue, probably?
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445331787

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

## @@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment: Do you have a sample query in mind that produces `Project->Filter->Sample`? I've been trying to come up with a query that generates this plan.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins removed a comment on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649245548
[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649245548
[GitHub] [spark] SparkQA commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
SparkQA commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649244202

**[Test build #124508 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124508/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
SparkQA removed a comment on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649159672 **[Test build #124508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124508/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).
[GitHub] [spark] karuppayya commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
karuppayya commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649241132 @maropu I have fixed the tests with the flag enabled. Please take a look. Should I go ahead and change the default value of the config back to `false`?
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
viirya commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445321841

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

## @@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment: I see, it is due to the predicate pushdown rule. I think we need a general solution, as @maropu said.
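The `Project->Filter->Window` shape under discussion arises when a query filters on the output of a window function: the predicate references the window column, so it cannot be pushed below the Window node. As an illustration only (SQLite here rather than Spark, with made-up table and column names), a minimal sketch of such a query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, ts INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 1), ("a", 2), ("b", 5)])

# The filter on rn depends on the window function's result, so it can only
# be evaluated above the window operator -- hence Project->Filter->Window.
rows = conn.execute("""
    SELECT user, ts FROM (
        SELECT user, ts,
               row_number() OVER (PARTITION BY user ORDER BY ts DESC) AS rn
        FROM events
    ) WHERE rn = 1
""").fetchall()
print(sorted(rows))  # one row per user: the latest event
```

Schema pruning for this plan shape means the nested columns read from the source can still be narrowed to those the outer `SELECT` and the filter actually use.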
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
AmplabJenkins removed a comment on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649234400
[GitHub] [spark] AmplabJenkins commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
AmplabJenkins commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649234400
[GitHub] [spark] SparkQA removed a comment on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
SparkQA removed a comment on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649153764 **[Test build #124507 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124507/testReport)** for PR 28898 at commit [`b1cad9a`](https://github.com/apache/spark/commit/b1cad9ad759c3e1d2ef9efd0f9390c6924e412df).
[GitHub] [spark] SparkQA commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
SparkQA commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649233655

**[Test build #124507 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124507/testReport)** for PR 28898 at commit [`b1cad9a`](https://github.com/apache/spark/commit/b1cad9ad759c3e1d2ef9efd0f9390c6924e412df).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] maropu commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445319707

## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt

## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m
 UnsafeExternalSorter                      11    11    1    14.7    68.0    1.0X
 ExternalAppendOnlyUnsafeRowArray           9    10    1    17.1    58.5    1.2X
-
+OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic
+Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
+Spilling SpillReader with 16000 rows:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+
+UnsafeSorterSpillReader_bufferSize1024    411    426    13    0.6    1607.2    1.0X

Review comment: oh, looks nice.
[GitHub] [spark] TJX2014 edited a comment on pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
TJX2014 edited a comment on pull request #28926: URL: https://github.com/apache/spark/pull/28926#issuecomment-649231790 Follow: https://github.com/apache/spark/pull/28856 cc @cloud-fan @MaxGekk
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
AmplabJenkins removed a comment on pull request #28926: URL: https://github.com/apache/spark/pull/28926#issuecomment-649230613 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
AmplabJenkins commented on pull request #28926: URL: https://github.com/apache/spark/pull/28926#issuecomment-649230871 Can one of the admins verify this patch?
[GitHub] [spark] TJX2014 commented on a change in pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
TJX2014 commented on a change in pull request #28926: URL: https://github.com/apache/spark/pull/28926#discussion_r445317708

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

## @@ -2612,6 +2614,9 @@ object Sequence {
     val stepDays = step.days
     val stepMicros = step.microseconds

+    require(scale != MICROS_PER_DAY || stepMonths != 0 || stepDays != 0,

Review comment: Add the scale constraint here for `DateType`.
[GitHub] [spark] TJX2014 opened a new pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
TJX2014 opened a new pull request #28926: URL: https://github.com/apache/spark/pull/28926

### What changes were proposed in this pull request?
Add a unit test. Fix a bug in `org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl`. Add a `sequence step must be a day interval if start and end values are dates` constraint. Follow-up of https://github.com/apache/spark/pull/28856.

### Why are the changes needed?
Spark's `sequence` doesn't handle date increments that cross DST.

### Does this PR introduce _any_ user-facing change?
Before the PR, users could get an incorrect result. With `spark.sql.session.timeZone` set to `Asia/Shanghai`, `America/Chicago`, and `GMT` respectively, `sql("select sequence(cast('2011-03-01' as date), cast('2011-05-01' as date), interval 1 month)").show(false)` returned `[2011-03-01, 2011-04-01, 2011-05-01]`, **`[2011-03-01, 2011-03-28, 2011-04-28]`**, and `[2011-03-01, 2011-04-01, 2011-05-01]`. After the PR, the sequence date conversion is corrected: all three time zones yield `[2011-03-01, 2011-04-01, 2011-05-01]`.

### How was this patch tested?
Unit test.
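The drift described above can be reproduced outside Spark. A minimal Python sketch (not Spark's implementation; it assumes stdlib `zoneinfo` tz data is available) contrasting fixed-duration stepping with calendar-month stepping across the 2011-03-13 US DST transition:

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

chicago = ZoneInfo("America/Chicago")
utc = ZoneInfo("UTC")
start = datetime(2011, 3, 1, tzinfo=chicago)

# Fixed-duration stepping: approximate "one month" as 30 days of absolute
# time. DST starts on 2011-03-13 in America/Chicago, so an hour vanishes
# and the result drifts off the calendar grid.
fixed = (start.astimezone(utc) + timedelta(days=30)).astimezone(chicago)

# Calendar stepping: increment the month field, keeping the local date.
calendar = start.replace(month=4)

print(fixed.date())     # 2011-03-31, not a calendar month later
print(calendar.date())  # 2011-04-01
```

This is why a date-typed `sequence` needs calendar arithmetic (or the day-interval constraint above) rather than stepping by a fixed number of microseconds.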
[GitHub] [spark] AmplabJenkins commented on pull request #28926: [SPARK-31982][SQL][FOLLOWUP]Function sequence doesn't handle date increments that cross DST
AmplabJenkins commented on pull request #28926: URL: https://github.com/apache/spark/pull/28926#issuecomment-649230613 Can one of the admins verify this patch?
[GitHub] [spark] gatorsmile commented on pull request #28902: [SPARK-31801][API][SHUFFLE][TESTS] Tests for registering map output metadata
gatorsmile commented on pull request #28902: URL: https://github.com/apache/spark/pull/28902#issuecomment-649226867 cc @Ngone51 @jiangxb1987
[GitHub] [spark] gatorsmile commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
gatorsmile commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649226614 retest this please
[GitHub] [spark] frankyin-factual commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649224393

> btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

Just updated the PR.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
AmplabJenkins removed a comment on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649221849
[GitHub] [spark] AmplabJenkins commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
AmplabJenkins commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649221849
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445309526

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala

## @@ -113,6 +113,11 @@ object NestedColumnAliasing {
     case _: Sample => true
     case _: RepartitionByExpression => true
     case _: Join => true
+    case x: Filter => x.child match {
+      case _: Window => true

Review comment: That won't work because it seems to cause an infinite loop in the optimizer; it gives me errors about running out of max iterations.
[GitHub] [spark] SparkQA commented on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
SparkQA commented on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649221158

**[Test build #124505 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124505/testReport)** for PR 28804 at commit [`afc2903`](https://github.com/apache/spark/commit/afc2903e4a327d6caef518e6d3f0dc431424ac7c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #28804: [SPARK-31973][SQL] Add ability to disable Sort,Spill in Partial aggregation
SparkQA removed a comment on pull request #28804: URL: https://github.com/apache/spark/pull/28804#issuecomment-649138531 **[Test build #124505 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124505/testReport)** for PR 28804 at commit [`afc2903`](https://github.com/apache/spark/commit/afc2903e4a327d6caef518e6d3f0dc431424ac7c).
[GitHub] [spark] siknezevic commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445307223

## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt

## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m
 UnsafeExternalSorter                      11    11    1    14.7    68.0    1.0X
 ExternalAppendOnlyUnsafeRowArray           9    10    1    17.1    58.5    1.2X
-
+OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic
+Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
+Spilling SpillReader with 16000 rows:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+
+UnsafeSorterSpillReader_bufferSize1024    411    426    13    0.6    1607.2    1.0X

Review comment: There are three spill files, which results in three I/Os for each created iterator, because it has to read the number of rows from each spilled file. That is why we see a more than 2000X performance difference.
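The buffer-size effect this benchmark measures can be sketched generically: a larger read buffer amortizes raw I/O calls when a reader consumes many small records. A minimal Python illustration (not the Spark code; `CountingRaw` is a made-up helper standing in for a spill file):

```python
import io

class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts how many raw read calls are issued."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.reads = 0
    def readable(self):
        return True
    def readinto(self, b):
        self.reads += 1
        return self._buf.readinto(b)

data = bytes(1024 * 1024)  # 1 MiB standing in for spilled rows

def raw_read_calls(buffer_size):
    raw = CountingRaw(data)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    while reader.read(64):  # consume the stream in 64-byte "records"
        pass
    return raw.reads

small = raw_read_calls(1024)         # roughly one raw read per KiB consumed
large = raw_read_calls(1024 * 1024)  # the whole stream in about one raw read
print(small, large)
```

With a 1 KiB buffer the raw layer is hit about a thousand times; with a 1 MiB buffer only a couple of times, which is the same trade-off a spill reader's buffer size controls.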
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
AmplabJenkins removed a comment on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-649217406
[GitHub] [spark] siknezevic commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445306120

## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt

## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m
 UnsafeExternalSorter                      11    11    1    14.7    68.0    1.0X
 ExternalAppendOnlyUnsafeRowArray           9    10    1    17.1    58.5    1.2X
-
+OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic
+Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
+Spilling SpillReader with 16000 rows:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
+
+UnsafeSorterSpillReader_bufferSize1024    411    426    13    0.6    1607.2    1.0X

Review comment: Here it is:

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic
Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
Spilling SpillReader with 16000 rows:  Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
UnsafeSorterSpillReader    932683    943898    NaN    0.0    3643292.4    1.0X
[GitHub] [spark] AmplabJenkins commented on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
AmplabJenkins commented on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-649217406
[GitHub] [spark] SparkQA removed a comment on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
SparkQA removed a comment on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-649140417 **[Test build #124506 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124506/testReport)** for PR 28761 at commit [`bd691ed`](https://github.com/apache/spark/commit/bd691ed16eade2e63c0fdd8d2bbd88282f6c4662).
[GitHub] [spark] SparkQA commented on pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
SparkQA commented on pull request #28761: URL: https://github.com/apache/spark/pull/28761#issuecomment-649216688 **[Test build #124506 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124506/testReport)** for PR 28761 at commit [`bd691ed`](https://github.com/apache/spark/commit/bd691ed16eade2e63c0fdd8d2bbd88282f6c4662). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins commented on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
AmplabJenkins commented on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649215908
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
AmplabJenkins removed a comment on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649215908
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
HyukjinKwon commented on a change in pull request #28896: URL: https://github.com/apache/spark/pull/28896#discussion_r444833442 ## File path: sql/core/src/test/resources/test-data/mixed-types1.csv ## @@ -0,0 +1,4 @@ +col_mixed_types +2012 +1997 +True Review comment: Let's not bother creating a new data file here. You can write the test code as below instead.
```scala
withTempPath { path =>
  Seq("col_mixed_types", "2012", "1997", "True").toDS.write.text(path.getCanonicalPath)
  val df = spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(path.getCanonicalPath)
  assert(df.schema.last == StructField("col_mixed_types", StringType, true))
}
```
[GitHub] [spark] SparkQA commented on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
SparkQA commented on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649215558 **[Test build #124510 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124510/testReport)** for PR 28896 at commit [`4629bb5`](https://github.com/apache/spark/commit/4629bb5556564b2ebcedca88157f45aa95982f00).
[GitHub] [spark] siknezevic commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445304792 ## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt ## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m UnsafeExternalSorter 11 11 1 14.7 68.0 1.0X ExternalAppendOnlyUnsafeRowArray 9 10 1 17.1 58.5 1.2X - +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic Review comment: Yes, I did a comparison of the existing micro-benchmark results with the patch applied and without it. I do not see a regression in performance for the existing micro-benchmarks. I will push micro-benchmark results for JDK 11 soon. Thank you for the help 😊
[GitHub] [spark] HyukjinKwon removed a comment on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
HyukjinKwon removed a comment on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649214971 retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
AmplabJenkins removed a comment on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-647758404 Can one of the admins verify this patch?
[GitHub] [spark] HyukjinKwon commented on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
HyukjinKwon commented on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649215341 ok to test
[GitHub] [spark] HyukjinKwon commented on pull request #28896: [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
HyukjinKwon commented on pull request #28896: URL: https://github.com/apache/spark/pull/28896#issuecomment-649214971 retest this please
[GitHub] [spark] HyukjinKwon commented on pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization
HyukjinKwon commented on pull request #28919: URL: https://github.com/apache/spark/pull/28919#issuecomment-649213796 @cloud-fan, shall we add a short comment https://github.com/apache/spark/pull/28919#discussion_r445274348?
[GitHub] [spark] siknezevic commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
siknezevic commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445299048 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArrayBenchmark.scala ## @@ -182,6 +182,47 @@ object ExternalAppendOnlyUnsafeRowArrayBenchmark extends BenchmarkBase { } } + def testAgainstUnsafeSorterSpillReader( + numSpillThreshold: Int, + numRows: Int, + numIterators: Int, + iterations: Int): Unit = { +val rows = testRows(numRows) +val benchmark = new Benchmark(s"Spilling SpillReader with $numRows rows", iterations * numRows, + output = output) + +benchmark.addCase("UnsafeSorterSpillReader_bufferSize1024") { _: Int => + val array = UnsafeExternalSorter.create( +TaskContext.get().taskMemoryManager(), +SparkEnv.get.blockManager, +SparkEnv.get.serializerManager, +TaskContext.get(), +null, +null, +1024, +SparkEnv.get.memoryManager.pageSizeBytes, +numSpillThreshold, +false) + + rows.foreach(x => +array.insertRecord( + x.getBaseObject, + x.getBaseOffset, + x.getSizeInBytes, + 0, + false)) + + for (_ <- 0L until numIterators) { Review comment: During execution of a sort-merge join (Left Semi Join), for each left-side row the “right matches” are found and stored in an ExternalAppendOnlyUnsafeRowArray object. That object is created when the first row on the left side of the join is processed, and then reused if the next left-side row is the same as the previous one. In the case of Queries 14a/14b there are millions of “right matches” rows, which cannot fit into memory. To run this query spilling is enabled, and the “right matches” data is moved from ExternalAppendOnlyUnsafeRowArray into UnsafeExternalSorter and then spilled onto disk. To perform the join operation on a left-side row, you have to create an iterator on top of the “right matches” rows.
Creating an iterator on top of the “right matches” is repeated for every processed row on the left side of the join. When millions of left-side rows are processed, an iterator over the spilled “right matches” rows is created each time, i.e. millions of times. The current Spark implementation creates the iterator eagerly and performs I/O, because it reads the number of rows stored in the spill files, even though the iterator is never actually advanced during the join: it is created, never used, and then discarded with each processed join row. This results in millions of I/Os, and one I/O takes 2 or 3 milliseconds. Hence this PR, which creates a lazy iterator (a "lazy" constructor for UnsafeSorterSpillReader), so no I/O is done until iteration starts. My micro-benchmark likewise simulates the creation of iterators on top of spill files that contain “right matches”. Sorry for the long explanation; I'm not sure I can make it simpler, but I hope it clarifies why I created the micro-benchmark this way.
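The lazy-iterator idea described above can be sketched in plain Scala (hypothetical names only; this is not the actual `UnsafeSorterSpillReader` code): wrapping the expensive reader construction in a `lazy val` means an iterator that is created and then discarded without being consumed never touches the disk.

```scala
// Sketch: defer the expensive spill-file open until the iterator is
// actually consumed, so create-and-discard costs no I/O.
class LazySpillIterator[T](openReader: () => Iterator[T]) extends Iterator[T] {
  // The underlying reader (and its file I/O) is created on first access only.
  private lazy val underlying: Iterator[T] = openReader()
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = underlying.next()
}

var opens = 0
// Stands in for opening a spill file and reading its row count.
def open(): Iterator[Int] = { opens += 1; Iterator(1, 2, 3) }

val discarded = new LazySpillIterator(open) // created but never consumed: no "I/O"
val used = new LazySpillIterator(open)
val sum = used.sum                          // first access triggers the open
// opens == 1, sum == 6
```

The same shape applies per left-side join row: eagerly constructed readers pay the open cost every time, while the lazy wrapper pays it only for iterators that are actually advanced.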
[GitHub] [spark] viirya commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
viirya commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445294440 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -113,6 +113,11 @@ object NestedColumnAliasing { case _: Sample => true case _: RepartitionByExpression => true case _: Join => true +case x: Filter => x.child match { + case _: Window => true Review comment: Then just add `case _: Filter => true`, if you want to let Project be pushed through `Filter`?
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
maropu commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445291555 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -113,6 +113,11 @@ object NestedColumnAliasing { case _: Sample => true case _: RepartitionByExpression => true case _: Join => true +case x: Filter => x.child match { + case _: Window => true Review comment: But it seems `Project->Filter->Sample` has the same issue. It looks like we need a more general solution for that.
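The generalization under discussion can be sketched with a toy operator type (hypothetical; Spark's real `LogicalPlan` and `NestedColumnAliasing` differ): instead of special-casing a `Filter` whose child is a `Window`, mark `Filter` itself as push-through-able, which then covers chains like `Project->Filter->Sample` as well.

```scala
// Toy operator tree standing in for Catalyst's LogicalPlan (hypothetical).
sealed trait Plan
case object Scan extends Plan
case class Window(child: Plan) extends Plan
case class Sample(child: Plan) extends Plan
case class Filter(child: Plan) extends Plan

// Special-cased rule from the diff: only Filter-over-Window qualifies.
def specialCased(p: Plan): Boolean = p match {
  case Filter(_: Window) => true
  case _ => false
}

// Generalized rule: Filter qualifies regardless of its child.
def generalized(p: Plan): Boolean = p match {
  case _: Window | _: Sample | _: Filter => true
  case _ => false
}

val overSample = Filter(Sample(Scan))
val special = specialCased(overSample) // not covered by the narrow rule
val general = generalized(overSample)  // covered by the general rule
```

This is only an illustration of the two matching strategies being compared in the thread, not the eventual fix.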
[GitHub] [spark] maropu commented on pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
maropu commented on pull request #28898: URL: https://github.com/apache/spark/pull/28898#issuecomment-649199639 btw, could you follow the PR template? https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
[GitHub] [spark] maropu commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
maropu commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445291555 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -113,6 +113,11 @@ object NestedColumnAliasing { case _: Sample => true case _: RepartitionByExpression => true case _: Join => true +case x: Filter => x.child match { + case _: Window => true Review comment: But, `Project->Filter->Sample` has the same issue. So, it seems to be a different issue.
[GitHub] [spark] Hellsen83 commented on pull request #23877: [SPARK-26449][PYTHON] Add transform method to DataFrame API
Hellsen83 commented on pull request #23877: URL: https://github.com/apache/spark/pull/23877#issuecomment-649198577 Hello @MrPowers , you are right, this is in fact motivated by your excellent blog post - thank you so much for that! From my experience - i.e. bringing this style of writing PySpark transformations into a heterogeneous group of roughly 15 devs/data scientists - the following was used most frequently, and people new to the game were able to pick it up quickly:
```
def my_logical_name(arg1: type1, arg2: type2):
    """My Docstring Style goes here

    :arg1: does something
    :arg2: does something_else
    :returns: a dataframe that was first somethinged and then something_elsed
    """
    def _(df: DataFrame):
        return df.do_something(arg1).do_something_else(arg2)
    return _

def test_my_logical_name_returns_none_if_args_are_equal():
    ..
    result_df: DataFrame = df.transform(my_logical_name(arg1, arg1))
    ..
```
So I am right with ya, but propose "\_" as the inner function name and the above-mentioned docstring placement. Main reasons for "_": 1) the amount of visual noise should be lowered as much as possible (when writing many transformations and things do get more complicated, this pays off) 2) if you name the inner function, people _will_ give custom names to it. This goes against 1) and uniform code, as names will start to vary.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins removed a comment on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649188105 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124509/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins removed a comment on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649188097 Merged build finished. Test FAILed.
[GitHub] [spark] SparkQA commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
SparkQA commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649187983 **[Test build #124509 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124509/testReport)** for PR 28912 at commit [`49f4335`](https://github.com/apache/spark/commit/49f4335f3a13f04bff05b0974ffa4135731d615d). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
SparkQA removed a comment on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649179138 **[Test build #124509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124509/testReport)** for PR 28912 at commit [`49f4335`](https://github.com/apache/spark/commit/49f4335f3a13f04bff05b0974ffa4135731d615d).
[GitHub] [spark] AmplabJenkins commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649188097
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins removed a comment on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649179488
[GitHub] [spark] AmplabJenkins commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
AmplabJenkins commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649179488
[GitHub] [spark] SparkQA commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
SparkQA commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649179138 **[Test build #124509 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124509/testReport)** for PR 28912 at commit [`49f4335`](https://github.com/apache/spark/commit/49f4335f3a13f04bff05b0974ffa4135731d615d).
[GitHub] [spark] HyukjinKwon commented on pull request #28912: [SPARK-32057][SQL] ExecuteStatement: cancel and close should not transiently ERROR
HyukjinKwon commented on pull request #28912: URL: https://github.com/apache/spark/pull/28912#issuecomment-649178851 retest this please
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28864: [SPARK-32004][ALL] Drop references to slave
AmplabJenkins removed a comment on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-649177945
[GitHub] [spark] AmplabJenkins commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave
AmplabJenkins commented on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-649177945
[GitHub] [spark] SparkQA removed a comment on pull request #28864: [SPARK-32004][ALL] Drop references to slave
SparkQA removed a comment on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-649120767 **[Test build #124501 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124501/testReport)** for PR 28864 at commit [`eaf29d8`](https://github.com/apache/spark/commit/eaf29d8f8671b2796bf1fc5a82ee4b292511b3fc).
[GitHub] [spark] SparkQA commented on pull request #28864: [SPARK-32004][ALL] Drop references to slave
SparkQA commented on pull request #28864: URL: https://github.com/apache/spark/pull/28864#issuecomment-649177297 **[Test build #124501 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124501/testReport)** for PR 28864 at commit [`eaf29d8`](https://github.com/apache/spark/commit/eaf29d8f8671b2796bf1fc5a82ee4b292511b3fc). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28919: [SPARK-32038][SQL][FOLLOWUP] Make the alias name pretty after float/double normalization
HyukjinKwon commented on a change in pull request #28919: URL: https://github.com/apache/spark/pull/28919#discussion_r445274348 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala ## @@ -460,7 +460,12 @@ abstract class SparkStrategies extends QueryPlanner[SparkPlan] { // because `distinctExpressions` is not extracted during logical phase. NormalizeFloatingNumbers.normalize(e) match { case ne: NamedExpression => ne -case other => Alias(other, other.toString)() Review comment: I see, thanks. Can we at least leave a short comment why we're doing it here alone?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on a change in pull request #27331: URL: https://github.com/apache/spark/pull/27331#discussion_r445270934 ## File path: python/pyspark/sql/readwriter.py ## @@ -1048,6 +1048,128 @@ def jdbc(self, url, table, mode=None, properties=None): self.mode(mode)._jwrite.jdbc(url, table, jprop) +class DataFrameWriterV2(object): +""" +Interface used to write a class:`pyspark.sql.dataframe.DataFrame` +to external storage using the v2 API. + +.. versionadded:: 3.1.0 +""" + +def __init__(self, df, table): +self._df = df +self._spark = df.sql_ctx +self._jwriter = df._jdf.writeTo(table) + +@since(3.1) +def using(self, provider): +""" +Specifies a provider for the underlying output data source. +Spark's default catalog supports "parquet", "json", etc. +""" +self._jwriter.using(provider) +return self + +@since(3.1) +def option(self, key, value): +""" +Add a write option. +""" +self._jwriter.option(key, to_str(value)) +return self + +@since(3.1) +def options(self, **options): +""" +Add write options. +""" +options = {k: to_str(v) for k, v in options.items()} +self._jwriter.options(options) +return self + +@since(3.1) +def partitionedBy(self, col, *cols): Review comment: @rdblue, @brkyvz, @cloud-fan, Should we maybe at least use a different class for these partition column expressions such as `PartitionedColumn` like we do for `TypedColumn`?
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on a change in pull request #27331: URL: https://github.com/apache/spark/pull/27331#discussion_r445268946 ## File path: python/pyspark/sql/readwriter.py ## @@ -1048,6 +1048,128 @@ def jdbc(self, url, table, mode=None, properties=None): self.mode(mode)._jwrite.jdbc(url, table, jprop) +class DataFrameWriterV2(object): +""" +Interface used to write a class:`pyspark.sql.dataframe.DataFrame` +to external storage using the v2 API. + +.. versionadded:: 3.1.0 +""" + +def __init__(self, df, table): +self._df = df +self._spark = df.sql_ctx +self._jwriter = df._jdf.writeTo(table) + +@since(3.1) +def using(self, provider): +""" +Specifies a provider for the underlying output data source. +Spark's default catalog supports "parquet", "json", etc. +""" +self._jwriter.using(provider) +return self + +@since(3.1) +def option(self, key, value): +""" +Add a write option. +""" +self._jwriter.option(key, to_str(value)) +return self + +@since(3.1) +def options(self, **options): +""" +Add write options. +""" +options = {k: to_str(v) for k, v in options.items()} +self._jwriter.options(options) +return self + +@since(3.1) +def partitionedBy(self, col, *cols): Review comment: Maybe it's important to describe what is expected for `col`: only column names and the partition transform functions are allowed, not regular Spark Columns. I still don't like that we made this API look like it takes regular Spark Columns; this was one of the reasons why Pandas UDFs were redesigned and split into two separate groups. Let's at least clarify it.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on a change in pull request #27331: URL: https://github.com/apache/spark/pull/27331#discussion_r445269268 ## File path: python/pyspark/sql/functions.py ## @@ -3300,6 +3300,88 @@ def map_zip_with(col1, col2, f): return _invoke_higher_order_function("MapZipWith", [col1, col2], [f]) +# -- Partition transform functions + +@since(3.1) +def years(col): +""" +Partition transform function: A transform for timestamps and dates Review comment: Let's also clarify that this expression should only be used with `partitionedBy` in DSv2 APIs.
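For readers following along: the transforms under review (`years`, `months`, `days`, `hours`, `bucket`) only build Catalyst expressions and are valid solely inside `partitionedBy`. The sketch below is a hypothetical pure-Python illustration of the epoch-relative values such transforms conceptually compute, not Spark's implementation (which, for example, uses Murmur3 hashing for `bucket`):

```python
from datetime import datetime, date

def years(ts):
    # Years since 1970.
    return ts.year - 1970

def months(ts):
    # Months since 1970-01.
    return (ts.year - 1970) * 12 + (ts.month - 1)

def days(ts):
    # Days since 1970-01-01.
    return (ts.date() - date(1970, 1, 1)).days

def bucket(num_buckets, value):
    # Hash modulo bucket count; Spark uses Murmur3, this sketch just
    # uses Python's built-in hash for illustration.
    return hash(value) % num_buckets

ts = datetime(2020, 6, 25, 7, 30)
print(years(ts), months(ts), days(ts))  # 50 605 18438
```

Each transform maps a timestamp to a coarse partition value, which is why the reviewers want it documented that these are partitioning expressions rather than general-purpose column functions.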
[GitHub] [spark] HyukjinKwon commented on a change in pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on a change in pull request #27331: URL: https://github.com/apache/spark/pull/27331#discussion_r445267780 ## File path: python/pyspark/sql/tests/test_readwriter.py ## @@ -163,6 +163,43 @@ def test_insert_into(self): self.assertEqual(6, self.spark.sql("select * from test_table").count()) +class ReadwriterV2Tests(ReusedSQLTestCase): +def test_api(self): +from pyspark.sql.readwriter import DataFrameWriterV2 +from pyspark.sql.functions import col Review comment: I think we can just import these on the top ## File path: python/pyspark/sql/dataframe.py ## @@ -2220,6 +2220,22 @@ def semanticHash(self): sinceversion=1.4, doc=":func:`drop_duplicates` is an alias for :func:`dropDuplicates`.") +@since(3.1) +def writeTo(self, table): +""" +Create a write configuration builder for v2 sources. + +This builder is used to configure and execute write operations. + +For example, to append or create or replace existing tables. + +>>> df.writeTo("catalog.db.table").append() # doctest: +SKIP +>>> df.writeTo( # doctest: +SKIP +... "catalog.db.table" +... 
).partitionedBy($"col").createOrReplace() Review comment: I guess it shouldn't be `$"col"` ## File path: python/pyspark/sql/tests/test_readwriter.py ## @@ -163,6 +163,43 @@ def test_insert_into(self): self.assertEqual(6, self.spark.sql("select * from test_table").count()) +class ReadwriterV2Tests(ReusedSQLTestCase): +def test_api(self): +from pyspark.sql.readwriter import DataFrameWriterV2 +from pyspark.sql.functions import col + +df = self.df +writer = df.writeTo("testcat.t") +self.assertIsInstance(writer, DataFrameWriterV2) +self.assertIsInstance(writer.option("property", "value"), DataFrameWriterV2) +self.assertIsInstance(writer.options(property="value"), DataFrameWriterV2) +self.assertIsInstance(writer.using("source"), DataFrameWriterV2) +self.assertIsInstance(writer.partitionedBy("id"), DataFrameWriterV2) +self.assertIsInstance(writer.partitionedBy(col("id")), DataFrameWriterV2) + +def test_partitioning_functions(self): +import datetime +from pyspark.sql.readwriter import DataFrameWriterV2 +from pyspark.sql.functions import col, years, months, days, hours, bucket + +df = self.spark.createDataFrame( +[(1, datetime.datetime.now(), "foo")], Review comment: I would avoid the indeterministic value in the test unless it's necessary. ## File path: python/pyspark/sql/readwriter.py ## @@ -1048,6 +1048,128 @@ def jdbc(self, url, table, mode=None, properties=None): self.mode(mode)._jwrite.jdbc(url, table, jprop) +class DataFrameWriterV2(object): +""" +Interface used to write a class:`pyspark.sql.dataframe.DataFrame` +to external storage using the v2 API. + +.. versionadded:: 3.1.0 +""" + +def __init__(self, df, table): +self._df = df +self._spark = df.sql_ctx +self._jwriter = df._jdf.writeTo(table) + +@since(3.1) +def using(self, provider): +""" +Specifies a provider for the underlying output data source. +Spark's default catalog supports "parquet", "json", etc. 
+""" +self._jwriter.using(provider) +return self + +@since(3.1) +def option(self, key, value): +""" +Add a write option. +""" +self._jwriter.option(key, to_str(value)) +return self + +@since(3.1) +def options(self, **options): +""" +Add write options. +""" +options = {k: to_str(v) for k, v in options.items()} +self._jwriter.options(options) +return self + +@since(3.1) +def partitionedBy(self, col, *cols): Review comment: Maybe it's important to describe what is expected for `col`: only column names and the partition transform functions are allowed, not regular Spark Columns. I still don't like that we made this API look like it takes regular Spark Columns; this was one of the reasons why Pandas UDFs were split into two separate groups. Let's at least clarify it.
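To make the chaining in the diff above concrete, here is a minimal, hypothetical pure-Python mock of the fluent-builder shape `DataFrameWriterV2` follows: each method returns `self` so calls chain, and values are coerced to strings the way the PR does via `to_str`. The mock records calls instead of delegating to a JVM writer, so none of it is Spark's actual implementation:

```python
class MockWriterV2:
    """Toy stand-in for a fluent write-configuration builder."""

    def __init__(self, table):
        self.table = table
        self.calls = []  # recorded instead of forwarded to a JVM writer

    def using(self, provider):
        self.calls.append(("using", provider))
        return self  # returning self is what makes chaining work

    def option(self, key, value):
        # Coerce to string, mirroring the to_str() conversion in the PR.
        self.calls.append(("option", key, str(value)))
        return self

    def partitionedBy(self, col, *cols):
        # In the real API only column names or partition transform
        # expressions are allowed here, not arbitrary Columns.
        self.calls.append(("partitionedBy", (col,) + cols))
        return self

writer = MockWriterV2("catalog.db.table").using("parquet").option("retries", 3)
print(writer.calls)  # [('using', 'parquet'), ('option', 'retries', '3')]
```

The design choice the reviewers are debating is exactly what `partitionedBy` should accept; the builder shape itself is uncontroversial.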
[GitHub] [spark] HyukjinKwon commented on pull request #27331: [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API
HyukjinKwon commented on pull request #27331: URL: https://github.com/apache/spark/pull/27331#issuecomment-649168859 > I don't think that maintenance is a huge issue here. Just saying... That's probably because you're used to the Python side. For people who don't know Python, even reading it is extra overhead. I will merge a bit later, after waiting and monitoring the changes in the DSv2 APIs, since it looks like nobody knows the answer about their stability.
[GitHub] [spark] HyukjinKwon commented on pull request #27849: [SPARK-31081][UI][SQL] Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI
HyukjinKwon commented on pull request #27849: URL: https://github.com/apache/spark/pull/27849#issuecomment-649167755 Closing in favour of #27927
[GitHub] [spark] HyukjinKwon closed pull request #27849: [SPARK-31081][UI][SQL] Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI
HyukjinKwon closed pull request #27849: URL: https://github.com/apache/spark/pull/27849
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649163956 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124504/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649163945 Merged build finished. Test FAILed.
[GitHub] [spark] AmplabJenkins commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins commented on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649163945
[GitHub] [spark] SparkQA removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
SparkQA removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649125587 **[Test build #124504 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124504/testReport)** for PR 28852 at commit [`28da5cf`](https://github.com/apache/spark/commit/28da5cfd39dba6a8319dd1cdfe39e51ed5cbdea5).
[GitHub] [spark] SparkQA commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
SparkQA commented on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649163590 **[Test build #124504 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124504/testReport)** for PR 28852 at commit [`28da5cf`](https://github.com/apache/spark/commit/28da5cfd39dba6a8319dd1cdfe39e51ed5cbdea5). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649162543 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124502/ Test FAILed.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649162539 Build finished. Test FAILed.
[GitHub] [spark] SparkQA removed a comment on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
SparkQA removed a comment on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649123085 **[Test build #124502 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124502/testReport)** for PR 28852 at commit [`d9c5bf7`](https://github.com/apache/spark/commit/d9c5bf7b3481e6ff4025c534bef952717b1275b7).
[GitHub] [spark] AmplabJenkins commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
AmplabJenkins commented on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649162539
[GitHub] [spark] SparkQA commented on pull request #28852: [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
SparkQA commented on pull request #28852: URL: https://github.com/apache/spark/pull/28852#issuecomment-649162330 **[Test build #124502 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124502/testReport)** for PR 28852 at commit [`d9c5bf7`](https://github.com/apache/spark/commit/d9c5bf7b3481e6ff4025c534bef952717b1275b7). * This patch **fails Spark unit tests**. * This patch **does not merge cleanly**. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins removed a comment on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649160040
[GitHub] [spark] AmplabJenkins commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
AmplabJenkins commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649160040
[GitHub] [spark] SparkQA commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
SparkQA commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649159672 **[Test build #124508 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124508/testReport)** for PR 27690 at commit [`0fbeaf3`](https://github.com/apache/spark/commit/0fbeaf374bf35a7d0cde2b3340d9f3c4551cbdb2).
[GitHub] [spark] maropu commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
maropu commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649158798 retest this please
[GitHub] [spark] maropu commented on pull request #27690: [SPARK-21514][SQL] Added a new option to use non-blobstore storage when writing into blobstore storage
maropu commented on pull request #27690: URL: https://github.com/apache/spark/pull/27690#issuecomment-649158887 @HyukjinKwon @viirya no more comments?
[GitHub] [spark] maropu commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445255528 ## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt ## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m UnsafeExternalSorter 11 11 1 14.7 68.0 1.0X ExternalAppendOnlyUnsafeRowArray 9 10 1 17.1 58.5 1.2X - +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic Review comment: > My understanding was to add new benchmark results to the file. I didn't change other results in the files. Do you want me to update all results? We need to check that there's no perf regression on JDK 8/JDK 11 (sometimes we hit perf regressions unexpectedly). But, yeah, we might not need the update if you've already checked locally.
[GitHub] [spark] maropu commented on pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on pull request #27246: URL: https://github.com/apache/spark/pull/27246#issuecomment-649158069 Looks okay to me. Could anyone check this? @cloud-fan @dongjoon-hyun @JoshRosen @jiangxb1987
[GitHub] [spark] maropu commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445256544 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/ExternalAppendOnlyUnsafeRowArrayBenchmark.scala ## @@ -182,6 +182,47 @@ object ExternalAppendOnlyUnsafeRowArrayBenchmark extends BenchmarkBase { } } + def testAgainstUnsafeSorterSpillReader( + numSpillThreshold: Int, + numRows: Int, + numIterators: Int, + iterations: Int): Unit = { +val rows = testRows(numRows) +val benchmark = new Benchmark(s"Spilling SpillReader with $numRows rows", iterations * numRows, + output = output) + +benchmark.addCase("UnsafeSorterSpillReader_bufferSize1024") { _: Int => + val array = UnsafeExternalSorter.create( +TaskContext.get().taskMemoryManager(), +SparkEnv.get.blockManager, +SparkEnv.get.serializerManager, +TaskContext.get(), +null, +null, +1024, +SparkEnv.get.memoryManager.pageSizeBytes, +numSpillThreshold, +false) + + rows.foreach(x => +array.insertRecord( + x.getBaseObject, + x.getBaseOffset, + x.getSizeInBytes, + 0, + false)) + + for (_ <- 0L until numIterators) { Review comment: Could you leave some comments about that here?
[GitHub] [spark] maropu commented on a change in pull request #27246: [SPARK-30536][CORE][SQL] Sort-merge join operator spilling performance improvements
maropu commented on a change in pull request #27246: URL: https://github.com/apache/spark/pull/27246#discussion_r445256446 ## File path: sql/core/benchmarks/ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt ## @@ -42,4 +42,8 @@ Spilling with 1 rows: Best Time(ms) Avg Time(ms) Stdev(m UnsafeExternalSorter 11 11 1 14.7 68.0 1.0X ExternalAppendOnlyUnsafeRowArray 9 10 1 17.1 58.5 1.2X - +OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~16.04-b09 on Linux 4.4.0-178-generic +Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz +Spilling SpillReader with 16000 rows: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative + +UnsafeSorterSpillReader_bufferSize1024 411 426 13 0.6 1607.2 1.0X Review comment: What's the number without this PR?
[GitHub] [spark] frankyin-factual commented on a change in pull request #28898: [SPARK-32059][SQL] Allow schema pruning thru window functions
frankyin-factual commented on a change in pull request #28898: URL: https://github.com/apache/spark/pull/28898#discussion_r445255141 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala ## @@ -113,6 +113,11 @@ object NestedColumnAliasing { case _: Sample => true case _: RepartitionByExpression => true case _: Join => true +case x: Filter => x.child match { + case _: Window => true Review comment: Looks like the plan is a `Project -> Filter -> Window`. If we only do `case _: Window => true`, the projection aliasing won't be available at the `Window` stage, and can't be passed on to the later stages described in the ticket.
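As background for the pruning discussion above, here is a minimal sketch of the idea behind nested-column (schema) pruning: keep only the leaf fields the plan actually references, so the reader can skip the rest of a deeply nested struct. Spark's `NestedColumnAliasing` operates on Catalyst expressions; this toy version assumes a dict-based schema and dotted field paths purely for illustration:

```python
def prune_schema(schema, accessed_paths):
    """Keep only the fields named by dotted paths like 'a.b.c'."""
    pruned = {}
    for path in accessed_paths:
        parts = path.split(".")
        src, dst = schema, pruned
        # Walk down the nested structs, creating matching levels in the
        # pruned schema as we go.
        for p in parts[:-1]:
            src = src[p]
            dst = dst.setdefault(p, {})
        dst[parts[-1]] = src[parts[-1]]
    return pruned

schema = {
    "name": {"first": "string", "last": "string"},
    "address": {"city": "string", "zip": "string"},
}
print(prune_schema(schema, ["name.first"]))  # {'name': {'first': 'string'}}
```

The PR's question is about *where* such pruning may safely apply: a `Filter` over a `Window` must pass the aliases through, or the pruned projection never reaches the scan.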
[GitHub] [spark] maropu commented on pull request #28923: [SPARK-32090][SQL] UserDefinedType.equal() should be symmetrical
maropu commented on pull request #28923: URL: https://github.com/apache/spark/pull/28923#issuecomment-649155603 LGTM except for @cloud-fan's comment.
[GitHub] [spark] maropu commented on a change in pull request #28923: [SPARK-32090][SQL] UserDefinedType.equal() should be symmetrical
maropu commented on a change in pull request #28923: URL: https://github.com/apache/spark/pull/28923#discussion_r445254621 ## File path: sql/core/src/test/scala/org/apache/spark/sql/UserDefinedTypeSuite.scala ## @@ -134,6 +134,17 @@ class UserDefinedTypeSuite extends QueryTest with SharedSparkSession with Parque MyLabeledPoint(1.0, new TestUDT.MyDenseVector(Array(0.1, 1.0))), MyLabeledPoint(0.0, new TestUDT.MyDenseVector(Array(0.3, 3.0.toDF() + + test("equal") { +val udt1 = new ExampleBaseTypeUDT +val udt2 = new ExampleSubTypeUDT +val udt3 = new ExampleSubTypeUDT +assert(!(udt1 === udt2)) Review comment: nit: `!==`?