[GitHub] [spark] yaooqinn commented on pull request #31961: [SPARK-34868][SQL] Support divide an year-month interval by a numeric

2021-03-25 Thread GitBox


yaooqinn commented on pull request #31961:
URL: https://github.com/apache/spark/pull/31961#issuecomment-807959058


   late lgtm.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan closed pull request #31961: [SPARK-34868][SQL] Support divide an year-month interval by a numeric

2021-03-25 Thread GitBox


cloud-fan closed pull request #31961:
URL: https://github.com/apache/spark/pull/31961


   





[GitHub] [spark] cloud-fan commented on pull request #31961: [SPARK-34868][SQL] Support divide an year-month interval by a numeric

2021-03-25 Thread GitBox


cloud-fan commented on pull request #31961:
URL: https://github.com/apache/spark/pull/31961#issuecomment-807958691


   thanks, merging to master!





[GitHub] [spark] cloud-fan commented on a change in pull request #31961: [SPARK-34868][SQL] Support divide an year-month interval by a numeric

2021-03-25 Thread GitBox


cloud-fan commented on a change in pull request #31961:
URL: https://github.com/apache/spark/pull/31961#discussion_r602033028



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/intervalExpressions.scala
##
@@ -341,3 +341,53 @@ case class MultiplyDTInterval(
 
   override def toString: String = s"($left * $right)"
 }
+
+// Divide an year-month interval by a numeric
+case class DivideYMInterval(
+    interval: Expression,
+    num: Expression)
+  extends BinaryExpression with ImplicitCastInputTypes with NullIntolerant with Serializable {
+  override def left: Expression = interval
+  override def right: Expression = num
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(YearMonthIntervalType, NumericType)
+  override def dataType: DataType = YearMonthIntervalType
+
+  @transient
+  private lazy val evalFunc: (Int, Any) => Any = right.dataType match {
+    case LongType => (months: Int, num) =>
+      LongMath.divide(months, num.asInstanceOf[Long], RoundingMode.HALF_UP).toInt

Review comment:
   It's better to put it as a code comment. We can fix it in the next PR.
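
The quoted hunk divides the interval's month count by a `Long` with `RoundingMode.HALF_UP`. A minimal sketch of that rounding behavior follows, under stated assumptions: `DivideYMIntervalSketch` and `divideMonths` are hypothetical names, and the sketch uses the JDK's `java.math.BigDecimal` instead of the Guava `LongMath.divide` call from the diff so that it runs without extra dependencies. It is not the PR's implementation.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

object DivideYMIntervalSketch {
  // Hypothetical helper (not from the PR): divides a year-month interval,
  // represented as a number of months, by a long divisor.
  // HALF_UP rounds to the nearest integer; ties round away from zero.
  // Throws ArithmeticException on division by zero or Int overflow.
  def divideMonths(months: Int, divisor: Long): Int = {
    val quotient = new JBigDecimal(months)
      .divide(new JBigDecimal(divisor), 0, RoundingMode.HALF_UP)
    quotient.intValueExact()
  }
}
```

Under these assumptions, `divideMonths(14, 4L)` yields 4 (3.5 is a tie and rounds away from zero), while `divideMonths(13, 4L)` yields 3.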







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602032483



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -307,6 +307,17 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val DYNAMIC_PARTITON_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO =
+    buildConf("spark.sql.optimizer.dynamicPartitionPruning.pruningSideExtraFilterRatio")
+    .internal()

Review comment:
   Indentation: two more spaces?







[GitHub] [spark] SparkQA removed a comment on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA removed a comment on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807884538


   **[Test build #136543 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136543/testReport)**
 for PR 31804 at commit 
[`bee8cbe`](https://github.com/apache/spark/commit/bee8cbee33ceeca81a25d746a63720eb7fe78cd7).





[GitHub] [spark] WangGuangxin commented on pull request #31967: [SPAKR-34819][SQL]MapType supports orderable semantics

2021-03-25 Thread GitBox


WangGuangxin commented on pull request #31967:
URL: https://github.com/apache/spark/pull/31967#issuecomment-807957409


   @hvanhovell  @cloud-fan @maropu Could you please help review this?





[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


dongjoon-hyun commented on a change in pull request #29726:
URL: https://github.com/apache/spark/pull/29726#discussion_r602032154



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##
@@ -307,6 +307,17 @@ object SQLConf {
   .booleanConf
   .createWithDefault(true)
 
+  val DYNAMIC_PARTITON_PRUNING_PRUNING_SIDE_EXTRA_FILTER_RATIO =

Review comment:
   `PARTITON` -> `PARTITION`?







[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807957257


   **[Test build #136543 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136543/testReport)**
 for PR 31804 at commit 
[`bee8cbe`](https://github.com/apache/spark/commit/bee8cbee33ceeca81a25d746a63720eb7fe78cd7).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] c21 commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


c21 commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807956027


   The unit test failed with MiMa tests
   
   > abstract method getBoolean(Int)Boolean in class 
org.apache.spark.sql.vectorized.ColumnVector does not have a correspondent in 
current version
   
   However, this PR does not change the class 
`org.apache.spark.sql.vectorized.ColumnVector` at all. I am still checking why 
this test is failing.





[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807954099


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41127/
   





[GitHub] [spark] WangGuangxin opened a new pull request #31967: [SPAKR-34819][SQL]MapType supports orderable semantics

2021-03-25 Thread GitBox


WangGuangxin opened a new pull request #31967:
URL: https://github.com/apache/spark/pull/31967


   ### What changes were proposed in this pull request?
   Currently, MapType does not support orderable semantics, although Hive and 
Presto do. This makes it hard to migrate from Hive to Spark SQL if users group 
by or order by map-typed columns in their SQL.
   
   
   ### Why are the changes needed?
   Generally, we compare two maps by the following steps:
   1. If the sizes of the two maps are not equal, compare them by size.
   2. Otherwise, sort the entries of each map by key, then compare the entries 
pairwise, first by key, then by value.
   
   We have to handle this specially in grouping/join/window because Spark SQL 
turns grouping/join/window partition keys into binary `UnsafeRow` and compares 
the binary data directly instead of using MapType's ordering. In this case, we 
have to insert a `SortMapKey` expression to sort the map entries by key. This 
is very similar to `NormalizeFloatingNumbers`.
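
The two comparison steps above can be sketched on plain Scala maps. This is only an illustration of the described ordering, not the PR's code: `MapOrderingSketch` and `compareMaps` are hypothetical names, and the sketch works on `scala.collection.immutable.Map` rather than Spark's internal map representation.

```scala
object MapOrderingSketch {
  // Hypothetical helper: returns a negative, zero, or positive Int,
  // following the two steps described in the PR description.
  def compareMaps[K: Ordering, V: Ordering](a: Map[K, V], b: Map[K, V]): Int = {
    if (a.size != b.size) {
      // Step 1: maps of different sizes are ordered by size.
      Integer.compare(a.size, b.size)
    } else {
      val keyOrd = implicitly[Ordering[K]]
      val valOrd = implicitly[Ordering[V]]
      // Step 2: sort the entries of each map by key ...
      val left = a.toSeq.sortBy(_._1)
      val right = b.toSeq.sortBy(_._1)
      // ... then compare entries pairwise, by key first and then by value.
      left.zip(right).iterator.map { case ((k1, v1), (k2, v2)) =>
        val c = keyOrd.compare(k1, k2)
        if (c != 0) c else valOrd.compare(v1, v2)
      }.find(_ != 0).getOrElse(0)
    }
  }
}
```

Because entries are sorted by key before comparison, two maps with the same entries compare as equal regardless of insertion order. A direct comparison of the binary rows cannot guarantee that unless the keys are sorted first, which is the role the description gives to `SortMapKey`.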
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Add more UTs





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807950640


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136544/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807950640


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136544/
   





[GitHub] [spark] SparkQA removed a comment on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA removed a comment on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807910778


   **[Test build #136544 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136544/testReport)**
 for PR 31804 at commit 
[`eaba257`](https://github.com/apache/spark/commit/eaba25705b82f5462246c8b6fbb89321f7e43ee2).





[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807950460


   **[Test build #136544 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136544/testReport)**
 for PR 31804 at commit 
[`eaba257`](https://github.com/apache/spark/commit/eaba25705b82f5462246c8b6fbb89321f7e43ee2).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31962: [SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods

2021-03-25 Thread GitBox


dongjoon-hyun commented on a change in pull request #31962:
URL: https://github.com/apache/spark/pull/31962#discussion_r602026721



##
File path: 
resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
##
@@ -74,6 +75,9 @@ class KubernetesSuite extends SparkFunSuite
 
   protected override def logForFailedTest(): Unit = {
     logInfo("\n\n= EXTRA LOGS FOR THE FAILED TEST\n")
+    logInfo("BEGIN driver DESCRIBE POD\n" +
+      Minikube.describePods(s"spark-app-locator=$appLocator,spark-role=driver").mkString("\n"))

Review comment:
   +1 for the idea.







[GitHub] [spark] xuanyuanking commented on pull request #31963: [SPARK-34871][SS] Move the checkpoint location resolving into the rule ResolveWriteToStream

2021-03-25 Thread GitBox


xuanyuanking commented on pull request #31963:
URL: https://github.com/apache/spark/pull/31963#issuecomment-807949559


   Wow, thanks all for the quick response :) Thank you





[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31962: [SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods

2021-03-25 Thread GitBox


dongjoon-hyun commented on a change in pull request #31962:
URL: https://github.com/apache/spark/pull/31962#discussion_r602026102



##
File path: 
resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/backend/minikube/Minikube.scala
##
@@ -126,17 +126,21 @@ private[spark] object Minikube extends Logging {
 }
   }
 
-  def executeMinikube(action: String, args: String*): Seq[String] = {
+  def describePods(labels: String): Seq[String] =
+    Minikube.executeMinikube(false, "kubectl", "--", "describe", "pods", "--all-namespaces",
+      "-l", labels)

Review comment:
   If you don't mind, could you move this function to the end? For example, 
after `minikubeServiceAction`?
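
For context, the `describePods` helper in the hunk passes `kubectl -- describe pods --all-namespaces -l <labels>` to Minikube. A standalone sketch of the same idea is below. It assumes the helper ends up invoking a `minikube` binary found on the `PATH`; `DescribePodsSketch` and `describePodsCommand` are hypothetical names and not part of the PR.

```scala
import scala.sys.process._

object DescribePodsSketch {
  // Hypothetical helper: builds the argument list for describing pods that
  // match a label selector. Assumes `minikube` is resolvable on the PATH.
  def describePodsCommand(labels: String): Seq[String] =
    Seq("minikube", "kubectl", "--", "describe", "pods", "--all-namespaces", "-l", labels)

  // Runs the command and returns its output split into lines.
  // Throws if the command cannot be started or exits with a non-zero code.
  def describePods(labels: String): Seq[String] =
    describePodsCommand(labels).!!.split("\n").toSeq
}
```

Keeping the command construction separate from its execution makes the argument list easy to check without a running cluster.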







[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807948263


   **[Test build #136545 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136545/testReport)**
 for PR 31804 at commit 
[`f50d87b`](https://github.com/apache/spark/commit/f50d87b58e3b01b5f9451ce523080b7f58e0d6e3).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807944123


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136539/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807944124


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41128/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31965: [SPARK-34843][SQL] Calculate more precise partition stride in JDBCRelation

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31965:
URL: https://github.com/apache/spark/pull/31965#issuecomment-807944126


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136537/
   





[GitHub] [spark] dongjoon-hyun edited a comment on pull request #31962: [SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods outp

2021-03-25 Thread GitBox


dongjoon-hyun edited a comment on pull request #31962:
URL: https://github.com/apache/spark/pull/31962#issuecomment-807944893


   Thank you for pinging me, @attilapiros. I agree with your analysis of the 
AS-IS Jenkins failure. Apparently, Amplab Jenkins still seems to have a setup 
issue. FYI, I have a personal downstream Jenkins machine dedicated to running 
the K8s integration tests for all Apache branches (master/3.1/3.0/2.4). I 
usually keep them up to date. Currently, it runs Minikube 1.18.1 and K8s 
1.20.2. The tests have not failed in the last 7 days on any branch.





[GitHub] [spark] dongjoon-hyun edited a comment on pull request #31962: [SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods outp

2021-03-25 Thread GitBox


dongjoon-hyun edited a comment on pull request #31962:
URL: https://github.com/apache/spark/pull/31962#issuecomment-807944893


   Thank you for pinging me, @attilapiros. I agree with your analysis of the 
AS-IS Jenkins failure. Apparently, Amplab Jenkins still seems to have a setup 
issue. I have a personal downstream Jenkins machine dedicated to running the 
K8s integration tests for all Apache branches (master/3.1/3.0/2.4). I usually 
keep them up to date. Currently, it runs Minikube 1.18.1 and K8s 1.20.2. The 
tests have not failed in the last 7 days on any branch.





[GitHub] [spark] dongjoon-hyun commented on pull request #31962: [SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods output

2021-03-25 Thread GitBox


dongjoon-hyun commented on pull request #31962:
URL: https://github.com/apache/spark/pull/31962#issuecomment-807944893


   Thank you for pinging me, @attilapiros. I agree with you. Apparently, Amplab 
Jenkins still seems to have a setup issue. I have a personal downstream Jenkins 
machine dedicated to running the K8s integration tests for all Apache branches 
(master/3.1/3.0/2.4). I usually keep them up to date. Currently, it runs 
Minikube 1.18.1 and K8s 1.20.2. The tests have not failed in the last 7 days on 
any branch.





[GitHub] [spark] AmplabJenkins commented on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807944123


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136539/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807944124


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41128/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31965: [SPARK-34843][SQL] Calculate more precise partition stride in JDBCRelation

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31965:
URL: https://github.com/apache/spark/pull/31965#issuecomment-807944126


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136537/
   





[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807937788


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41128/
   





[GitHub] [spark] JkSelf commented on pull request #31941: [SPARK-34637][SQL] Improve the performance of AQE and DPP through logical optimization.

2021-03-25 Thread GitBox


JkSelf commented on pull request #31941:
URL: https://github.com/apache/spark/pull/31941#issuecomment-807937198


   > I think it is better to make use of the AQE framework to reuse the 
broadcast exchange or newQueryStage.
   
   @cloud-fan 
   I may need to explain a little bit more about this.
   
   1. In my understanding, the `PlanDynamicPruningFilters` rule simply checks 
whether there is an exchange that can be reused, in order to decide whether to 
insert DPP. The actual reuse happens in the `ReuseExchange` rule. I think this 
separation is clearer.
   2. When AQE is enabled, the `ReuseExchange` logic is implemented in the AQE 
framework. When an exchange is created, we look in the `stageCache` for an 
exchange that can be reused and, if there is one, we reuse it.
   3. In the `PlanAdaptiveDynamicPruningFilters` rule, I am more inclined to 
follow the idea of the `PlanDynamicPruningFilters` rule: only add the DPP 
filter after checking whether there is an exchange that can be reused. The 
actual reuse is left to the AQE framework, instead of looking in the 
`stageCache` to create the reused exchange or calling the `newQueryStage` 
method to create a new query stage inside the 
`PlanAdaptiveDynamicPruningFilters` rule. Of course, we did this in 
[PR#31258](https://github.com/apache/spark/pull/31258). But I think we may need 
to make some improvements in subsequent implementations.
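
The separation argued for in the three points above (the planning rule only decides, the framework performs the reuse) can be modeled with a toy sketch. Everything here is hypothetical and heavily simplified: `StageReuseSketch`, `Stage`, `StageCache`, and `shouldInsertPruningFilter` are illustrative names, and plans are reduced to string keys. It is not Spark's AQE code.

```scala
import scala.collection.mutable

object StageReuseSketch {
  // Hypothetical stand-in for a materialized exchange / query stage.
  final case class Stage(id: Int, planKey: String)

  // Toy model of a framework-owned stage cache keyed by a canonical plan key.
  final class StageCache {
    private val cache = mutable.Map.empty[String, Stage]
    private var nextId = 0

    // Read-only check: is there an equivalent stage that could be reused?
    def canReuse(planKey: String): Boolean = cache.contains(planKey)

    // Framework side: reuse the cached stage or create and register a new one.
    def getOrCreate(planKey: String): Stage =
      cache.getOrElseUpdate(planKey, { nextId += 1; Stage(nextId, planKey) })
  }

  // Planning-rule side: only decides whether a pruning filter is worthwhile;
  // it never creates or registers stages itself.
  def shouldInsertPruningFilter(cache: StageCache, buildSideKey: String): Boolean =
    cache.canReuse(buildSideKey)
}
```

In this model the rule stays free of side effects, and all stage creation goes through a single place, `getOrCreate`.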





[GitHub] [spark] SparkQA removed a comment on pull request #31965: [SPARK-34843][SQL] Calculate more precise partition stride in JDBCRelation

2021-03-25 Thread GitBox


SparkQA removed a comment on pull request #31965:
URL: https://github.com/apache/spark/pull/31965#issuecomment-807804551


   **[Test build #136537 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136537/testReport)**
 for PR 31965 at commit 
[`c8d256d`](https://github.com/apache/spark/commit/c8d256d460ee55c424e5e02a55903850e15c003b).





[GitHub] [spark] SparkQA commented on pull request #31965: [SPARK-34843][SQL] Calculate more precise partition stride in JDBCRelation

2021-03-25 Thread GitBox


SparkQA commented on pull request #31965:
URL: https://github.com/apache/spark/pull/31965#issuecomment-807934643


   **[Test build #136537 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136537/testReport)**
 for PR 31965 at commit 
[`c8d256d`](https://github.com/apache/spark/commit/c8d256d460ee55c424e5e02a55903850e15c003b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] SparkQA removed a comment on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


SparkQA removed a comment on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807860345


   **[Test build #136539 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136539/testReport)**
 for PR 31649 at commit 
[`8b58e29`](https://github.com/apache/spark/commit/8b58e29adcbad1bf7cdeb416cd1e22928d49c892).





[GitHub] [spark] cloud-fan closed pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


cloud-fan closed pull request #29726:
URL: https://github.com/apache/spark/pull/29726


   





[GitHub] [spark] SparkQA commented on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


SparkQA commented on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807932308


   **[Test build #136539 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136539/testReport)**
 for PR 31649 at commit 
[`8b58e29`](https://github.com/apache/spark/commit/8b58e29adcbad1bf7cdeb416cd1e22928d49c892).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] cloud-fan commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


cloud-fan commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807932345


   thanks, merging to master!





[GitHub] [spark] baohe-zhang commented on a change in pull request #31945: [SPARK-34845][CORE] ProcfsMetricsGetter shouldn't return partial procfs metrics

2021-03-25 Thread GitBox


baohe-zhang commented on a change in pull request #31945:
URL: https://github.com/apache/spark/pull/31945#discussion_r602012026



##
File path: 
core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala
##
@@ -199,7 +201,7 @@ private[spark] class ProcfsMetricsGetter(procfsDir: String = "/proc/") extends L
   case f: IOException =>
 logWarning("There was a problem with reading" +
   " the stat file of the process. ", f)
-ProcfsMetrics(0, 0, 0, 0, 0, 0)
+throw f

Review comment:
   Yes, that is correct.







[GitHub] [spark] srowen commented on a change in pull request #31942: [SPARK-34834][NETWORK] Fix a potential Netty memory leak in TransportResponseHandler.

2021-03-25 Thread GitBox


srowen commented on a change in pull request #31942:
URL: https://github.com/apache/spark/pull/31942#discussion_r602011411



##
File path: 
common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java
##
@@ -188,6 +188,7 @@ public void handle(ResponseMessage message) throws Exception {
   if (listener == null) {
 logger.warn("Ignoring response for RPC {} from {} ({} bytes) since it is not outstanding",
   resp.requestId, getRemoteAddress(channel), resp.body().size());
+resp.body().release();

Review comment:
   Evidently, yeah. It may just not cause much of an issue. Does anyone know 
of any reasons to release() it?
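
For context on the question, a toy sketch of reference-counted ownership. It is not Spark's or Netty's code: `CountedBuffer` and `ResponseRouter` are hypothetical, and only loosely follow the Netty convention that `release()` reports whether the count reached zero. The point it illustrates is that a body nobody consumes keeps its reference until the handler drops it.

```scala
import scala.collection.mutable

// Toy reference-counted buffer (hypothetical, not a Netty or Spark class).
final class CountedBuffer(val size: Int) {
  private var refs = 1
  def refCnt: Int = refs

  // Returns true when the count reaches zero and the memory could be reclaimed.
  def release(): Boolean = {
    require(refs > 0, "buffer already released")
    refs -= 1
    refs == 0
  }
}

// Toy response handler keyed by request id.
final class ResponseRouter {
  private val listeners = mutable.Map.empty[Long, CountedBuffer => Unit]

  def register(requestId: Long)(callback: CountedBuffer => Unit): Unit =
    listeners.update(requestId, callback)

  // Returns true if a listener took ownership of the body.
  def handle(requestId: Long, body: CountedBuffer): Boolean =
    listeners.remove(requestId) match {
      case Some(callback) =>
        callback(body) // ownership passes to the listener
        true
      case None =>
        body.release() // nobody will consume it, so drop our reference
        false
    }
}
```

In this model, skipping the `release()` in the `None` branch would leave the orphaned buffer with a count of 1 indefinitely, which is the kind of leak the PR title describes.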







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31945: [SPARK-34845][CORE] ProcfsMetricsGetter shouldn't return partial procfs metrics

2021-03-25 Thread GitBox


dongjoon-hyun commented on a change in pull request #31945:
URL: https://github.com/apache/spark/pull/31945#discussion_r602010869



##
File path: 
core/src/main/scala/org/apache/spark/executor/ProcfsMetricsGetter.scala
##
@@ -199,7 +201,7 @@ private[spark] class ProcfsMetricsGetter(procfsDir: String = "/proc/") extends L
   case f: IOException =>
 logWarning("There was a problem with reading" +
   " the stat file of the process. ", f)
-ProcfsMetrics(0, 0, 0, 0, 0, 0)
+throw f

Review comment:
   I meant that we escape from the `for (p <- pids) {}` loop at the first 
exception.
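
A small sketch of the behaviour being discussed, under stated assumptions: `Metrics`, `AllOrNothing` and `readOne` are hypothetical stand-ins (the real `ProcfsMetrics` has six fields), and the choice to return zeros from the outer `catch` is an illustration, not something taken from the PR. It shows that a `throw` inside the loop body ends the iteration immediately, so no partial sum escapes.

```scala
import java.io.IOException

// Hypothetical simplified metrics record.
final case class Metrics(vmem: Long, rss: Long) {
  def +(other: Metrics): Metrics = Metrics(vmem + other.vmem, rss + other.rss)
}

object AllOrNothing {
  val Zero: Metrics = Metrics(0L, 0L)

  // `readOne` stands in for reading one process's stat file and may throw.
  def total(pids: Seq[Int], readOne: Int => Metrics): Metrics = {
    try {
      var sum = Zero
      for (p <- pids) {
        // A throw here leaves the loop immediately; later pids are not read.
        sum = sum + readOne(p)
      }
      sum
    } catch {
      // Report nothing rather than a partial sum over the pids read so far.
      case _: IOException => Zero
    }
  }
}
```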







[GitHub] [spark] dongjoon-hyun commented on a change in pull request #31942: [SPARK-34834][NETWORK] Fix a potential Netty memory leak in TransportResponseHandler.

2021-03-25 Thread GitBox


dongjoon-hyun commented on a change in pull request #31942:
URL: https://github.com/apache/spark/pull/31942#discussion_r602010302



##
File path: 
common/network-common/src/main/java/org/apache/spark/network/client/TransportResponseHandler.java
##
@@ -188,6 +188,7 @@ public void handle(ResponseMessage message) throws Exception {
   if (listener == null) {
 logger.warn("Ignoring response for RPC {} from {} ({} bytes) since it is not outstanding",
   resp.requestId, getRemoteAddress(channel), resp.body().size());
+resp.body().release();

Review comment:
   From the commit history, has this been missing since Apache Spark 1.2.0 
(SPARK-3453)?







[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807927940


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41128/
   





[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807925937


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41127/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31966:
URL: https://github.com/apache/spark/pull/31966#issuecomment-807924903


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41125/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31966:
URL: https://github.com/apache/spark/pull/31966#issuecomment-807924903


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41125/
   





[GitHub] [spark] SparkQA commented on pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output

2021-03-25 Thread GitBox


SparkQA commented on pull request #31966:
URL: https://github.com/apache/spark/pull/31966#issuecomment-807913303


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41125/
   





[GitHub] [spark] mridulm commented on a change in pull request #30480: [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle

2021-03-25 Thread GitBox


mridulm commented on a change in pull request #30480:
URL: https://github.com/apache/spark/pull/30480#discussion_r601996217



##
File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala
##
@@ -987,18 +1277,51 @@ private[spark] object MapOutputTracker extends Logging {
   shuffleId: Int,
   startPartition: Int,
   endPartition: Int,
-  statuses: Array[MapStatus],
+  mapStatuses: Array[MapStatus],
   startMapIndex : Int,
-  endMapIndex: Int): Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])] = {
-assert (statuses != null)
+  endMapIndex: Int,
+  mergeStatuses: Option[Array[MergeStatus]] = None):
+  Iterator[(BlockManagerId, Seq[(BlockId, Long, Int)])] = {
+assert (mapStatuses != null)
 val splitsByAddress = new HashMap[BlockManagerId, ListBuffer[(BlockId, Long, Int)]]
-val iter = statuses.iterator.zipWithIndex
-for ((status, mapIndex) <- iter.slice(startMapIndex, endMapIndex)) {
-  if (status == null) {
-val errorMessage = s"Missing an output location for shuffle $shuffleId"
-logError(errorMessage)
-throw new MetadataFetchFailedException(shuffleId, startPartition, errorMessage)
-  } else {
+// Only use MergeStatus for reduce tasks that fetch all map outputs. Since a merged shuffle
+// partition consists of blocks merged in random order, we are unable to serve map index
+// subrange requests. However, when a reduce task needs to fetch blocks from a subrange of
+// map outputs, it usually indicates skewed partitions which push-based shuffle delegates
+// to AQE to handle.

Review comment:
   There are a couple of things here:
   
   a) Can we leverage the existing skew algorithm? My understanding is that we 
can, though it might not be as optimal as the current reads for push-based 
shuffle.
   What I mean is: if reducer r1.1 is processing m1-m100 and r1.2 is processing 
m101-m200 for reducer partition 1, we can ensure that m1-m100 is satisfied 
with bin packing to get a better read than reading from 100 mappers/ESS, right? 
It is not as optimal as reading m1-m200, but it will be better than the 
alternative.
   
   b) Alternative ways to split mapper input for a reducer, which is what you 
described; this can be an option as Spark evolves.
   
   Given both of these, we would want to make the comment a TODO with a JIRA 
for it, which can be addressed in some subsequent work.
   
   Or are there concerns with (a) or (b) that I am missing?
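
To illustrate the kind of map index subrange split mentioned in (a), a minimal sketch. `MapRangeSplit` is a hypothetical helper, not part of Spark, and it uses 0-based half-open ranges, so the m1-m100 / m101-m200 example becomes (0, 100) and (100, 200). It only shows how one reducer partition's map outputs could be divided among sub-reducers; it does not model bin packing or merged blocks.

```scala
object MapRangeSplit {
  // Splits map indices [0, numMaps) into `parts` contiguous half-open ranges
  // whose sizes differ by at most one.
  def split(numMaps: Int, parts: Int): Seq[(Int, Int)] = {
    require(numMaps >= 0 && parts > 0, "numMaps must be >= 0 and parts > 0")
    val base = numMaps / parts
    val extra = numMaps % parts
    var start = 0
    (0 until parts).map { i =>
      // The first `extra` ranges take one additional map index each.
      val len = base + (if (i < extra) 1 else 0)
      val range = (start, start + len)
      start += len
      range
    }
  }
}
```

Each resulting pair would correspond to a `startMapIndex` / `endMapIndex` request of the sort the diff comment says merged shuffle partitions cannot currently serve.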







[GitHub] [spark] SparkQA commented on pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output

2021-03-25 Thread GitBox


SparkQA commented on pull request #31966:
URL: https://github.com/apache/spark/pull/31966#issuecomment-807910910


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41125/
   





[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807910778


   **[Test build #136544 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136544/testReport)**
 for PR 31804 at commit 
[`eaba257`](https://github.com/apache/spark/commit/eaba25705b82f5462246c8b6fbb89321f7e43ee2).





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31886:
URL: https://github.com/apache/spark/pull/31886#issuecomment-807907171


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41126/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807907173


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41124/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807907174


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41122/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807907172


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41123/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31886:
URL: https://github.com/apache/spark/pull/31886#issuecomment-807907171


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41126/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807907174


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41122/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807907172


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41123/
   





[GitHub] [spark] AmplabJenkins commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807907173


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41124/
   





[GitHub] [spark] SparkQA commented on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


SparkQA commented on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807903371


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41123/
   





[GitHub] [spark] SparkQA commented on pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


SparkQA commented on pull request #31886:
URL: https://github.com/apache/spark/pull/31886#issuecomment-807901200


   Kubernetes integration test unable to build dist.
   
   exiting with code: 1
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41126/
   





[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807901054


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41124/
   





[GitHub] [spark] SparkQA commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


SparkQA commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807898013


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41122/
   





[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807897409


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41124/
   





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31886:
URL: https://github.com/apache/spark/pull/31886#issuecomment-807892337


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136542/
   





[GitHub] [spark] SparkQA removed a comment on pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


SparkQA removed a comment on pull request #31886:
URL: https://github.com/apache/spark/pull/31886#issuecomment-807884503


   **[Test build #136542 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136542/testReport)**
 for PR 31886 at commit 
[`fc16a02`](https://github.com/apache/spark/commit/fc16a027edf7eb82d942518b3896df4f51ab).





[GitHub] [spark] AmplabJenkins commented on pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31886:
URL: https://github.com/apache/spark/pull/31886#issuecomment-807892337


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136542/
   





[GitHub] [spark] SparkQA commented on pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


SparkQA commented on pull request #31886:
URL: https://github.com/apache/spark/pull/31886#issuecomment-807892293


   **[Test build #136542 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136542/testReport)**
 for PR 31886 at commit 
[`fc16a02`](https://github.com/apache/spark/commit/fc16a027edf7eb82d942518b3896df4f51ab).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
 * `  protected case class QueryOutput(sql: String, schema: String, output: 
String) `





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807884024


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136538/
   





[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30957: [SPARK-31937][SQL] Support processing ArrayType/MapType/StructType data using no-serde mode script transform

2021-03-25 Thread GitBox


AngersZhuuuu commented on a change in pull request #30957:
URL: https://github.com/apache/spark/pull/30957#discussion_r601970875



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
##
@@ -47,7 +47,13 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
   def ioschema: ScriptTransformationIOSchema
 
   protected lazy val inputExpressionsWithoutSerde: Seq[Expression] = {
-    input.map(Cast(_, StringType).withTimeZone(conf.sessionLocalTimeZone))
+    input.map { in =>
+      in.dataType match {
+        case _: ArrayType | _: MapType | _: StructType =>
+          new StructsToJson(in).withTimeZone(conf.sessionLocalTimeZone)

Review comment:
   > hm. Actually, this feature is not relevant to users using pre-built 
spark (w/ `-Phive` enabled)? IIUC there is no way to use this feature when 
enabling `-Phive`. I'm not sure about how important this feature is, but I 
think we need to document it somewhere if we will merge it.
   
   That's why I implemented this the Hive way first and added support for using 
JSON as a serde. If we implement this with JSON, some existing Hive SQL still 
can't run after upgrading directly to Spark SQL (w/ `-Phive`). How about adding 
a configuration such as 
   `spark.sql.sccriptTransform.useSparkFirst` and documenting it?







[GitHub] [spark] SparkQA commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


SparkQA commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807884538


   **[Test build #136543 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136543/testReport)**
 for PR 31804 at commit 
[`bee8cbe`](https://github.com/apache/spark/commit/bee8cbee33ceeca81a25d746a63720eb7fe78cd7).





[GitHub] [spark] SparkQA commented on pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


SparkQA commented on pull request #31886:
URL: https://github.com/apache/spark/pull/31886#issuecomment-807884503


   **[Test build #136542 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136542/testReport)**
 for PR 31886 at commit 
[`fc16a02`](https://github.com/apache/spark/commit/fc16a027edf7eb82d942518b3896df4f51ab).





[GitHub] [spark] SparkQA commented on pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output

2021-03-25 Thread GitBox


SparkQA commented on pull request #31966:
URL: https://github.com/apache/spark/pull/31966#issuecomment-807884390


   **[Test build #136541 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136541/testReport)**
 for PR 31966 at commit 
[`5221be3`](https://github.com/apache/spark/commit/5221be35710ed149238cd801754790e7442f954b).





[GitHub] [spark] maropu commented on pull request #31747: [SPARK-34607][SQL][2.4] Add `Utils.isMemberClass` to fix a malformed class name error on jdk8u

2021-03-25 Thread GitBox


maropu commented on pull request #31747:
URL: https://github.com/apache/spark/pull/31747#issuecomment-807884230


   Thanks for the reviews, @viirya @cloud-fan !





[GitHub] [spark] viirya closed pull request #31747: [SPARK-34607][SQL][2.4] Add `Utils.isMemberClass` to fix a malformed class name error on jdk8u

2021-03-25 Thread GitBox


viirya closed pull request #31747:
URL: https://github.com/apache/spark/pull/31747


   





[GitHub] [spark] AmplabJenkins commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807884024


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136538/
   





[GitHub] [spark] viirya commented on pull request #31747: [SPARK-34607][SQL][2.4] Add `Utils.isMemberClass` to fix a malformed class name error on jdk8u

2021-03-25 Thread GitBox


viirya commented on pull request #31747:
URL: https://github.com/apache/spark/pull/31747#issuecomment-807883601


   Thanks all. Merging to branch-2.4.





[GitHub] [spark] SparkQA commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


SparkQA commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807882852


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41122/
   





[GitHub] [spark] AngersZhuuuu commented on a change in pull request #30957: [SPARK-31937][SQL] Support processing ArrayType/MapType/StructType data using no-serde mode script transform

2021-03-25 Thread GitBox


AngersZhuuuu commented on a change in pull request #30957:
URL: https://github.com/apache/spark/pull/30957#discussion_r578077387



##
File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala
##
@@ -47,7 +47,13 @@ trait BaseScriptTransformationExec extends UnaryExecNode {
   def ioschema: ScriptTransformationIOSchema
 
   protected lazy val inputExpressionsWithoutSerde: Seq[Expression] = {
-    input.map(Cast(_, StringType).withTimeZone(conf.sessionLocalTimeZone))
+    input.map { in =>
+      in.dataType match {
+        case _: ArrayType | _: MapType | _: StructType =>
+          new StructsToJson(in).withTimeZone(conf.sessionLocalTimeZone)

Review comment:
   Yeah, hmm, have the committers reached a conclusion on this?







[GitHub] [spark] yaooqinn commented on a change in pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


yaooqinn commented on a change in pull request #31804:
URL: https://github.com/apache/spark/pull/31804#discussion_r601967696



##
File path: sql/core/src/test/resources/log4j.properties
##
@@ -22,7 +22,7 @@ log4j.rootLogger=INFO, CA, FA
 log4j.appender.CA=org.apache.log4j.ConsoleAppender
 log4j.appender.CA.layout=org.apache.log4j.PatternLayout
 log4j.appender.CA.layout.ConversionPattern=%d{HH:mm:ss.SSS} %p %c: %m%n
-log4j.appender.CA.Threshold = WARN
+log4j.appender.CA.Threshold = FATAL

Review comment:
   Thanks, I'll send a PR then







[GitHub] [spark] SparkQA commented on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


SparkQA commented on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807880991


   Kubernetes integration test starting
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41123/
   





[GitHub] [spark] viirya opened a new pull request #31966: [SPARK-34638][SQL] Single field nested column prune on generator output

2021-03-25 Thread GitBox


viirya opened a new pull request #31966:
URL: https://github.com/apache/spark/pull/31966


   
   
   ### What changes were proposed in this pull request?
   
   
   This patch proposes an improvement to nested column pruning when the pruning 
target is a generator's output. Previously we disallowed this case. This patch 
allows pruning on it if only a single nested column is accessed 
after `Generate`.
   
   ### Why are the changes needed?
   
   
   This helps to extend the availability of nested column pruning.
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   No
   
   ### How was this patch tested?
   
   
   Unit test
   





[GitHub] [spark] HyukjinKwon commented on pull request #31809: Fix avro decoding based on avro MAGIC_NUMBER presence

2021-03-25 Thread GitBox


HyukjinKwon commented on pull request #31809:
URL: https://github.com/apache/spark/pull/31809#issuecomment-807878875


   cc @gengliangwang FYI





[GitHub] [spark] HyukjinKwon commented on pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


HyukjinKwon commented on pull request #31804:
URL: https://github.com/apache/spark/pull/31804#issuecomment-807878549


   retest this please





[GitHub] [spark] HyukjinKwon commented on a change in pull request #31804: [SPARK-34710][SQL] Add tableType column for SHOW TABLES to distinguish view and tables

2021-03-25 Thread GitBox


HyukjinKwon commented on a change in pull request #31804:
URL: https://github.com/apache/spark/pull/31804#discussion_r601965623



##
File path: sql/core/src/test/resources/log4j.properties
##
@@ -22,7 +22,7 @@ log4j.rootLogger=INFO, CA, FA
 log4j.appender.CA=org.apache.log4j.ConsoleAppender
 log4j.appender.CA.layout=org.apache.log4j.PatternLayout
 log4j.appender.CA.layout.ConversionPattern=%d{HH:mm:ss.SSS} %p %c: %m%n
-log4j.appender.CA.Threshold = WARN
+log4j.appender.CA.Threshold = FATAL

Review comment:
   I think it's okay as long as `unit-tests.log` still contains the logs.







[GitHub] [spark] yaooqinn commented on pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

2021-03-25 Thread GitBox


yaooqinn commented on pull request #31960:
URL: https://github.com/apache/spark/pull/31960#issuecomment-807873422


   Thanks, merged to master





[GitHub] [spark] yaooqinn closed pull request #31960: [SPARK-34786][SQL] Read Parquet unsigned int64 logical type that stored as signed int64 physical type to decimal(20, 0)

2021-03-25 Thread GitBox


yaooqinn closed pull request #31960:
URL: https://github.com/apache/spark/pull/31960


   





[GitHub] [spark] SparkQA removed a comment on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


SparkQA removed a comment on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807860163


   **[Test build #136538 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136538/testReport)**
 for PR 31958 at commit 
[`94df62c`](https://github.com/apache/spark/commit/94df62ce3376d2579ad26c1a84347bb46895d9fc).





[GitHub] [spark] SparkQA commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


SparkQA commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807867798


   **[Test build #136538 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136538/testReport)**
 for PR 31958 at commit 
[`94df62c`](https://github.com/apache/spark/commit/94df62ce3376d2579ad26c1a84347bb46895d9fc).
* This patch **fails MiMa tests**.
* This patch merges cleanly.
* This patch adds no public classes.





[GitHub] [spark] AmplabJenkins removed a comment on pull request #31965: [SPARK-34843][SQL] Calculate more precise partition stride in JDBCRelation

2021-03-25 Thread GitBox


AmplabJenkins removed a comment on pull request #31965:
URL: https://github.com/apache/spark/pull/31965#issuecomment-807863128


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41121/
   





[GitHub] [spark] AmplabJenkins commented on pull request #31965: [SPARK-34843][SQL] Calculate more precise partition stride in JDBCRelation

2021-03-25 Thread GitBox


AmplabJenkins commented on pull request #31965:
URL: https://github.com/apache/spark/pull/31965#issuecomment-807863128


   
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41121/
   





[GitHub] [spark] HyukjinKwon closed pull request #31963: [SPARK-34871][SS] Move the checkpoint location resolving into the rule ResolveWriteToStream

2021-03-25 Thread GitBox


HyukjinKwon closed pull request #31963:
URL: https://github.com/apache/spark/pull/31963


   





[GitHub] [spark] SparkQA commented on pull request #31965: [SPARK-34843][SQL] Calculate more precise partition stride in JDBCRelation

2021-03-25 Thread GitBox


SparkQA commented on pull request #31965:
URL: https://github.com/apache/spark/pull/31965#issuecomment-807863097


   Kubernetes integration test status failure
   URL: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41121/
   





[GitHub] [spark] HyukjinKwon commented on pull request #31963: [SPARK-34871][SS] Move the checkpoint location resolving into the rule ResolveWriteToStream

2021-03-25 Thread GitBox


HyukjinKwon commented on pull request #31963:
URL: https://github.com/apache/spark/pull/31963#issuecomment-807862974


   Merged to master.





[GitHub] [spark] maropu commented on a change in pull request #31886: [WIP][SPARK-34795][SQL][TEST] Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-25 Thread GitBox


maropu commented on a change in pull request #31886:
URL: https://github.com/apache/spark/pull/31886#discussion_r597299503



##
File path: sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestHelper.scala
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import scala.util.control.NonFatal
+
+import org.apache.spark.SparkException
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.execution.HiveResult.hiveResultString
+import org.apache.spark.sql.execution.SQLExecution
+import org.apache.spark.sql.execution.command.{DescribeColumnCommand, DescribeCommandBase}
+import org.apache.spark.sql.types.StructType
+
+trait SQLQueryTestHelper {
+
+  private val notIncludedMsg = "[not included in comparison]"
+  private val clsName = this.getClass.getCanonicalName
+  protected val emptySchema = StructType(Seq.empty).catalogString
+
+  /** A single SQL query's output. */
+  protected case class QueryOutput(sql: String, schema: String, output: String) {
+    override def toString: String = {
+      // We are explicitly not using multi-line string due to stripMargin removing "|" in output.
+      s"-- !query\n" +
+        sql + "\n" +
+        s"-- !query schema\n" +
+        schema + "\n" +
+        s"-- !query output\n" +
+        output
+    }
+  }
+
+  protected def replaceNotIncludedMsg(line: String): String = {
+    line.replaceAll("#\\d+", "#x")
+      .replaceAll(
+        s"Location.*$clsName/",
+        s"Location $notIncludedMsg/{warehouse_dir}/")
+      .replaceAll("Created By.*", s"Created By $notIncludedMsg")
+      .replaceAll("Created Time.*", s"Created Time $notIncludedMsg")
+      .replaceAll("Last Access.*", s"Last Access $notIncludedMsg")
+      .replaceAll("Partition Statistics\t\\d+", s"Partition Statistics\t$notIncludedMsg")
+      .replaceAll("\\*\\(\\d+\\) ", "*") // remove the WholeStageCodegen codegenStageIds
+  }
+
+  /** Executes a query and returns the result as (schema of the output, normalized output). */
+  protected def getNormalizedResult(session: SparkSession, sql: String): (String, Seq[String]) = {
+    // Returns true if the plan is supposed to be sorted.
+    def isSorted(plan: LogicalPlan): Boolean = plan match {
+      case _: DescribeCommandBase
+          | _: DescribeColumnCommand
+          | _: DescribeRelation
+          | _: DescribeColumn => true

Review comment:
   ~I modified the code in `isSorted` because we still need to sort output 
rows even if `plan` has a `Sort` node. For example, most `Sort` nodes order rows 
by only a subset of the output columns, but we always need to sort them by all 
output columns. This change leads to many diffs in the files under 
`sql/core/src/test/resources/sql-tests/results`.~
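
   The point about sorting by all output columns can be illustrated with a 
minimal sketch in plain Scala (no Spark dependency). The object 
`SortNormalization`, the helper `normalize`, and the sample rows are 
hypothetical and not part of this PR. The sketch only shows why ordering by a 
subset of columns can leave ties in a different order between runs, while 
ordering the fully rendered rows cannot.

```scala
// Hypothetical illustration, not code from the PR.
object SortNormalization {

  // Render each row as a tab-separated string and sort the whole strings,
  // i.e. order by every output column rather than by the sort key alone.
  def normalize(rows: Seq[(Int, String)]): Seq[String] =
    rows.map { case (key, value) => s"$key\t$value" }.sorted

  def main(args: Array[String]): Unit = {
    // Two results that are both validly ordered by the first column only;
    // the rows with key 1 tie, so their relative order is not fixed.
    val runA = Seq((1, "b"), (1, "a"), (2, "c"))
    val runB = Seq((1, "a"), (1, "b"), (2, "c"))

    assert(runA.map(_._1) == runB.map(_._1)) // same order on the sort key
    assert(runA != runB)                     // yet the raw outputs differ
    assert(normalize(runA) == normalize(runB)) // full-row sort makes them equal
  }
}
```

   Comparing against golden files is only stable when ties are broken 
deterministically, which is what sorting the fully rendered rows does in this 
sketch.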







[GitHub] [spark] SparkQA commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


SparkQA commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807861127


   **[Test build #136540 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136540/testReport)**
 for PR 29726 at commit 
[`bb579c6`](https://github.com/apache/spark/commit/bb579c655dfc1538bb682b6e03cecda948ca42e6).





[GitHub] [spark] SparkQA commented on pull request #31649: [SPARK-34542][BUILD] Upgrade Parquet to 1.12.0

2021-03-25 Thread GitBox


SparkQA commented on pull request #31649:
URL: https://github.com/apache/spark/pull/31649#issuecomment-807860345


   **[Test build #136539 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136539/testReport)**
 for PR 31649 at commit 
[`8b58e29`](https://github.com/apache/spark/commit/8b58e29adcbad1bf7cdeb416cd1e22928d49c892).





[GitHub] [spark] SparkQA commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


SparkQA commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807860163


   **[Test build #136538 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136538/testReport)**
 for PR 31958 at commit 
[`94df62c`](https://github.com/apache/spark/commit/94df62ce3376d2579ad26c1a84347bb46895d9fc).





[GitHub] [spark] SparkQA commented on pull request #31965: [SPARK-34843][SQL] Calculate more precise partition stride in JDBCRelation

2021-03-25 Thread GitBox


SparkQA commented on pull request #31965:
URL: https://github.com/apache/spark/pull/31965#issuecomment-807858203


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41121/
   





[GitHub] [spark] maropu commented on a change in pull request #31965: [SPARK-34843][SQL] Improve stride calculation to decide partitions in JDBCRelation

2021-03-25 Thread GitBox


maropu commented on a change in pull request #31965:
URL: https://github.com/apache/spark/pull/31965#discussion_r601938799



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
##
@@ -118,13 +119,28 @@ private[sql] object JDBCRelation extends Logging {
   s"Upper bound: ${boundValueToString(upperBound)}.")
 upperBound - lowerBound
   }
-// Overflow and silliness can happen if you subtract then divide.
-// Here we get a little roundoff, but that's (hopefully) OK.
-val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
+
+// Overflow can happen if you subtract then divide. For example:
+// (Long.MaxValue - Long.MinValue) / (numPartitions - 2).
+// Also, using fixed-point decimals here to avoid possible inaccuracy from floating point.
+val strideUpperCalculation = (upperBound / BigDecimal(numPartitions))

Review comment:
   nit: `strideUpperCalculation` => `upperStride`?

##
File path: sql/core/src/test/scala/org/apache/spark/sql/jdbc/JDBCSuite.scala
##
@@ -433,6 +433,61 @@ class JDBCSuite extends QueryTest
 assert(ids(2) === 3)
   }
 
+  test("SPARK-34843: columnPartition should generate a correct stride size") {
+
+val schema = StructType(Seq(
+  StructField("PartitionColumn", DateType, nullable = false)
+))
+
+val numPartitions = 1000
+val partitionConfig = Map(
+  "lowerBound" -> "1930-01-01",
+  "upperBound" -> "2020-12-31",
+  "numPartitions" -> numPartitions.toString,
+  "partitionColumn" -> "PartitionColumn"
+)
+
+val partitions = JDBCRelation.columnPartition(
+  schema,
+  analysis.caseInsensitiveResolution,
+  TimeZone.getDefault.toZoneId.toString,
+  new JDBCOptions(url, "table", partitionConfig)
+)
+
+val lastPredicate = partitions(numPartitions - 1).asInstanceOf[JDBCPartition].whereClause
+assert(lastPredicate == """"PartitionColumn" >= '2020-08-02'""")
+  }
+
+  test("SPARK-34843: columnPartition should realign the first partition for better distribution") {
+

Review comment:
   ditto

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
##
@@ -118,13 +119,28 @@ private[sql] object JDBCRelation extends Logging {
   s"Upper bound: ${boundValueToString(upperBound)}.")
 upperBound - lowerBound
   }
-// Overflow and silliness can happen if you subtract then divide.
-// Here we get a little roundoff, but that's (hopefully) OK.
-val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
+
+// Overflow can happen if you subtract then divide. For example:
+// (Long.MaxValue - Long.MinValue) / (numPartitions - 2).
+// Also, using fixed-point decimals here to avoid possible inaccuracy from floating point.
+val strideUpperCalculation = (upperBound / BigDecimal(numPartitions))
+  .setScale(18, RoundingMode.HALF_EVEN)
+val strideLowerCalculation = (lowerBound / BigDecimal(numPartitions))
+  .setScale(18, RoundingMode.HALF_EVEN)

Review comment:
   Why do we use `scale=18` and `RoundingMode.HALF_EVEN` here?
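For context on the hunk above, here is a minimal sketch that contrasts the removed `Long` computation with the added `BigDecimal` one. The object and method names are hypothetical, and the day-number encoding of the dates (days since 1970-01-01) is an assumption; the bounds and `numPartitions` are taken from the test added in this PR.

```scala
import scala.math.BigDecimal.RoundingMode

object StrideSketch {
  // Removed lines: divide each bound with Long arithmetic, then subtract.
  // Each division truncates toward zero, so up to one whole unit is lost per bound.
  def longStride(lower: Long, upper: Long, numPartitions: Int): Long =
    upper / numPartitions - lower / numPartitions

  // Added lines: divide as BigDecimal, keep 18 fractional digits, then subtract.
  def preciseStride(lower: Long, upper: Long, numPartitions: Int): BigDecimal = {
    val n = BigDecimal(numPartitions)
    val upperStride = (BigDecimal(upper) / n).setScale(18, RoundingMode.HALF_EVEN)
    val lowerStride = (BigDecimal(lower) / n).setScale(18, RoundingMode.HALF_EVEN)
    upperStride - lowerStride
  }

  def main(args: Array[String]): Unit = {
    // Assumed encoding: dates as days since 1970-01-01.
    val lower = -14610L // 1930-01-01
    val upper = 18627L  // 2020-12-31
    println(longStride(lower, upper, 1000))           // prints 32
    println(preciseStride(lower, upper, 1000).toLong) // prints 33
    // Subtracting before dividing can wrap around for extreme bounds.
    println(Long.MaxValue - Long.MinValue)            // prints -1
  }
}
```

With these inputs the old formula yields a stride of 32 while the decimal formula yields 33.237, which truncates to 33. The sketch does not answer why a scale of 18 or `HALF_EVEN` was chosen; it only reproduces the arithmetic as written in the diff.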

##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
##
@@ -118,13 +119,28 @@ private[sql] object JDBCRelation extends Logging {
   s"Upper bound: ${boundValueToString(upperBound)}.")
 upperBound - lowerBound
   }
-// Overflow and silliness can happen if you subtract then divide.
-// Here we get a little roundoff, but that's (hopefully) OK.
-val stride: Long = upperBound / numPartitions - lowerBound / numPartitions
+
+// Overflow can happen if you subtract then divide. For example:
+// (Long.MaxValue - Long.MinValue) / (numPartitions - 2).
+// Also, using fixed-point decimals here to avoid possible inaccuracy from floating point.
+val strideUpperCalculation = (upperBound / BigDecimal(numPartitions))
+  .setScale(18, RoundingMode.HALF_EVEN)
+val strideLowerCalculation = (lowerBound / BigDecimal(numPartitions))
+  .setScale(18, RoundingMode.HALF_EVEN)
+
+val preciseStride = strideUpperCalculation - strideLowerCalculation
+val stride = preciseStride.toLong
+
+// Determine the number of strides the last partition will fall short of compared to the
+// supplied upper bound. Take half of those strides, and then add them to the lower bound
+// for better distribution of the first and last partitions.
+val lostNumOfStrides = (preciseStride - stride) * numPartitions / stride
+val lowerBoundWithStrideAlignment = lowerBound +
+  ((lostNumOfStrides / 2) * stride).setScale(0, RoundingMode.HALF_UP).toLong

Review comment:
   > This can lead to a big difference between the provided upper bound and the actual start of the last partition.
   
   Could you add a 
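
To illustrate the alignment step in the hunk above, here is a hedged sketch that replays the arithmetic for the bounds used in the PR's test. The object and method names are hypothetical, the dates are assumed to be encoded as days since 1970-01-01, and the last partition is assumed to start at the aligned lower bound plus `(numPartitions - 1) * stride`, which is not shown in this hunk.

```scala
import java.time.LocalDate
import scala.math.BigDecimal.RoundingMode

object AlignmentSketch {
  // Mirrors the added lines: shift the lower bound by half of the range
  // that truncating the stride to a Long would otherwise leave uncovered.
  def alignedLowerBound(lowerBound: Long, preciseStride: BigDecimal, numPartitions: Int): Long = {
    val stride = preciseStride.toLong
    val lostNumOfStrides =
      (preciseStride - BigDecimal(stride)) * BigDecimal(numPartitions) / BigDecimal(stride)
    val shift = ((lostNumOfStrides / BigDecimal(2)) * BigDecimal(stride))
      .setScale(0, RoundingMode.HALF_UP).toLong
    lowerBound + shift
  }

  def main(args: Array[String]): Unit = {
    val lower = -14610L                       // 1930-01-01 as an epoch day (assumed encoding)
    val preciseStride = BigDecimal("33.237")  // (18627 - (-14610)) / 1000
    val aligned = alignedLowerBound(lower, preciseStride, 1000)
    // Assumed layout: partition i starts at aligned + i * stride.
    val lastStart = aligned + 999L * preciseStride.toLong
    println(LocalDate.ofEpochDay(lastStart))  // prints 2020-08-02
  }
}
```

Under these assumptions the lower bound moves forward by 119 days, and the last partition starts on 2020-08-02, matching the predicate asserted in the new `JDBCSuite` test.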

[GitHub] [spark] HyukjinKwon commented on pull request #31958: [SPARK-34862][SQL] Support nested column in ORC vectorized reader

2021-03-25 Thread GitBox


HyukjinKwon commented on pull request #31958:
URL: https://github.com/apache/spark/pull/31958#issuecomment-807853044


   cc @viirya too





[GitHub] [spark] wangyum commented on pull request #29726: [SPARK-32855][SQL] Improve the cost model in pruningHasBenefit for filtering side can not build broadcast by join type

2021-03-25 Thread GitBox


wangyum commented on pull request #29726:
URL: https://github.com/apache/spark/pull/29726#issuecomment-807829270


   retest this please




