[GitHub] [spark] SparkQA commented on pull request #32018: [SPARK-34926][SQL] PartitioningUtils.getPathFragment() should respect partition value is null
SparkQA commented on pull request #32018: URL: https://github.com/apache/spark/pull/32018#issuecomment-812334201 **[Test build #136841 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136841/testReport)** for PR 32018 at commit [`992001b`](https://github.com/apache/spark/commit/992001bcf3ea7569a492659d97fbde25a5f0c406). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis r
AmplabJenkins removed a comment on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812333617 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41418/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
AmplabJenkins removed a comment on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812333616
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32034: [SPARK-34940][SQL][TEST] Fix test of BasicWriteTaskStatsTrackerSuite
AmplabJenkins removed a comment on pull request #32034: URL: https://github.com/apache/spark/pull/32034#issuecomment-812333621 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41416/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32033: [SPARK-34939][CORE] Throw fetch failure exception when unable to deserialize map statuses
AmplabJenkins removed a comment on pull request #32033: URL: https://github.com/apache/spark/pull/32033#issuecomment-812333614 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41417/
[GitHub] [spark] AmplabJenkins commented on pull request #32034: [SPARK-34940][SQL][TEST] Fix test of BasicWriteTaskStatsTrackerSuite
AmplabJenkins commented on pull request #32034: URL: https://github.com/apache/spark/pull/32034#issuecomment-812333621 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41416/
[GitHub] [spark] AmplabJenkins commented on pull request #32033: [SPARK-34939][CORE] Throw fetch failure exception when unable to deserialize map statuses
AmplabJenkins commented on pull request #32033: URL: https://github.com/apache/spark/pull/32033#issuecomment-812333614 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41417/
[GitHub] [spark] AmplabJenkins commented on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
AmplabJenkins commented on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812333618
[GitHub] [spark] AmplabJenkins commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run
AmplabJenkins commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812333617 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41418/
[GitHub] [spark] SparkQA commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run.
SparkQA commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812332967
[GitHub] [spark] SparkQA removed a comment on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
SparkQA removed a comment on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812287685 **[Test build #136835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136835/testReport)** for PR 32015 at commit [`dc7b70d`](https://github.com/apache/spark/commit/dc7b70daad9bd8f99952023110578b40a2233732).
[GitHub] [spark] SparkQA commented on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
SparkQA commented on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812331657 **[Test build #136835 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136835/testReport)** for PR 32015 at commit [`dc7b70d`](https://github.com/apache/spark/commit/dc7b70daad9bd8f99952023110578b40a2233732). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #32034: [SPARK-34940][SQL][TEST] Fix test of BasicWriteTaskStatsTrackerSuite
SparkQA commented on pull request #32034: URL: https://github.com/apache/spark/pull/32034#issuecomment-812330792
[GitHub] [spark] SparkQA removed a comment on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
SparkQA removed a comment on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812288459 **[Test build #136836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136836/testReport)** for PR 32015 at commit [`d9f9aae`](https://github.com/apache/spark/commit/d9f9aaec23e8f94bfa357264dda7376f6c615333).
[GitHub] [spark] SparkQA commented on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
SparkQA commented on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812330387 **[Test build #136836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136836/testReport)** for PR 32015 at commit [`d9f9aae`](https://github.com/apache/spark/commit/d9f9aaec23e8f94bfa357264dda7376f6c615333). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #32033: [SPARK-34939][CORE] Throw fetch failure exception when unable to deserialize map statuses
SparkQA commented on pull request #32033: URL: https://github.com/apache/spark/pull/32033#issuecomment-812329902 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41417/
[GitHub] [spark] WangGuangxin commented on a change in pull request #31967: [SPARK-34819][SQL] MapType supports orderable semantics
WangGuangxin commented on a change in pull request #31967: URL: https://github.com/apache/spark/pull/31967#discussion_r606078034

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeMapType.scala
## @@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import scala.math.Ordering
+
+import org.apache.spark.sql.catalyst.expressions.{And, EqualTo, ExpectsInputTypes, Expression, UnaryExpression}
+import org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext
+import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.{getValue, javaType}
+import org.apache.spark.sql.catalyst.expressions.codegen.ExprCode
+import org.apache.spark.sql.catalyst.planning.ExtractEquiJoinKeys
+import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Window}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.catalyst.util.{ArrayBasedMapBuilder, MapData, TypeUtils}
+import org.apache.spark.sql.types.{AbstractDataType, DataType, MapType}
+
+/**
+ * When comparing two maps, we have to make sure two maps have the same key value pairs but
+ * with different key ordering are equal.
+ * For example, Map('a' -> 1, 'b' -> 2) equals to Map('b' -> 2, 'a' -> 1).
+ *
+ * We have to specially handle this in grouping/join/window because Spark SQL turns
+ * grouping/join/window partition keys into binary `UnsafeRow` and compare the
+ * binary data directly instead of using MapType's ordering. So in these cases, we have
+ * to insert an expression to sort map entries by key.
+ *
+ * Note that, this rule must be executed at the end of optimizer, because the optimizer may create
+ * new joins(the subquery rewrite) and new join conditions(the join reorder).
+ */
+object NormalizeMapType extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case w: Window if w.partitionSpec.exists(p => needNormalize(p)) =>
+      w.copy(partitionSpec = w.partitionSpec.map(normalize))
+
+    case j @ ExtractEquiJoinKeys(_, leftKeys, rightKeys, condition, _, _, _)
+      // The analyzer guarantees left and right joins keys are of the same data type.
+      if leftKeys.exists(k => needNormalize(k)) =>
+      val newLeftJoinKeys = leftKeys.map(normalize)
+      val newRightJoinKeys = rightKeys.map(normalize)
+      val newConditions = newLeftJoinKeys.zip(newRightJoinKeys).map {
+        case (l, r) => EqualTo(l, r)
+      } ++ condition
+      j.copy(condition = Some(newConditions.reduce(And)))
+  }
+
+  private def needNormalize(expr: Expression): Boolean = expr match {
+    case SortMapKey(_) => false
+    case e if e.dataType.isInstanceOf[MapType] => true
+    case _ => false
+  }
+
+  private[sql] def normalize(expr: Expression): Expression = expr match {
+    case _ if !needNormalize(expr) => expr
+    case e if e.dataType.isInstanceOf[MapType] =>
+      SortMapKey(e)
+  }
+}
+
+case class SortMapKey(child: Expression) extends UnaryExpression with ExpectsInputTypes {
+  private lazy val MapType(keyType, valueType, valueContainsNull) = dataType.asInstanceOf[MapType]
+  private lazy val keyOrdering: Ordering[Any] = TypeUtils.getInterpretedOrdering(keyType)
+  private lazy val mapBuilder = new ArrayBasedMapBuilder(keyType, valueType)
+
+  override def inputTypes: Seq[AbstractDataType] = Seq(MapType)
+
+  override def dataType: DataType = child.dataType
+
+  override def nullSafeEval(input: Any): Any = {
+    val childMap = input.asInstanceOf[MapData]
+    val keys = childMap.keyArray()

Review comment: Seems that I missed this case. I'll fix it
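The Scaladoc quoted in the diff above explains why map entries need to be sorted by key before keys are compared as raw binary data. A minimal, Spark-free sketch of that idea follows. The `MapKeyNormalization` object and its `encode` and `normalize` helpers are hypothetical names invented for this illustration; they are not part of the PR, and a string encoding merely stands in for the binary `UnsafeRow` form.

```scala
// Hypothetical illustration only: none of these names come from Spark or the PR.
object MapKeyNormalization {
  // Flatten the entries in their stored order, the way an order-sensitive
  // byte-wise comparison would see them.
  def encode(entries: Seq[(String, Int)]): String =
    entries.map { case (k, v) => s"$k=$v" }.mkString(",")

  // Sort the entries by key so that maps with the same key/value pairs
  // end up with the same encoding regardless of insertion order.
  def normalize(entries: Seq[(String, Int)]): Seq[(String, Int)] =
    entries.sortBy { case (k, _) => k }
}
```

With `Seq("a" -> 1, "b" -> 2)` and `Seq("b" -> 2, "a" -> 1)` as inputs, the raw encodings differ, while the encodings of the normalized entries match. That is the property the rule appears to rely on when it wraps map-typed keys in `SortMapKey`.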
[GitHub] [spark] MaxGekk commented on pull request #32018: [SPARK-34926][SQL] PartitioningUtils.getPathFragment() should respect partition value is null
MaxGekk commented on pull request #32018: URL: https://github.com/apache/spark/pull/32018#issuecomment-812325437 jenkins, retest this, please
[GitHub] [spark] MaxGekk commented on pull request #32018: [SPARK-34926][SQL] PartitioningUtils.getPathFragment() should respect partition value is null
MaxGekk commented on pull request #32018: URL: https://github.com/apache/spark/pull/32018#issuecomment-812325204 GitHub Actions (GA) checks are failing, for instance on the Avro tests, and the Jenkins build failed on the latest commit. @AngersZh To continue with the fix, let's re-trigger the tests. Also, @cloud-fan, could you look at this PR, since you reviewed the previous changes related to null partition values?
[GitHub] [spark] attilapiros commented on pull request #31871: [SPARK-34779][CORE] ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs
attilapiros commented on pull request #31871: URL: https://github.com/apache/spark/pull/31871#issuecomment-812323640 Merged to master
[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
sadhen commented on a change in pull request #32026: URL: https://github.com/apache/spark/pull/32026#discussion_r606075743

## File path: python/pyspark/sql/tests/test_arrow.py
## @@ -196,6 +197,33 @@ def test_pandas_round_trip(self):
         pdf_arrow = df.toPandas()
         assert_frame_equal(pdf_arrow, pdf)
 
+    def test_udt_roundtrip(self):
+        pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1.0, 1.0), ExamplePoint(2.0, 2.0)])})
+        schema = StructType([StructField('point', ExamplePointUDT(), False)])
+        with self.sql_conf({"spark.sql.execution.arrow.pyspark.fallback.enabled": True}):
+            df = self.spark.createDataFrame(pdf, schema)
+            pdf_arrow = df.toPandas()
+            assert_frame_equal(pdf_arrow, pdf)
+        with self.sql_conf({"spark.sql.execution.arrow.pyspark.fallback.enabled": False}):
+            df = self.spark.createDataFrame(pdf, schema)
+            pdf_arrow = df.toPandas()
+            assert_frame_equal(pdf_arrow, pdf)
+
+    def test_array_udt_roundtrip(self):
+        pdf = pd.DataFrame({'points': pd.Series([
+            [ExamplePoint(1.0, 1.0), ExamplePoint(1.0, 2.0), ExamplePoint(1.0, 3.0)],

Review comment: Wrapping a primitive data type in a UDT is not good practice. As a result, I do not think we should spend much time on supporting a UDT that is actually a primitive data type. This part can be postponed.
[GitHub] [spark] attilapiros closed pull request #31871: [SPARK-34779][CORE] ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs
attilapiros closed pull request #31871: URL: https://github.com/apache/spark/pull/31871
[GitHub] [spark] SparkQA commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run.
SparkQA commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812315631 **[Test build #136840 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136840/testReport)** for PR 32032 at commit [`12fdbe9`](https://github.com/apache/spark/commit/12fdbe9aea3775bd57b8fe04ecf9a944eadc7c8b).
[GitHub] [spark] SparkQA commented on pull request #32033: [SPARK-34939][CORE] Throw fetch failure exception when unable to deserialize map statuses
SparkQA commented on pull request #32033: URL: https://github.com/apache/spark/pull/32033#issuecomment-812315600 **[Test build #136839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136839/testReport)** for PR 32033 at commit [`12e93fa`](https://github.com/apache/spark/commit/12e93fa035d6126927ce54403c9c9983ce90968f).
[GitHub] [spark] SparkQA commented on pull request #32034: [SPARK-34940][SQL][TEST] Fix test of BasicWriteTaskStatsTrackerSuite
SparkQA commented on pull request #32034: URL: https://github.com/apache/spark/pull/32034#issuecomment-812315579 **[Test build #136838 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136838/testReport)** for PR 32034 at commit [`f1037c7`](https://github.com/apache/spark/commit/f1037c7efd471ed438871b7c47057fcea73f8592).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30145: [SPARK-33233][SQL]CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal
AmplabJenkins removed a comment on pull request #30145: URL: https://github.com/apache/spark/pull/30145#issuecomment-812315218 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136829/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #31968: [SPARK-34873][SQL] Avoid wrapped in withNewExecutionId twice when run SQL with side effects
AmplabJenkins removed a comment on pull request #31968: URL: https://github.com/apache/spark/pull/31968#issuecomment-812315217 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41415/
[GitHub] [spark] c21 commented on pull request #32034: [SPARK-34940][SQL][TEST] Fix test of BasicWriteTaskStatsTrackerSuite
c21 commented on pull request #32034: URL: https://github.com/apache/spark/pull/32034#issuecomment-812315290 @cloud-fan could you help take a look when you have time, thanks.
[GitHub] [spark] AmplabJenkins commented on pull request #31968: [SPARK-34873][SQL] Avoid wrapped in withNewExecutionId twice when run SQL with side effects
AmplabJenkins commented on pull request #31968: URL: https://github.com/apache/spark/pull/31968#issuecomment-812315217 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41415/
[GitHub] [spark] AmplabJenkins commented on pull request #30145: [SPARK-33233][SQL]CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal
AmplabJenkins commented on pull request #30145: URL: https://github.com/apache/spark/pull/30145#issuecomment-812315218 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136829/
[GitHub] [spark] c21 opened a new pull request #32034: [SPARK-34940][SQL][TEST] Fix test of BasicWriteTaskStatsTrackerSuite
c21 opened a new pull request #32034: URL: https://github.com/apache/spark/pull/32034

### What changes were proposed in this pull request?
This fixes a minor typo in a unit test of BasicWriteTaskStatsTrackerSuite (https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala#L152). The test should use a new file name there, e.g. `f-3-3`, because it expects 3 files in the statistics (https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BasicWriteTaskStatsTrackerSuite.scala#L160).

### Why are the changes needed?
To fix a minor bug in the test.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By changing the unit test `"Three files, last one empty"` itself.
[GitHub] [spark] SparkQA removed a comment on pull request #30145: [SPARK-33233][SQL]CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal
SparkQA removed a comment on pull request #30145: URL: https://github.com/apache/spark/pull/30145#issuecomment-812236536 **[Test build #136829 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136829/testReport)** for PR 30145 at commit [`ff7971b`](https://github.com/apache/spark/commit/ff7971b2817b46c45b0584dfdfdda999bfd2b96d).
[GitHub] [spark] SparkQA commented on pull request #30145: [SPARK-33233][SQL]CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal
SparkQA commented on pull request #30145: URL: https://github.com/apache/spark/pull/30145#issuecomment-812314277 **[Test build #136829 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136829/testReport)** for PR 30145 at commit [`ff7971b`](https://github.com/apache/spark/commit/ff7971b2817b46c45b0584dfdfdda999bfd2b96d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #31968: [SPARK-34873][SQL] Avoid wrapped in withNewExecutionId twice when run SQL with side effects
SparkQA commented on pull request #31968: URL: https://github.com/apache/spark/pull/31968#issuecomment-812313719 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #32033: [SPARK-34939][CORE] Throw fetch failure exception when unable to deserialize map statuses
viirya commented on a change in pull request #32033: URL: https://github.com/apache/spark/pull/32033#discussion_r606068095 ## File path: core/src/main/scala/org/apache/spark/MapOutputTracker.scala ## @@ -100,7 +100,7 @@ private class ShuffleStatus(numPartitions: Int) extends Logging { * broadcast variable in order to keep it from being garbage collected and to allow for it to be * explicitly destroyed later on when the ShuffleMapStage is garbage-collected. */ - private[this] var cachedSerializedBroadcast: Broadcast[Array[Byte]] = _ + private[spark] var cachedSerializedBroadcast: Broadcast[Array[Byte]] = _ Review comment: Expose this for test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya opened a new pull request #32033: [SPARK-34939][CORE] Throw fetch failure exception when unable to deserialize map statuses
viirya opened a new pull request #32033: URL: https://github.com/apache/spark/pull/32033 ### What changes were proposed in this pull request? This patch catches the `IOException` that can be thrown when deserializing map statuses fails (e.g., because the broadcasted value has been destroyed). Once the `IOException` is caught, a `MetadataFetchFailedException` is thrown so that Spark can handle it. ### Why are the changes needed? One customer encountered an application error. According to the log, it was caused by accessing a non-existent broadcasted value, namely the map statuses. There is a race condition: after the map statuses are broadcasted and the executors have obtained the serialized broadcasted map statuses, any subsequent fetch failure makes the Spark scheduler invalidate the cached map statuses and destroy their broadcasted value. Any executor that then tries to deserialize the serialized broadcasted map statuses and access the broadcasted value gets an `IOException`. Currently we don't catch it in `MapOutputTrackerWorker`, so the exception fails the application. Normally we should throw a fetch failure exception in such a case, and the Spark scheduler will handle it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Waiting for customer verification too. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
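The error-translation pattern described in this PR can be illustrated with a small Python sketch. It is not Spark's implementation (which is Scala, in `MapOutputTracker.scala`); `Broadcast`, `deserialize_map_statuses`, `get_statuses`, and the Python class `MetadataFetchFailedError` are hypothetical stand-ins, and the assumption that a destroyed broadcast surfaces as an I/O error is taken from the PR description only.

```python
import pickle


class MetadataFetchFailedError(Exception):
    """Hypothetical stand-in for Spark's MetadataFetchFailedException."""

    def __init__(self, shuffle_id, message):
        super().__init__(message)
        self.shuffle_id = shuffle_id


class Broadcast:
    """Hypothetical broadcast variable that the driver may destroy at any time."""

    def __init__(self, value):
        self._value = value
        self._destroyed = False

    def destroy(self):
        self._destroyed = True
        self._value = None

    @property
    def value(self):
        if self._destroyed:
            # Assumption: reading a destroyed broadcast fails with an I/O error.
            raise IOError("broadcast value was destroyed")
        return self._value


def deserialize_map_statuses(broadcast):
    # Reads the broadcasted bytes and decodes them; may raise IOError.
    return pickle.loads(broadcast.value)


def get_statuses(shuffle_id, broadcast):
    # Translate the low-level I/O failure into a fetch failure, so that a
    # scheduler could retry the stage instead of failing the application.
    try:
        return deserialize_map_statuses(broadcast)
    except IOError as e:
        raise MetadataFetchFailedError(
            shuffle_id, f"Unable to deserialize broadcasted map statuses: {e}"
        ) from e
```

The point of the design is that the caller sees one well-known failure type whose handling already exists, rather than an unexpected exception that aborts the job.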
[GitHub] [spark] SparkQA commented on pull request #31968: [SPARK-34873][SQL] Avoid wrapped in withNewExecutionId twice when run SQL with side effects
SparkQA commented on pull request #31968: URL: https://github.com/apache/spark/pull/31968#issuecomment-812301835 **[Test build #136837 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136837/testReport)** for PR 31968 at commit [`37d64d5`](https://github.com/apache/spark/commit/37d64d53a8a59de3617b1a8114cd28d25f30c900). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
AmplabJenkins removed a comment on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812301609 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41414/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis r
AmplabJenkins removed a comment on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812301610 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136833/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30480: [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins removed a comment on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-812301608 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136832/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
AmplabJenkins removed a comment on pull request #32026: URL: https://github.com/apache/spark/pull/32026#issuecomment-812301612 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136834/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
AmplabJenkins commented on pull request #32026: URL: https://github.com/apache/spark/pull/32026#issuecomment-812301612 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136834/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run
AmplabJenkins commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812301610 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136833/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
AmplabJenkins commented on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812301609 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41414/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30480: [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-812301608 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136832/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
SparkQA removed a comment on pull request #32026: URL: https://github.com/apache/spark/pull/32026#issuecomment-812287668 **[Test build #136834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136834/testReport)** for PR 32026 at commit [`92f3829`](https://github.com/apache/spark/commit/92f382957c038d34b4344261e86fa1bc6956369b). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
SparkQA commented on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812299706 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
SparkQA commented on pull request #32026: URL: https://github.com/apache/spark/pull/32026#issuecomment-812299694 **[Test build #136834 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136834/testReport)** for PR 32026 at commit [`92f3829`](https://github.com/apache/spark/commit/92f382957c038d34b4344261e86fa1bc6956369b). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules r
SparkQA removed a comment on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812274055 **[Test build #136833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136833/testReport)** for PR 32032 at commit [`43f70b2`](https://github.com/apache/spark/commit/43f70b2319790e6746c53c6ab5255971468cc2b7). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run.
SparkQA commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812298245 **[Test build #136833 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136833/testReport)** for PR 32032 at commit [`43f70b2`](https://github.com/apache/spark/commit/43f70b2319790e6746c53c6ab5255971468cc2b7). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #30480: [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
SparkQA removed a comment on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-812256558 **[Test build #136832 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136832/testReport)** for PR 30480 at commit [`a10eba1`](https://github.com/apache/spark/commit/a10eba1a558c81335fe69928904a1a2f4b4f85d9). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30480: [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
SparkQA commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-812291998 **[Test build #136832 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136832/testReport)** for PR 30480 at commit [`a10eba1`](https://github.com/apache/spark/commit/a10eba1a558c81335fe69928904a1a2f4b4f85d9). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis r
AmplabJenkins removed a comment on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812290641 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41413/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run
AmplabJenkins commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812290641 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41413/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run.
SparkQA commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812290627 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
SparkQA commented on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812288459 **[Test build #136836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136836/testReport)** for PR 32015 at commit [`d9f9aae`](https://github.com/apache/spark/commit/d9f9aaec23e8f94bfa357264dda7376f6c615333). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
HyukjinKwon commented on a change in pull request #32015: URL: https://github.com/apache/spark/pull/32015#discussion_r606045845 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/ExtractBenchmark.scala ## @@ -92,8 +92,9 @@ object ExtractBenchmark extends SqlBasedBenchmark { val intervalFields = Seq("YEAR", "MONTH", "DAY", "HOUR", "MINUTE", "SECOND") val settings = Map( "timestamp" -> datetimeFields, - "date" -> datetimeFields, - "interval" -> intervalFields) + "date" -> datetimeFields) + // TODO(SPARK-34938): Recover the benchmark of interval case Review comment: cc @MaxGekk FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
AmplabJenkins removed a comment on pull request #32026: URL: https://github.com/apache/spark/pull/32026#issuecomment-812074839 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
SparkQA commented on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-812287685 **[Test build #136835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136835/testReport)** for PR 32015 at commit [`dc7b70d`](https://github.com/apache/spark/commit/dc7b70daad9bd8f99952023110578b40a2233732). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30480: [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins removed a comment on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-812287428 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41410/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
SparkQA commented on pull request #32026: URL: https://github.com/apache/spark/pull/32026#issuecomment-812287668 **[Test build #136834 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136834/testReport)** for PR 32026 at commit [`92f3829`](https://github.com/apache/spark/commit/92f382957c038d34b4344261e86fa1bc6956369b). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #30480: [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
AmplabJenkins commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-812287428 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41410/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
HyukjinKwon commented on a change in pull request #32015: URL: https://github.com/apache/spark/pull/32015#discussion_r606045223 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/ExtractBenchmark.scala ## @@ -92,8 +92,9 @@ object ExtractBenchmark extends SqlBasedBenchmark { val intervalFields = Seq("YEAR", "MONTH", "DAY", "HOUR", "MINUTE", "SECOND") val settings = Map( "timestamp" -> datetimeFields, - "date" -> datetimeFields, - "interval" -> intervalFields) + "date" -> datetimeFields) + // TODO(SPARK-34938): Recover the benchmark of internal case Review comment: internal -> interval .. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
HyukjinKwon commented on a change in pull request #32015: URL: https://github.com/apache/spark/pull/32015#discussion_r606044927 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/ExtractBenchmark.scala ## @@ -92,8 +92,9 @@ object ExtractBenchmark extends SqlBasedBenchmark { val intervalFields = Seq("YEAR", "MONTH", "DAY", "HOUR", "MINUTE", "SECOND") val settings = Map( "timestamp" -> datetimeFields, - "date" -> datetimeFields, - "interval" -> intervalFields) + "date" -> datetimeFields) + // TODO(SPARK-34938): Recover the benchmark of internal case Review comment: cc @MaxGekk FYI -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon edited a comment on pull request #32015: [SPARK-34821][INFRA] Set up a workflow for developers to run benchmark in their fork
HyukjinKwon edited a comment on pull request #32015: URL: https://github.com/apache/spark/pull/32015#issuecomment-811064987 Note that I tested a subset of the benchmarks, verified that it works, and am now waiting for the final results of running all benchmarks: - [Run benchmarks: * (JDK 11)](https://github.com/HyukjinKwon/spark/actions/runs/710425382) - [Run benchmarks: * (JDK 8)](https://github.com/HyukjinKwon/spark/actions/runs/710425286) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
sadhen commented on a change in pull request #32026: URL: https://github.com/apache/spark/pull/32026#discussion_r606039240 ## File path: python/pyspark/sql/tests/test_arrow.py ## @@ -196,6 +197,33 @@ def test_pandas_round_trip(self): pdf_arrow = df.toPandas() assert_frame_equal(pdf_arrow, pdf) +def test_udt_roundtrip(self): +pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1.0, 1.0), ExamplePoint(2.0, 2.0)])}) +schema = StructType([StructField('point', ExamplePointUDT(), False)]) +with self.sql_conf({"spark.sql.execution.arrow.pyspark.fallback.enabled": True}): +df = self.spark.createDataFrame(pdf, schema) +pdf_arrow = df.toPandas() +assert_frame_equal(pdf_arrow, pdf) +with self.sql_conf({"spark.sql.execution.arrow.pyspark.fallback.enabled": False}): +df = self.spark.createDataFrame(pdf, schema) +pdf_arrow = df.toPandas() +assert_frame_equal(pdf_arrow, pdf) + +def test_array_udt_roundtrip(self): +pdf = pd.DataFrame({'points': pd.Series([ +[ExamplePoint(1.0, 1.0), ExamplePoint(1.0, 2.0), ExamplePoint(1.0, 3.0)], Review comment: I thought UDTs were for complex data types. For a UDT that is actually a primitive type, let me add unit tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #30480: [SPARK-32921][SHUFFLE] MapOutputTracker extensions to support push-based shuffle
SparkQA commented on pull request #30480: URL: https://github.com/apache/spark/pull/30480#issuecomment-812278108 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
sadhen commented on a change in pull request #32026: URL: https://github.com/apache/spark/pull/32026#discussion_r606036073 ## File path: python/pyspark/sql/tests/test_arrow.py ## @@ -196,6 +197,33 @@ def test_pandas_round_trip(self): pdf_arrow = df.toPandas() assert_frame_equal(pdf_arrow, pdf) +def test_udt_roundtrip(self): +pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1.0, 1.0), ExamplePoint(2.0, 2.0)])}) +schema = StructType([StructField('point', ExamplePointUDT(), False)]) +with self.sql_conf({"spark.sql.execution.arrow.pyspark.fallback.enabled": True}): +df = self.spark.createDataFrame(pdf, schema) +pdf_arrow = df.toPandas() +assert_frame_equal(pdf_arrow, pdf) +with self.sql_conf({"spark.sql.execution.arrow.pyspark.fallback.enabled": False}): +df = self.spark.createDataFrame(pdf, schema) +pdf_arrow = df.toPandas() +assert_frame_equal(pdf_arrow, pdf) + +def test_array_udt_roundtrip(self): +pdf = pd.DataFrame({'points': pd.Series([ +[ExamplePoint(1.0, 1.0), ExamplePoint(1.0, 2.0), ExamplePoint(1.0, 3.0)], Review comment: See `_deserialize_pandas_with_udt`; support for StructType is postponed to later PRs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
sadhen commented on a change in pull request #32026: URL: https://github.com/apache/spark/pull/32026#discussion_r606035670
## File path: python/pyspark/sql/types.py
@@ -764,6 +764,21 @@ def __eq__(self, other):
         return type(self) == type(other)
+def _is_datatype_with_udt(dt):
Review comment: fixed
[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
sadhen commented on a change in pull request #32026: URL: https://github.com/apache/spark/pull/32026#discussion_r606035644
## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowEvalPythonExec.scala
@@ -89,9 +89,57 @@ case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute]
     columnarBatchIter.flatMap { batch =>
       val actualDataTypes = (0 until batch.numCols()).map(i => batch.column(i).dataType())
-      assert(outputTypes == actualDataTypes, "Invalid schema from pandas_udf: " +
-        s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
+      assert(plainSchemaSeq(outputTypes) == actualDataTypes,
+        "Incompatible schema from pandas_udf: " +
+          s"expected ${outputTypes.mkString(", ")}, got ${actualDataTypes.mkString(", ")}")
       batch.rowIterator.asScala
     }
   }
+
+  private def plainSchemaSeq(schema: Seq[DataType]): Seq[DataType] = {
+    schema.map(v => ArrowEvalPythonExec.plainSchema(v)).toList
+  }
+
+}
+
+private[sql] object ArrowEvalPythonExec {
+  /**
+   * Erase User-Defined Types and returns the plain Spark StructType instead.
+   *
+   * UserDefinedType:
+   * - will be erased as dt.sqlType
Review comment: Fixed.
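The doc comment quoted above says a UserDefinedType "will be erased as dt.sqlType" before the schema comparison. The following is a minimal sketch of that idea in Python, not the PR's Scala code: every class here (`DoubleT`, `ArrayT`, `StructT`, `PointUDT`) and the function `erase_udt` are hypothetical stand-ins, and the recursion into arrays and structs is an assumption, since the quoted comment is cut off before it describes nested types.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical toy type system, not the Spark API.
@dataclass(frozen=True)
class DoubleT:
    pass


@dataclass(frozen=True)
class ArrayT:
    element: object


@dataclass(frozen=True)
class StructT:
    fields: Tuple[Tuple[str, object], ...]


@dataclass(frozen=True)
class PointUDT:
    """A user-defined type whose storage form is an array of doubles."""

    def sql_type(self):
        return ArrayT(DoubleT())


def erase_udt(dt):
    """Replace every user-defined type in `dt` with its storage type."""
    if hasattr(dt, "sql_type"):
        # A UDT is erased to its underlying type (itself erased recursively).
        return erase_udt(dt.sql_type())
    if isinstance(dt, ArrayT):
        # Assumption: nested element types are erased as well.
        return ArrayT(erase_udt(dt.element))
    if isinstance(dt, StructT):
        return StructT(tuple((name, erase_udt(t)) for name, t in dt.fields))
    return dt


expected = StructT((("point", PointUDT()),))
actual = StructT((("point", ArrayT(DoubleT())),))
print(expected == actual)             # the raw schemas differ
print(erase_udt(expected) == actual)  # they agree once the UDT is erased
```

Comparing the erased expected schema against the actual one mirrors the shape of the changed assertion, which compares `plainSchemaSeq(outputTypes)` with `actualDataTypes`.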
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
AmplabJenkins removed a comment on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812275412 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41412/
[GitHub] [spark] AmplabJenkins commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
AmplabJenkins commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812275412 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41412/
[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
SparkQA commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812275259 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41412/
[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
SparkQA commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812274710 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41412/
[GitHub] [spark] imback82 commented on a change in pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis ru
imback82 commented on a change in pull request #32032: URL: https://github.com/apache/spark/pull/32032#discussion_r606034352
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala
@@ -830,7 +830,7 @@ abstract class TreeNode[BaseType <: TreeNode[BaseType]] extends Product {
 }
 trait LeafLike[T <: TreeNode[T]] { self: TreeNode[T] =>
-  override final def children: Seq[T] = Nil
+  override def children: Seq[T] = Nil
Review comment: @cloud-fan I am removing this `final` temporarily. If the approach of this PR is OK, I will add this back and refactor.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis r
AmplabJenkins removed a comment on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812261585 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41411/
[GitHub] [spark] SparkQA commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run.
SparkQA commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812274055 **[Test build #136833 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136833/testReport)** for PR 32032 at commit [`43f70b2`](https://github.com/apache/spark/commit/43f70b2319790e6746c53c6ab5255971468cc2b7).
[GitHub] [spark] LuciferYang commented on pull request #31776: [SPARK-34661][SQL] Clean up `OriginalType` and `DecimalMetadata ` usage in Parquet related code
LuciferYang commented on pull request #31776: URL: https://github.com/apache/spark/pull/31776#issuecomment-812272546 Gentle ping, @wangyum @HyukjinKwon @dongjoon-hyun @maropu
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32030: [WIP] Improve map children
AmplabJenkins removed a comment on pull request #32030: URL: https://github.com/apache/spark/pull/32030#issuecomment-812271782
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
AmplabJenkins removed a comment on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812271785 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41407/
[GitHub] [spark] AmplabJenkins commented on pull request #32030: [WIP] Improve map children
AmplabJenkins commented on pull request #32030: URL: https://github.com/apache/spark/pull/32030#issuecomment-812271783
[GitHub] [spark] AmplabJenkins commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
AmplabJenkins commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812271785 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41407/
[GitHub] [spark] SparkQA removed a comment on pull request #32030: [WIP] Improve map children
SparkQA removed a comment on pull request #32030: URL: https://github.com/apache/spark/pull/32030#issuecomment-812236261 **[Test build #136828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136828/testReport)** for PR 32030 at commit [`7045e7a`](https://github.com/apache/spark/commit/7045e7a8bb844e6a5d48fda0ab06926f67c9f4ca).
[GitHub] [spark] SparkQA commented on pull request #32030: [WIP] Improve map children
SparkQA commented on pull request #32030: URL: https://github.com/apache/spark/pull/32030#issuecomment-812270113 **[Test build #136828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136828/testReport)** for PR 32030 at commit [`7045e7a`](https://github.com/apache/spark/commit/7045e7a8bb844e6a5d48fda0ab06926f67c9f4ca).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
SparkQA commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812266720
[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
sadhen commented on a change in pull request #32026: URL: https://github.com/apache/spark/pull/32026#discussion_r606026938
## File path: python/pyspark/sql/pandas/conversion.py
@@ -452,24 +457,27 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
                 struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
             schema = struct
-        # Determine arrow types to coerce data when creating batches
+        # Determine data types to coerce data when creating batches
         if isinstance(schema, StructType):
-            arrow_types = [to_arrow_type(f.dataType) for f in schema.fields]
+            data_types = [f.dataType for f in schema.fields]
         elif isinstance(schema, DataType):
             raise ValueError("Single data type %s is not supported with Arrow" % str(schema))
         else:
             # Any timestamps must be coerced to be compatible with Spark
-            arrow_types = [to_arrow_type(TimestampType())
-                           if is_datetime64_dtype(t) or is_datetime64tz_dtype(t) else None
-                           for t in pdf.dtypes]
+            data_types = [to_arrow_type(TimestampType())
+                          if is_datetime64_dtype(t) or is_datetime64tz_dtype(t) else None
+                          for t in pdf.dtypes]
         # Slice the DataFrame to be batched
         step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
         pdf_slices = (pdf.iloc[start:start + step] for start in range(0, len(pdf), step))
         # Create list of Arrow (columns, type) for serializer dump_stream
-        arrow_data = [[(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)]
-                      for pdf_slice in pdf_slices]
+        # Type can be Spark SQL Data Type or Arrow Data Type
+        arrow_data_with_t = [
Review comment: Well, I should use `adt` or `padt` for PyArrow Data Type and `pdt` for Pandas DataType.
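The quoted diff slices the DataFrame with `step = -(-len(pdf) // parallelism)`, which is ceiling division done with integer floor division. A small stand-alone sketch of that slicing arithmetic follows; `slice_bounds` is a hypothetical helper name, plain row counts stand in for the pandas DataFrame, and the empty-input guard is an addition of this sketch (a zero step would make `range` raise `ValueError`).

```python
def slice_bounds(n_rows, parallelism):
    """Return (start, stop) pairs that cover range(n_rows) in equal-sized steps."""
    if n_rows == 0:
        # Guard added for this sketch: range() rejects a step of zero.
        return []
    # Negating, floor-dividing, and negating again rounds the quotient up.
    step = -(-n_rows // parallelism)
    return [(start, min(start + step, n_rows)) for start in range(0, n_rows, step)]


print(slice_bounds(10, 4))
print(slice_bounds(3, 8))
```

With 10 rows and a parallelism of 4 the step is 3, so the last slice is shorter than the others; with fewer rows than the parallelism the step is 1 and there is one slice per row.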
[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
sadhen commented on a change in pull request #32026: URL: https://github.com/apache/spark/pull/32026#discussion_r606026544
## File path: python/pyspark/sql/pandas/conversion.py
@@ -20,9 +20,10 @@
 from pyspark.rdd import _load_from_socket
 from pyspark.sql.pandas.serializers import ArrowCollectSerializer
-from pyspark.sql.types import IntegralType
 from pyspark.sql.types import ByteType, ShortType, IntegerType, LongType, FloatType, \
-    DoubleType, BooleanType, MapType, TimestampType, StructType, DataType
+    DoubleType, BooleanType, MapType, TimestampType, StructType, DataType, \
+    IntegralType, _is_datatype_with_udt
+from pyspark.sql.pandas.types import _deserialize_pandas_with_udt
Review comment: See `git grep _make_type_verifier`: there are other cases where a function whose name starts with `_` is used outside the module where it is defined.
[GitHub] [spark] sadhen commented on a change in pull request #32026: [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support Enabled
sadhen commented on a change in pull request #32026: URL: https://github.com/apache/spark/pull/32026#discussion_r606025117
## File path: python/pyspark/sql/pandas/conversion.py
@@ -452,24 +457,27 @@ def _create_from_pandas_with_arrow(self, pdf, schema, timezone):
                 struct.add(name, from_arrow_type(field.type), nullable=field.nullable)
             schema = struct
-        # Determine arrow types to coerce data when creating batches
+        # Determine data types to coerce data when creating batches
         if isinstance(schema, StructType):
-            arrow_types = [to_arrow_type(f.dataType) for f in schema.fields]
+            data_types = [f.dataType for f in schema.fields]
         elif isinstance(schema, DataType):
             raise ValueError("Single data type %s is not supported with Arrow" % str(schema))
         else:
             # Any timestamps must be coerced to be compatible with Spark
-            arrow_types = [to_arrow_type(TimestampType())
-                           if is_datetime64_dtype(t) or is_datetime64tz_dtype(t) else None
-                           for t in pdf.dtypes]
+            data_types = [to_arrow_type(TimestampType())
+                          if is_datetime64_dtype(t) or is_datetime64tz_dtype(t) else None
+                          for t in pdf.dtypes]
         # Slice the DataFrame to be batched
         step = -(-len(pdf) // self.sparkContext.defaultParallelism)  # round int up
         pdf_slices = (pdf.iloc[start:start + step] for start in range(0, len(pdf), step))
         # Create list of Arrow (columns, type) for serializer dump_stream
-        arrow_data = [[(c, t) for (_, c), t in zip(pdf_slice.iteritems(), arrow_types)]
-                      for pdf_slice in pdf_slices]
+        # Type can be Spark SQL Data Type or Arrow Data Type
+        arrow_data_with_t = [
Review comment: Yes. It is renamed to indicate it is arrow data with datatype (Spark SQL DataType or Arrow DataType). In `serializers.py`, `dt` is for Spark SQL DataType, `pdt` is for pyarrow DataType.
[GitHub] [spark] SparkQA commented on pull request #32030: [WIP] Improve map children
SparkQA commented on pull request #32030: URL: https://github.com/apache/spark/pull/32030#issuecomment-812262654 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41408/
[GitHub] [spark] AmplabJenkins commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run
AmplabJenkins commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812261585 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41411/
[GitHub] [spark] SparkQA commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run.
SparkQA commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812261574 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41411/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis r
AmplabJenkins removed a comment on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812259237 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136830/
[GitHub] [spark] SparkQA removed a comment on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules r
SparkQA removed a comment on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812256299 **[Test build #136830 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136830/testReport)** for PR 32032 at commit [`b98c15c`](https://github.com/apache/spark/commit/b98c15c3862d5e42fc62cd5d393ad6fbf861b143).
[GitHub] [spark] AmplabJenkins commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run
AmplabJenkins commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812259237 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136830/
[GitHub] [spark] SparkQA commented on pull request #32032: [SPARK-34701][SQL] Introduce TransformaAfterAnalysis rule that allows a logical plan to be transformed after all the analysis rules run.
SparkQA commented on pull request #32032: URL: https://github.com/apache/spark/pull/32032#issuecomment-812259220 **[Test build #136830 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136830/testReport)** for PR 32032 at commit [`b98c15c`](https://github.com/apache/spark/commit/b98c15c3862d5e42fc62cd5d393ad6fbf861b143).
* This patch **fails to build**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #30145: [SPARK-33233][SQL]CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal
AmplabJenkins removed a comment on pull request #30145: URL: https://github.com/apache/spark/pull/30145#issuecomment-812258143 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41409/
[GitHub] [spark] AmplabJenkins commented on pull request #30145: [SPARK-33233][SQL]CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal
AmplabJenkins commented on pull request #30145: URL: https://github.com/apache/spark/pull/30145#issuecomment-812258143 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/41409/
[GitHub] [spark] SparkQA commented on pull request #30145: [SPARK-33233][SQL]CUBE/ROLLUP/GROUPING SETS support GROUP BY ordinal
SparkQA commented on pull request #30145: URL: https://github.com/apache/spark/pull/30145#issuecomment-812258134 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41409/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
AmplabJenkins removed a comment on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812256859 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136831/
[GitHub] [spark] AmplabJenkins commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service to support dynamic allocation on Kubernetes
AmplabJenkins commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-812256859 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/136831/