[jira] [Commented] (SPARK-34881) New SQL Function: TRY_CAST
[ https://issues.apache.org/jira/browse/SPARK-34881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312853#comment-17312853 ] Apache Spark commented on SPARK-34881: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32019 > New SQL Function: TRY_CAST > -- > > Key: SPARK-34881 > URL: https://issues.apache.org/jira/browse/SPARK-34881 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > Add a new SQL function try_cast. try_cast is identical to CAST with > `spark.sql.ansi.enabled` as true, except it returns NULL instead of raising > an error. This expression has one major difference from `cast` with > `spark.sql.ansi.enabled` as true: when the source value can't be stored in > the target integral(Byte/Short/Int/Long) type, `try_cast` returns null > instead of returning the low order bytes of the source value. > This is learned from Google BigQuery and Snowflake: > https://docs.snowflake.com/en/sql-reference/functions/try_cast.html > https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_casting -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
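The behavior described above can be pictured with a small plain-Python sketch (an illustration of the proposed semantics for integral targets, not Spark's implementation; the helper name and the 32-bit default are assumptions):

```python
def try_cast_int(value, bits=32):
    """Sketch of TRY_CAST semantics for an integral target type:
    return None (SQL NULL) instead of raising an error (ANSI CAST)
    or truncating to the low-order bytes (legacy CAST) when the
    source value is malformed or does not fit."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    try:
        n = int(str(value).strip())
    except (ValueError, TypeError):
        return None          # malformed input becomes NULL, not an error
    return n if lo <= n <= hi else None   # overflow becomes NULL, not wrap-around

print(try_cast_int("123"))   # a value in range casts normally
print(try_cast_int("abc"))   # a malformed string yields None
print(try_cast_int(2 ** 31)) # an out-of-INT-range value yields None
```

The key difference from ANSI `cast` is the last line: an out-of-range value maps to NULL rather than raising, and unlike legacy `cast` it never returns truncated low-order bytes.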
[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312851#comment-17312851 ] Attila Zsolt Piros commented on SPARK-34738: Shane, as Docker is available on every platform, I could check it locally (but I can only promise to do that next week). If I manage to reproduce this locally, I can take a look into the details. After this I think we should extend our documentation and suggest only those drivers the tests were run with (or at least mention that Docker is used for testing). Thanks for your effort! > Upgrade Minikube and kubernetes cluster version on Jenkins > -- > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and the kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR, which uses a new > method to configure the kubernetes client. Please use it for > testing on Jenkins after the Minikube version is updated.
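For reference, the upgrade described in the issue would look roughly like the following on a worker. This is a sketch only: the download URL follows Minikube's usual release layout, and the docker driver is assumed based on the testing discussion above.

```shell
# Sketch of the Minikube upgrade discussed in SPARK-34738 (paths/URL are assumptions).
curl -LO https://storage.googleapis.com/minikube/releases/v1.18.1/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Pin the Kubernetes version as requested in the issue:
minikube config set kubernetes-version v1.17.3

# Start with the driver the tests are expected to use (assumed: docker):
minikube start --driver=docker
```

This is an infrastructure/config fragment, not something runnable outside a Jenkins worker with Minikube installed.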
[jira] [Created] (SPARK-34930) Install PyArrow and pandas on Jenkins
Hyukjin Kwon created SPARK-34930: Summary: Install PyArrow and pandas on Jenkins Key: SPARK-34930 URL: https://issues.apache.org/jira/browse/SPARK-34930 Project: Spark Issue Type: Test Components: Project Infra Affects Versions: 3.2.0 Reporter: Hyukjin Kwon Assignee: Shane Knapp Looks like the Jenkins machines don't have pandas and PyArrow (ever since they were upgraded?), which results in the related PySpark tests being skipped; see also https://github.com/apache/spark/pull/31470#issuecomment-811618571 It would be great if we could install both for Python 3.6 on Jenkins.
[jira] [Assigned] (SPARK-34354) CostBasedJoinReorder can fail on self-join
[ https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34354: Assignee: (was: Apache Spark) > CostBasedJoinReorder can fail on self-join > -- > > Key: SPARK-34354 > URL: https://issues.apache.org/jira/browse/SPARK-34354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > > > For example: > {code:java} > test("join reorder with self-join") { > val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === > nameToAttr("t2.k-1-5"))) > .select(nameToAttr("t1.v-1-10")) > .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5"))) > // this can fail > Optimize.execute(plan.analyze) > } > {code} > error: > {code:java} > [info] java.lang.AssertionError: assertion failed > [info] at scala.Predef$.assert(Predef.scala:208) > [info] at > org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) > [info] at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > [info] at > 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169) > [info] at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
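One plausible way to picture the assertion failure above (an assumption about the mechanism, not Spark's actual code): a dynamic-programming join-reorder search keys candidate plans by the *set* of base relations they cover, and a self-join makes the same relation appear twice, so no key can account for both occurrences and the search cannot produce a plan covering "all" inputs. A toy Python sketch of that failure mode:

```python
from itertools import combinations

def dp_search(names):
    """Toy DP join-reorder search that, like a set-keyed DP table,
    indexes partial plans by the frozenset of relations they join.
    Returns a plan string for the full input, or None when the
    set-based bookkeeping cannot cover every join input."""
    plans = {frozenset([n]): n for n in names}              # level 1: single relations
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            key = frozenset(combo)
            if key not in plans and len(key) == size:        # duplicates shrink the key
                plans[key] = "(" + " JOIN ".join(sorted(key)) + ")"
    # The "assertion": a valid result must cover as many inputs as were given.
    return plans.get(frozenset(names)) if len(frozenset(names)) == len(names) else None

print(dp_search(["t1", "t2"]))        # a plan covering both relations
print(dp_search(["t2", "t1", "t2"]))  # None: the duplicated t2 collapses in the set key
```

In the real rule the analogous invariant is enforced with `assert` (the `JoinReorderDP.search` frame in the trace), which is why the failure surfaces as `java.lang.AssertionError` rather than a planner fallback.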
[jira] [Assigned] (SPARK-34354) CostBasedJoinReorder can fail on self-join
[ https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34354: Assignee: Apache Spark > CostBasedJoinReorder can fail on self-join > -- > > Key: SPARK-34354 > URL: https://issues.apache.org/jira/browse/SPARK-34354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > > For example: > {code:java} > test("join reorder with self-join") { > val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === > nameToAttr("t2.k-1-5"))) > .select(nameToAttr("t1.v-1-10")) > .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5"))) > // this can fail > Optimize.execute(plan.analyze) > } > {code} > error: > {code:java} > [info] java.lang.AssertionError: assertion failed > [info] at scala.Predef$.assert(Predef.scala:208) > [info] at > org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) > [info] at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > [info] at > 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169) > [info] at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-34354) CostBasedJoinReorder can fail on self-join
[ https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-34354: -- Assignee: (was: wuyi) Reverted in https://github.com/apache/spark/commit/cc451c16a3d28bcdba226fa05f1b786ff2f02612 > CostBasedJoinReorder can fail on self-join > -- > > Key: SPARK-34354 > URL: https://issues.apache.org/jira/browse/SPARK-34354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > Fix For: 3.2.0 > > > > For example: > {code:java} > test("join reorder with self-join") { > val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === > nameToAttr("t2.k-1-5"))) > .select(nameToAttr("t1.v-1-10")) > .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5"))) > // this can fail > Optimize.execute(plan.analyze) > } > {code} > error: > {code:java} > [info] java.lang.AssertionError: assertion failed > [info] at scala.Predef$.assert(Predef.scala:208) > [info] at > org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) > [info] at > 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169) > [info] at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34354) CostBasedJoinReorder can fail on self-join
[ https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34354: - Fix Version/s: (was: 3.2.0) > CostBasedJoinReorder can fail on self-join > -- > > Key: SPARK-34354 > URL: https://issues.apache.org/jira/browse/SPARK-34354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > > > For example: > {code:java} > test("join reorder with self-join") { > val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === > nameToAttr("t2.k-1-5"))) > .select(nameToAttr("t1.v-1-10")) > .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5"))) > // this can fail > Optimize.execute(plan.analyze) > } > {code} > error: > {code:java} > [info] java.lang.AssertionError: assertion failed > [info] at scala.Predef$.assert(Predef.scala:208) > [info] at > org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) > [info] at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > [info] at > 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171) > [info] at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169) > [info] at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41) > [info] at > org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34929) MapStatusesSerDeserBenchmark causes a side effect to other benchmarks with tasks being too big (JDK 11)
[ https://issues.apache.org/jira/browse/SPARK-34929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34929: - Description: In JDK 11, MapStatusesSerDeserBenchmark (which fails at startup) seems to affect the other benchmark cases, growing the task size: {code} 2021-03-31T16:46:43.1179145Z 21/03/31 16:46:43 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:47.3079315Z 21/03/31 16:46:47 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:51.5920733Z 21/03/31 16:46:51 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:55.9175194Z 21/03/31 16:46:55 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:57.6874541Z Stopped after 3 iterations, 12928 ms 2021-03-31T16:46:57.6875644Z 2021-03-31T16:46:57.6877153Z OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1041-azure 2021-03-31T16:46:57.7095280Z Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz 2021-03-31T16:46:57.7097654Z from_json as subExpr in Project: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 2021-03-31T16:46:57.7099059Z 2021-03-31T16:46:57.7100274Z subExprElimination false, codegen: true 38880 41246 1389 0.0 388800445.2 1.0X 2021-03-31T16:46:57.7101134Z subExprElimination false, codegen: false 35819 38141 1234 0.0 358188088.6 1.1X 2021-03-31T16:46:57.7106264Z subExprElimination true, codegen: true 3947 4157 364 0.0 39465629.1 9.9X 2021-03-31T16:46:57.7106982Z subExprElimination true, codegen: false 4191 4309 112 0.0 41908945.5 9.3X 2021-03-31T16:46:57.7107595Z 2021-03-31T16:46:57.7135178Z Preparing data for benchmarking ... 2021-03-31T16:46:58.5630584Z Running benchmark: from_json as subExpr in Filter 2021-03-31T16:46:58.5633083Z Running case: subExprElimination false, codegen: true 2021-03-31T16:48:25.5619312Z 21/03/31 16:48:25 WARN DAGScheduler: Broadcasting large task binary with size 43.0 MiB {code} It only happens when the benchmarks run sequentially via Benchmarks.scala. was: In JDK 11, MapStatusesSerDeserBenchmark (which fails at startup) seems to affect the other benchmark cases, growing the task size: ``` 2021-03-31T16:46:43.1179145Z 21/03/31 16:46:43 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:47.3079315Z 21/03/31 16:46:47 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:51.5920733Z 21/03/31 16:46:51 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:55.9175194Z 21/03/31 16:46:55 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:57.6874541Z Stopped after 3 iterations, 12928 ms 2021-03-31T16:46:57.6875644Z 2021-03-31T16:46:57.6877153Z OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1041-azure 2021-03-31T16:46:57.7095280Z Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz 2021-03-31T16:46:57.7097654Z from_json as subExpr in Project: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 2021-03-31T16:46:57.7099059Z 2021-03-31T16:46:57.7100274Z subExprElimination false, codegen: true 38880 41246 1389 0.0 388800445.2 1.0X 2021-03-31T16:46:57.7101134Z subExprElimination false, codegen: false 35819 38141 1234 0.0 358188088.6 1.1X 2021-03-31T16:46:57.7106264Z subExprElimination true, codegen: true 3947 4157 364 0.0 39465629.1 9.9X 2021-03-31T16:46:57.7106982Z subExprElimination true, codegen: false 4191 4309 112 0.0 41908945.5 9.3X 2021-03-31T16:46:57.7107595Z 2021-03-31T16:46:57.7135178Z Preparing data for benchmarking ... 2021-03-31T16:46:58.5630584Z Running benchmark: from_json as subExpr in Filter 2021-03-31T16:46:58.5633083Z Running case: subExprElimination false, codegen: true 2021-03-31T16:48:25.5619312Z 21/03/31 16:48:25 WARN DAGScheduler: Broadcasting large task binary with size 43.0 MiB ``` It only happens when the benchmarks run sequentially via Benchmarks.scala. > MapStatusesSerDeserBenchmark causes a side effect to other benchmarks with > tasks being too big (JDK 11) > --- > > Key: SPARK-34929 > URL:
[jira] [Created] (SPARK-34929) MapStatusesSerDeserBenchmark causes a side effect to other benchmarks with tasks being too big (JDK 11)
Hyukjin Kwon created SPARK-34929: Summary: MapStatusesSerDeserBenchmark causes a side effect to other benchmarks with tasks being too big (JDK 11) Key: SPARK-34929 URL: https://issues.apache.org/jira/browse/SPARK-34929 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.2.0 Reporter: Hyukjin Kwon In JDK 11, MapStatusesSerDeserBenchmark (which fails at startup) seems to affect the other benchmark cases, growing the task size: ``` 2021-03-31T16:46:43.1179145Z 21/03/31 16:46:43 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:47.3079315Z 21/03/31 16:46:47 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:51.5920733Z 21/03/31 16:46:51 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:55.9175194Z 21/03/31 16:46:55 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB 2021-03-31T16:46:57.6874541Z Stopped after 3 iterations, 12928 ms 2021-03-31T16:46:57.6875644Z 2021-03-31T16:46:57.6877153Z OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1041-azure 2021-03-31T16:46:57.7095280Z Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz 2021-03-31T16:46:57.7097654Z from_json as subExpr in Project: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 2021-03-31T16:46:57.7099059Z 2021-03-31T16:46:57.7100274Z subExprElimination false, codegen: true 38880 41246 1389 0.0 388800445.2 1.0X 2021-03-31T16:46:57.7101134Z subExprElimination false, codegen: false 35819 38141 1234 0.0 358188088.6 1.1X 2021-03-31T16:46:57.7106264Z subExprElimination true, codegen: true 3947 4157 364 0.0 39465629.1 9.9X 2021-03-31T16:46:57.7106982Z subExprElimination true, codegen: false 4191 4309 112 0.0 41908945.5 9.3X 2021-03-31T16:46:57.7107595Z 2021-03-31T16:46:57.7135178Z Preparing data for benchmarking ... 2021-03-31T16:46:58.5630584Z Running benchmark: from_json as subExpr in Filter 2021-03-31T16:46:58.5633083Z Running case: subExprElimination false, codegen: true 2021-03-31T16:48:25.5619312Z 21/03/31 16:48:25 WARN DAGScheduler: Broadcasting large task binary with size 43.0 MiB ``` It only happens when the benchmarks run sequentially via Benchmarks.scala.
[jira] [Updated] (SPARK-34928) CTE Execution fails for Sql Server
[ https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Supun De Silva updated SPARK-34928: --- Description: h2. Issue We have a simple Sql statement that we intend to execute on SQL Server. This has a CTE component. Execution of this yields an error like the following {code:java} java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* (version 1.3.1) This is a particularly annoying issue and due to this we are having to write inner queries that are a fair bit inefficient. h2. SQL statement (not the actual one but a simplified version with renamed parameters) {code:sql} WITH OldChanges as ( SELECT distinct SomeDate, Name FROM [dbo].[DateNameFoo] (nolock) WHERE SomeDate != '2021-03-30' AND convert(date, UpdateDateTime) = '2021-03-31' ) SELECT * from OldChanges {code} was: h2. Issue We have a simple Sql statement that we intend to execute on SQL Server. This has a CTE component. Execution of this yields an error like the following {code:java} java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* This is a particularly annoying issue and due to this we are having to write inner queries that are a fair bit inefficient. h2. SQL statement (not the actual one but a simplified version with renamed parameters) {code:sql} WITH OldChanges as ( SELECT distinct SomeDate, Name FROM [dbo].[DateNameFoo] (nolock) WHERE SomeDate != '2021-03-30' AND convert(date, UpdateDateTime) = '2021-03-31' ) SELECT * from OldChanges {code} > CTE Execution fails for Sql Server > -- > > Key: SPARK-34928 > URL: https://issues.apache.org/jira/browse/SPARK-34928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Supun De Silva >Priority: Minor > > h2. Issue > We have a simple Sql statement that we intend to execute on SQL Server. This > has a CTE component. > Execution of this yields an error like the following > {code:java} > java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} > We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* (version > 1.3.1) > This is a particularly annoying issue and due to this we are having to write > inner queries that are a fair bit inefficient. > h2. SQL statement > (not the actual one but a simplified version with renamed parameters) > > {code:sql} > WITH OldChanges as ( >SELECT distinct > SomeDate, > Name >FROM [dbo].[DateNameFoo] (nolock) >WHERE SomeDate != '2021-03-30' >AND convert(date, UpdateDateTime) = '2021-03-31' > ) > SELECT * from OldChanges {code}
[jira] [Updated] (SPARK-34928) CTE Execution fails for Sql Server
[ https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Supun De Silva updated SPARK-34928: --- Priority: Minor (was: Major) > CTE Execution fails for Sql Server > -- > > Key: SPARK-34928 > URL: https://issues.apache.org/jira/browse/SPARK-34928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Supun De Silva >Priority: Minor > > h2. Issue > We have a simple Sql statement that we intend to execute on SQL Server. This > has a CTE component. > Execution of this yields an error like the following > {code:java} > java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} > We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* > This is a particularly annoying issue and due to this we are having to write > inner queries that are a fair bit inefficient. > h2. SQL statement > (not the actual one but a simplified version with renamed parameters) > > {code:sql} > WITH OldChanges as ( >SELECT distinct > SomeDate, > Name >FROM [dbo].[DateNameFoo] (nolock) >WHERE SomeDate != '2021-03-30' >AND convert(date, UpdateDateTime) = '2021-03-31' > ) > SELECT * from OldChanges {code}
[jira] [Updated] (SPARK-34928) CTE Execution fails for Sql Server
[ https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Supun De Silva updated SPARK-34928: --- Description: h2. Issue We have a simple Sql statement that we intend to execute on SQL Server. This has a CTE component. Execution of this yields an error like the following {code:java} java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* This is a particularly annoying issue and due to this we are having to write inner queries that are a fair bit inefficient. h2. SQL statement (not the actual one but a simplified version with renamed parameters) {code:sql} WITH OldChanges as ( SELECT distinct SomeDate, Name FROM [dbo].[DateNameFoo] (nolock) WHERE SomeDate != '2021-03-30' AND convert(date, UpdateDateTime) = '2021-03-31' ) SELECT * from OldChanges {code} was: h2. Issue We have a simple Sql statement that we intend to execute on SQL Server. This has a CTE component. Execution of this yields an error like the following {code:java} java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* This is a particularly annoying issue and due to this we are having to write inner queries that are a fair bit inefficient. h2. SQL statement (not the actual one but simplified renamed parameters) {code:sql} WITH OldChanges as ( SELECT distinct SomeDate, Name FROM [dbo].[DateNameFoo] (nolock) WHERE SomeDate != '2021-03-30' AND convert(date, UpdateDateTime) = '2021-03-31' ) SELECT * from OldChanges {code} > CTE Execution fails for Sql Server > -- > > Key: SPARK-34928 > URL: https://issues.apache.org/jira/browse/SPARK-34928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Supun De Silva >Priority: Major > > h2. Issue > We have a simple Sql statement that we intend to execute on SQL Server. This > has a CTE component. > Execution of this yields an error like the following > {code:java} > java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} > We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* > This is a particularly annoying issue and due to this we are having to write > inner queries that are a fair bit inefficient. > h2. SQL statement > (not the actual one but a simplified version with renamed parameters) > > {code:sql} > WITH OldChanges as ( >SELECT distinct > SomeDate, > Name >FROM [dbo].[DateNameFoo] (nolock) >WHERE SomeDate != '2021-03-30' >AND convert(date, UpdateDateTime) = '2021-03-31' > ) > SELECT * from OldChanges {code}
[jira] [Updated] (SPARK-34928) CTE Execution fails for Sql Server
[ https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Supun De Silva updated SPARK-34928: --- Description: h2. Issue We have a simple Sql statement that we intend to execute on SQL Server. This has a CTE component. Execution of this yields to an error that looks like follows {code:java} java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* This is a particularly annoying issue and due to this we are having to write inner queries that are fair bit inefficient. h2. SQL statement (not the actual one but simplified renamed parameters) {code:sql} WITH OldChanges as ( SELECT distinct SomeDate, Name FROM [dbo].[DateNameFoo] (nolock) WHERE SomeDate!= '2021-03-30' AND convert(date, UpdateDateTime) = '2021-03-31' SELECT * from OldChanges {code} was: h2. Issue We have a simple Sql statement that we intend to execute on SQL Server. This has a CTE component. Execution of this yields to an error that looks like follows {code:java} java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* This is a particularly annoying issue and due to this we are having to write inner queries that are fair bit inefficient. h2. SQL statement (not the actual one but simplified renamed parameters) {code:sql} WITH OldChanges as ( SELECT distinct Date, Name FROM [dbo].[DateNameFoo] (nolock) WHERE Date != '2021-03-30' AND convert(date, UpdateDateTime) = '2021-03-31' SELECT * from OldChanges {code} > CTE Execution fails for Sql Server > -- > > Key: SPARK-34928 > URL: https://issues.apache.org/jira/browse/SPARK-34928 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Supun De Silva >Priority: Major > > h2. Issue > We have a simple Sql statement that we intend to execute on SQL Server. This > has a CTE component. 
> Executing it yields an error like the following: > {code:java} > java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} > We are using the JDBC driver *net.sourceforge.jtds.jdbc.Driver*. > This is a particularly annoying issue, and because of it we have to write > inner queries that are a fair bit less efficient. > h2. SQL statement > (not the actual one, but simplified with renamed parameters) > > {code:sql} > WITH OldChanges as ( >SELECT distinct > SomeDate, > Name >FROM [dbo].[DateNameFoo] (nolock) >WHERE SomeDate != '2021-03-30' >AND convert(date, UpdateDateTime) = '2021-03-31' > ) > SELECT * from OldChanges {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
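The "Incorrect syntax near the keyword 'WITH'" error is consistent with how Spark's JDBC source wraps whatever is passed as `dbtable` (or `query`) in a derived-table subquery before sending it to the database, and SQL Server does not allow `WITH` inside a derived table. The sketch below is illustrative only (the alias name and wrapping string are assumptions, not Spark's actual code) and shows the usual workaround of inlining the CTE as a plain subquery:

```python
# Illustrative sketch (NOT Spark's actual implementation) of why a CTE breaks
# when handed to the JDBC source: the text is wrapped in a derived table.
def wrap_as_dbtable(user_sql: str, alias: str = "SPARK_GEN_SUBQ_0") -> str:
    # Spark effectively issues: SELECT * FROM (<user_sql>) alias
    return f"SELECT * FROM ({user_sql}) {alias}"

cte = """WITH OldChanges AS (
  SELECT DISTINCT SomeDate, Name
  FROM [dbo].[DateNameFoo] (nolock)
  WHERE SomeDate != '2021-03-30'
    AND convert(date, UpdateDateTime) = '2021-03-31'
)
SELECT * FROM OldChanges"""

# SQL Server rejects WITH inside a derived table, hence
# "Incorrect syntax near the keyword 'WITH'".
print(wrap_as_dbtable(cte).startswith("SELECT * FROM (WITH"))  # True

# Workaround: rewrite the CTE as a plain subquery before handing it to Spark;
# this form survives the extra wrapping.
subquery_form = """SELECT * FROM (
  SELECT DISTINCT SomeDate, Name
  FROM [dbo].[DateNameFoo] (nolock)
  WHERE SomeDate != '2021-03-30'
    AND convert(date, UpdateDateTime) = '2021-03-31'
) OldChanges"""
```

Recent Spark releases also added a `prepareQuery` JDBC option for the SQL Server dialect that can carry a `WITH` prefix, if it is available in your version.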
[jira] [Created] (SPARK-34928) CTE Execution fails for Sql Server
Supun De Silva created SPARK-34928: -- Summary: CTE Execution fails for Sql Server Key: SPARK-34928 URL: https://issues.apache.org/jira/browse/SPARK-34928 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.1 Reporter: Supun De Silva h2. Issue We have a simple SQL statement that we intend to execute on SQL Server. It has a CTE component. Executing it yields an error like the following: {code:java} java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code} We are using the JDBC driver *net.sourceforge.jtds.jdbc.Driver*. This is a particularly annoying issue, and because of it we have to write inner queries that are a fair bit less efficient. h2. SQL statement (not the actual one, but simplified with renamed parameters) {code:sql} WITH OldChanges as ( SELECT distinct Date, Name FROM [dbo].[DateNameFoo] (nolock) WHERE Date != '2021-03-30' AND convert(date, UpdateDateTime) = '2021-03-31' ) SELECT * from OldChanges {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks
Hyukjin Kwon created SPARK-34927: Summary: Support TPCDSQueryBenchmark in Benchmarks Key: SPARK-34927 URL: https://issues.apache.org/jira/browse/SPARK-34927 Project: Spark Issue Type: Improvement Components: SQL, Tests Affects Versions: 3.2.0 Reporter: Hyukjin Kwon Benchmarks.scala currently does not support TPCDSQueryBenchmark; we should add support for it. See also https://github.com/apache/spark/pull/32015#issuecomment-89046 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks
[ https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34927: - Priority: Minor (was: Major) > Support TPCDSQueryBenchmark in Benchmarks > - > > Key: SPARK-34927 > URL: https://issues.apache.org/jira/browse/SPARK-34927 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Benchmarks.scala currently does not support TPCDSQueryBenchmark; we should > add support for it. See also > https://github.com/apache/spark/pull/32015#issuecomment-89046 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34926) PartitionUtils.getPathFragment should handle null value
[ https://issues.apache.org/jira/browse/SPARK-34926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312816#comment-17312816 ] Apache Spark commented on SPARK-34926: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32018 > PartitionUtils.getPathFragment should handle null value > --- > > Key: SPARK-34926 > URL: https://issues.apache.org/jira/browse/SPARK-34926 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Partition values can be null, so > PartitionUtils.getPathFragment should handle null too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34926) PartitionUtils.getPathFragment should handle null value
[ https://issues.apache.org/jira/browse/SPARK-34926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34926: Assignee: Apache Spark > PartitionUtils.getPathFragment should handle null value > --- > > Key: SPARK-34926 > URL: https://issues.apache.org/jira/browse/SPARK-34926 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > > Partition values can be null, so > PartitionUtils.getPathFragment should handle null too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34926) PartitionUtils.getPathFragment should handle null value
[ https://issues.apache.org/jira/browse/SPARK-34926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34926: Assignee: (was: Apache Spark) > PartitionUtils.getPathFragment should handle null value > --- > > Key: SPARK-34926 > URL: https://issues.apache.org/jira/browse/SPARK-34926 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Partition values can be null, so > PartitionUtils.getPathFragment should handle null too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34926) PartitionUtils.getPathFragment should handle null value
[ https://issues.apache.org/jira/browse/SPARK-34926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312815#comment-17312815 ] Apache Spark commented on SPARK-34926: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/32018 > PartitionUtils.getPathFragment should handle null value > --- > > Key: SPARK-34926 > URL: https://issues.apache.org/jira/browse/SPARK-34926 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Partition values can be null, so > PartitionUtils.getPathFragment should handle null too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34926) PartitionUtils.getPathFragment should handle null value
angerszhu created SPARK-34926: - Summary: PartitionUtils.getPathFragment should handle null value Key: SPARK-34926 URL: https://issues.apache.org/jira/browse/SPARK-34926 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: angerszhu Partition values can be null, so PartitionUtils.getPathFragment should handle null too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
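For context on what null-safe path-fragment generation means here: Hive-style partition paths render each `column=value` pair, and a null partition value is conventionally encoded with the `__HIVE_DEFAULT_PARTITION__` placeholder. The following is a minimal sketch of that behavior, not Spark's actual `PartitionUtils` code; real Spark also escapes special characters in names and values.

```python
# Minimal sketch of a null-safe getPathFragment, assuming the conventional
# Hive placeholder for null partition values; escaping is omitted here.
DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

def get_path_fragment(spec: dict) -> str:
    parts = []
    for col, value in spec.items():
        # A null (None) value must not crash or be rendered literally;
        # it maps to the default-partition placeholder instead.
        rendered = DEFAULT_PARTITION_NAME if value is None else str(value)
        parts.append(f"{col}={rendered}")
    return "/".join(parts)

print(get_path_fragment({"dt": "2021-04-01", "country": None}))
# dt=2021-04-01/country=__HIVE_DEFAULT_PARTITION__
```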
[jira] [Updated] (SPARK-34925) Spark shell failed to access external file to read content
[ https://issues.apache.org/jira/browse/SPARK-34925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] czh updated SPARK-34925: Attachment: 微信截图_20210401093951.png > Spark shell failed to access external file to read content > -- > > Key: SPARK-34925 > URL: https://issues.apache.org/jira/browse/SPARK-34925 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 3.0.0 >Reporter: czh >Priority: Major > Attachments: 微信截图_20210401093951.png > > > The Spark shell fails to access an external file and read its content. > Spark version 3.0, Hadoop version 3.2. > Start Spark normally, enter the Spark shell, and run val b1 = sc.textFile("s3a://spark test/a.txt") - this works. When running b1.collect().foreach(println) > to traverse and print the contents of the file, an error is reported. > The error message is: java.lang.NumberFormatException: For input string: > "100M" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34925) Spark shell failed to access external file to read content
czh created SPARK-34925: --- Summary: Spark shell failed to access external file to read content Key: SPARK-34925 URL: https://issues.apache.org/jira/browse/SPARK-34925 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 3.0.0 Reporter: czh The Spark shell fails to access an external file and read its content. Spark version 3.0, Hadoop version 3.2. Start Spark normally, enter the Spark shell, and run val b1 = sc.textFile("s3a://spark test/a.txt") - this works. When running b1.collect().foreach(println) to traverse and print the contents of the file, an error is reported. The error message is: java.lang.NumberFormatException: For input string: "100M" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
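The `NumberFormatException: For input string: "100M"` pattern typically means some Hadoop code on the classpath is reading a size property such as `fs.s3a.multipart.size` with a plain-long parser that does not understand the "M" suffix (suffix support was added to that option in later Hadoop releases). Assuming that diagnosis, a common workaround is to supply the value as a bare byte count. The hypothetical helper below just shows the suffix-to-bytes conversion:

```python
# Hypothetical helper: convert a suffixed size like "100M" into plain bytes,
# suitable for e.g. --conf spark.hadoop.fs.s3a.multipart.size=104857600 when
# the Hadoop build in use only accepts a bare long for the property.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def to_bytes(size: str) -> int:
    size = size.strip().upper()
    if size and size[-1] in SUFFIXES:
        return int(size[:-1]) * SUFFIXES[size[-1]]
    return int(size)

print(to_bytes("100M"))  # 104857600
```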
[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311818#comment-17311818 ] Daniel Zhi edited comment on SPARK-23977 at 3/31/21, 11:59 PM: --- [~ste...@apache.org] Thanks for the info. Below are the related (key, value) we used: # spark.hadoop.fs.s3a.committer.name — partitioned # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a — org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory # spark.sql.sources.commitProtocolClass — org.apache.spark.internal.io.cloud.PathOutputCommitProtocol # spark.sql.parquet.output.committer.class — org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter 3 & 4 appear to be necessary to ensure S3A committers being used by Spark for parquet outputs, except that "INSERT OVERWRITE" is blocked by the dynamicPartitionOverwrite exception. It will be helpful and appreciated if you can patiently elaborate on the proper way to "use the partitioned committer and configure it to do the right thing ..." in Spark. For example: * PathOutputCommitProtocol appears to be needed but its constructor fails with the exception. * S3A committers honor "fs.s3a.committer.staging.conflict-mode" which needs to be "replace" for "INSERT OVERWRITE" but "append" for "INSERT INTO". So it is spark.sql query specific. How to make spark.sql automatically set the right value? * Does above require code change in Spark or there is a configuration-only way? was (Author: danzhi): [~ste...@apache.org] Thanks for the info. 
Below are the related (key, value) we used: # spark.hadoop.fs.s3a.committer.name — partitioned # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a — org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory # spark.sql.sources.commitProtocolClass — org.apache.spark.internal.io.cloud.PathOutputCommitProtocol # spark.sql.parquet.output.committer.class — org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter 3 & 4 appear to be necessary to ensure S3A committers being used by Spark for parquet outputs, except that "INSERT OVERWRITE" is blocked by the dynamicPartitionOverwrite exception. It will be helpful and appreciated if you can patiently elaborate on the proper way to "use the partitioned committer and configure it to do the right thing ..." in Spark. For example: * PathOutputCommitProtocol appears to be needed but its constructor fails with the exception. * S3A committers honor "fs.s3a.committer.staging.conflict-mode" which needs to be "replace" for "INSERT OVERWRITE" but "append" for "INSERT INTO". So it is spark.sql query specific. How to make it such way? * Does above require code change in Spark or there is a configuration-only way? > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit, a failure of a task at an stage, including partway through task > commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. > * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to
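The four (key, value) pairs listed in the comment above, plus the conflict-mode setting it discusses, can be written as a spark-defaults.conf fragment. This is a sketch assuming the spark-hadoop-cloud bindings (`org.apache.spark.internal.io.cloud.*`) and hadoop-aws are on the classpath; as the comment notes, the right `conflict-mode` value is query-dependent ("replace" for INSERT OVERWRITE, "append" for INSERT INTO), so a single static setting is not a complete answer.

```properties
# Sketch of spark-defaults.conf for the S3A partitioned committer,
# assuming spark-hadoop-cloud and hadoop-aws are on the classpath.
spark.hadoop.fs.s3a.committer.name                         partitioned
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a  org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.sql.sources.commitProtocolClass                      org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class                   org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
# Conflict handling for the staging committers; "replace" vs "append" is
# per-query semantics, which is the crux of the INSERT OVERWRITE question.
spark.hadoop.fs.s3a.committer.staging.conflict-mode        replace
```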
[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
[ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311818#comment-17311818 ] Daniel Zhi edited comment on SPARK-23977 at 3/31/21, 11:57 PM: --- [~ste...@apache.org] Thanks for the info. Below are the related (key, value) we used: # spark.hadoop.fs.s3a.committer.name — partitioned # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a — org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory # spark.sql.sources.commitProtocolClass — org.apache.spark.internal.io.cloud.PathOutputCommitProtocol # spark.sql.parquet.output.committer.class — org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter 3 & 4 appear to be necessary to ensure S3A committers being used by Spark for parquet outputs, except that "INSERT OVERWRITE" is blocked by the dynamicPartitionOverwrite exception. It will be helpful and appreciated if you can patiently elaborate on the proper way to "use the partitioned committer and configure it to do the right thing ..." in Spark. For example: * PathOutputCommitProtocol appears to be needed but its constructor fails with the exception. * S3A committers honor "fs.s3a.committer.staging.conflict-mode" which needs to be "replace" for "INSERT OVERWRITE" but "append" for "INSERT INTO". So it is spark.sql query specific. How to make it such way? * Does above require code change in Spark or there is a configuration-only way? was (Author: danzhi): [~ste...@apache.org] Thanks for the info. 
Below are the related (key, value) we used: # spark.hadoop.fs.s3a.committer.name --- partitioned # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a --- org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory # spark.sql.sources.commitProtocolClass --- org.apache.spark.internal.io.cloud.PathOutputCommitProtocol # spark.sql.parquet.output.committer.class --- org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter 3 & 4 appear to be necessary to ensure S3A committers being used by Spark for parquet outputs, except that "INSERT OVERWRITE" is blocked by the dynamicPartitionOverwrite exception. It will be helpful and appreciated if you can patiently elaborate on the proper way to "use the partitioned committer and configure it to do the right thing ..." in Spark. > Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism > --- > > Key: SPARK-23977 > URL: https://issues.apache.org/jira/browse/SPARK-23977 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Minor > Fix For: 3.0.0 > > > Hadoop 3.1 adds a mechanism for job-specific and store-specific committers > (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, > HADOOP-13786 > These committers deliver high-performance output of MR and spark jobs to S3, > and offer the key semantics which Spark depends on: no visible output until > job commit, a failure of a task at an stage, including partway through task > commit, can be handled by executing and committing another task attempt. > In contrast, the FileOutputFormat commit algorithms on S3 have issues: > * Awful performance because files are copied by rename > * FileOutputFormat v1: weak task commit failure recovery semantics as the > (v1) expectation: "directory renames are atomic" doesn't hold. 
> * S3 metadata eventual consistency can cause rename to miss files or fail > entirely (SPARK-15849) > Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of > the commit semantics w.r.t observability of or recovery from task commit > failure, on any filesystem. > The S3A committers address these by way of uploading all data to the > destination through multipart uploads, uploads which are only completed in > job commit. > The new {{PathOutputCommitter}} factory mechanism allows applications to work > with the S3A committers and any other, by adding a plugin mechanism into the > MRv2 FileOutputFormat class, where it job config and filesystem configuration > options can dynamically choose the output committer. > Spark can use these with some binding classes to > # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 > classes and {{PathOutputCommitterFactory}} to create the committers. > # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}} > to wire up Parquet output even when code requires the committer to be a > subclass of {{ParquetOutputCommitter}} > This patch builds on SPARK-23807 for setting up the dependencies. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To
[jira] [Comment Edited] (SPARK-34910) JDBC - Add an option for different stride orders
[ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312702#comment-17312702 ] Jason Yarbrough edited comment on SPARK-34910 at 3/31/21, 11:51 PM: Hi [~srowen], thanks for asking. The idea here is that if a user's data is heavily skewed to the "right" (or upper bound), the last strides (and therefore partitions/tasks) will have the highest density of data. Say a stage has 32 tasks, and the first 30 finish at the same time; the last two tasks, which are the heaviest, would then start. The issue with this is, 2 cores would be running for a long time while the other cores are sitting idle (since there are no more tasks left). The hope is, if we have the option to reverse that, the first 2 tasks will be the heaviest, and while 2 cores work on those, the other cores will be working on the next tasks. As we increase the partition count, we try to flatten out this issue, but having the option to do descending could still be of help. As for a random stride order, it's just another option that may help flatten the distribution (even more so with a higher partition count). This one is a little more of an edge usage, but could help break apart some hot spots, and is easy to implement within the pattern I've developed. On somewhat of a side-note (although I may try to implement this in the current code, but may save it for another branch), I think something that would be pretty nice to add is a ranking of the density of partitions. Depending on if the column is indexed (I would recommend that for columns people are partitioning on), the performance impact of doing the extra queries/count may not be that bad. Once a count of records per partition is created, we could order the partition array in a way where the most dense partitions are towards the head. 
Implementing what I'm proposing here gives people more options on how their data is processed, and can also be extended for other algorithms if they make sense. was (Author: hanover-fiste): Hi [~srowen], thanks for asking. The idea here is that if a user's data is heavily skewed to the "right" (or upper bound), the last strides (and therefore partitions/tasks) will have the highest density of data. Say a stage has 32 tasks, and the first 30 finish at the same time; the last two tasks, which are the heaviest, would then start. The issue with this is, 2 cores would be running for a long time while the other cores are sitting idle (since there are no more tasks left). The hope is, if we have the option to reverse that, the first 2 tasks will be the heaviest, and while 2 cores work on those, the other cores will be working on the next tasks. As we increase the partition count, we try to flatten out this issue, but having the option to do descending could still be of help. As for a random stride order, it's just another option that may help flatten the distribution (even more so with a higher partition count). This one is a little more of a edge usage, but could help break apart some hot spots, and is easy to implement within the pattern I've developed. On somewhat of a side-note (although I may try to implement this in the current code, but may save it for another branch), I think something that would be pretty nice to add is a ranking of the density of partitions. Depending on if the column is indexed (I would recommend that for columns people are partitioning on), the performance impact of doing the extra queries/count may not be that bad. Once a count of records per partition is created, we could order the partition array in a way where the most dense partitions are towards the head. Implementing what I'm proposing here gives people more options on how their data is processed, and can also be extended for other algorithms if they make sense. 
> JDBC - Add an option for different stride orders > > > Key: SPARK-34910 > URL: https://issues.apache.org/jira/browse/SPARK-34910 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jason Yarbrough >Priority: Trivial > > Currently, the JDBCRelation columnPartition function orders the strides in > ascending order, starting from the lower bound and working its way towards > the upper bound. > I'm proposing leaving that as the default, but adding an option (such as > strideOrder) in JDBCOptions. Since it will default to the current behavior, > this will keep people's current code working as expected. However, people who > may have data skew closer to the upper bound might appreciate being able to > have the strides in descending order, thus filling up the first partition > with the last stride and so forth. Also, people with nondeterministic data >
[jira] [Resolved] (SPARK-34882) RewriteDistinctAggregates can cause a bug if the aggregator does not ignore NULLs
[ https://issues.apache.org/jira/browse/SPARK-34882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-34882. -- Fix Version/s: 3.2.0 Assignee: Tanel Kiis Resolution: Fixed Resolved by https://github.com/apache/spark/pull/31983 > RewriteDistinctAggregates can cause a bug if the aggregator does not ignore > NULLs > - > > Key: SPARK-34882 > URL: https://issues.apache.org/jira/browse/SPARK-34882 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.2.0, 3.1.2, 3.0.3 >Reporter: Tanel Kiis >Assignee: Tanel Kiis >Priority: Major > Labels: correctness > Fix For: 3.2.0 > > > {code:title=group-by.sql} > SELECT > first(DISTINCT a), last(DISTINCT a), > first(a), last(a), > first(DISTINCT b), last(DISTINCT b), > first(b), last(b) > FROM testData WHERE a IS NOT NULL AND b IS NOT NULL;{code} > {code:title=group-by.sql.out} > -- !query schema > struct<first(DISTINCT a):int,last(DISTINCT a):int,first(a):int,last(a):int,first(DISTINCT b):int,last(DISTINCT > b):int,first(b):int,last(b):int> > -- !query output > NULL 1 1 3 1 NULL 1 2 > {code} > The results should not be NULL, because NULL inputs are filtered out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders
[ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312749#comment-17312749 ] Jason Yarbrough commented on SPARK-34910: - It's not that the records/rows themselves are skewed, it's that number of records within a stride can be skewed (e.g. you're partitioning on year and later years have much more data). Say we have 11 partitions, 5 cores total, and the last partition has triple the amount of data as the first 10 due to data skew. We'll say partitions 1 through 10 take 10 minutes a core, so partition 11 would take 30 minutes on a core. Current stride order of *ascending*: _Group 1_ 5 cores x 10 minutes = 10 minutes (all running concurrently) total of 10 minutes _Group 2_ 5 x 10 = 10 (all running concurrently) total of 10 minutes _Group 3_ 1 x 30 = 30 4 cores sitting idle total of 30 minutes Group 1 (10) + Group 2 (10) + Group 3 (30) *Total of 50 minutes* Stride order of *descending*: _Group 1_ 1 core x 30 minutes = 30 minutes 4 x 10 = 10 (running concurrently with the first core) 4 x 10 = 10 (running concurrently with the first core) 2 x 10 = 10 (running concurrently with the first core) (running concurrently) total of 30 minutes *Total of 30 minutes* Hopefully that's a good example. However, let me know if I've gone off the deep end :) > JDBC - Add an option for different stride orders > > > Key: SPARK-34910 > URL: https://issues.apache.org/jira/browse/SPARK-34910 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jason Yarbrough >Priority: Trivial > > Currently, the JDBCRelation columnPartition function orders the strides in > ascending order, starting from the lower bound and working its way towards > the upper bound. > I'm proposing leaving that as the default, but adding an option (such as > strideOrder) in JDBCOptions. Since it will default to the current behavior, > this will keep people's current code working as expected. 
However, people who > may have data skew closer to the upper bound might appreciate being able to > have the strides in descending order, thus filling up the first partition > with the last stride and so forth. Also, people with nondeterministic data > skew or sporadic data density might be able to benefit from a random ordering > of the strides. > I have the code created to implement this, and it creates a pattern that can > be used to add other algorithms that people may want to add (such as counting > the rows and ranking each stride, and then ordering from most dense to > least). The current two options I have coded are 'descending' and 'random.' > The original idea was to create something closer to Spark's hash partitioner, > but for JDBC and pushed down to the database engine for efficiency. However, > that would require adding hashing algorithms for each dialect, and the > performance from those algorithms may outweigh the benefit. The method I'm > proposing in this ticket avoids those complexities while still giving some of > the benefit (in the case of random ordering). > I'll put a PR in if others feel this is a good idea. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
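The proposal above can be sketched numerically. The snippet below is not `JDBCRelation.columnPartition` itself (names and boundary handling are simplified assumptions), but it shows what ascending, descending, and random stride orderings of the generated per-partition WHERE clauses would look like:

```python
import random

# Minimal sketch (not Spark's actual columnPartition): build per-partition
# WHERE clauses over [lower, upper) and optionally reorder the strides.
def strides(column, lower, upper, num_partitions, order="ascending", seed=None):
    step = (upper - lower) // num_partitions
    bounds = [lower + i * step for i in range(num_partitions)] + [upper]
    clauses = []
    for i in range(num_partitions):
        lo, hi = bounds[i], bounds[i + 1]
        if i == 0:
            clauses.append(f"{column} < {hi}")           # open lower end
        elif i == num_partitions - 1:
            clauses.append(f"{column} >= {lo}")          # open upper end
        else:
            clauses.append(f"{column} >= {lo} AND {column} < {hi}")
    if order == "descending":
        clauses.reverse()                                 # heaviest-last data runs first
    elif order == "random":
        random.Random(seed).shuffle(clauses)              # spread out hot spots
    return clauses

print(strides("year", 2000, 2022, 4, order="descending")[0])
# year >= 2015  (the stride nearest the upper bound is scheduled first)
```

With `order="descending"`, the stride that holds the skewed upper-bound data becomes the first task scheduled, which is exactly the 30-minutes-vs-50-minutes effect walked through in the comment above.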
[jira] [Commented] (SPARK-34923) Metadata output should not always be propagated
[ https://issues.apache.org/jira/browse/SPARK-34923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312743#comment-17312743 ] Apache Spark commented on SPARK-34923: -- User 'karenfeng' has created a pull request for this issue: https://github.com/apache/spark/pull/32017 > Metadata output should not always be propagated > --- > > Key: SPARK-34923 > URL: https://issues.apache.org/jira/browse/SPARK-34923 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karen Feng >Priority: Major > > Today, the vast majority of expressions uncritically propagate metadata > output from their children. As a general rule, it seems reasonable that only > expressions that propagate their children's output should also propagate > their children's metadata output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34923) Metadata output should not always be propagated
[ https://issues.apache.org/jira/browse/SPARK-34923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34923: Assignee: Apache Spark > Metadata output should not always be propagated > --- > > Key: SPARK-34923 > URL: https://issues.apache.org/jira/browse/SPARK-34923 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karen Feng >Assignee: Apache Spark >Priority: Major > > Today, the vast majority of expressions uncritically propagate metadata > output from their children. As a general rule, it seems reasonable that only > expressions that propagate their children's output should also propagate > their children's metadata output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34923) Metadata output should not always be propagated
[ https://issues.apache.org/jira/browse/SPARK-34923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34923: Assignee: (was: Apache Spark) > Metadata output should not always be propagated > --- > > Key: SPARK-34923 > URL: https://issues.apache.org/jira/browse/SPARK-34923 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Karen Feng >Priority: Major > > Today, the vast majority of expressions uncritically propagate metadata > output from their children. As a general rule, it seems reasonable that only > expressions that propagate their children's output should also propagate > their children's metadata output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders
[ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312708#comment-17312708 ] Sean R. Owen commented on SPARK-34910: -- In this example, how are the rows 'skewed'? If there are 1M rows there are just 1M rows to process in any event. Is this related to the result of an aggregation, or are the operations on certain rows unnaturally long-running? If so, why would it tend to be the first or last partition that's slow - isn't it smaller, if anything? And it would just be 1 partition in any event. > JDBC - Add an option for different stride orders > > > Key: SPARK-34910 > URL: https://issues.apache.org/jira/browse/SPARK-34910 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jason Yarbrough >Priority: Trivial > > Currently, the JDBCRelation columnPartition function orders the strides in > ascending order, starting from the lower bound and working its way towards > the upper bound. > I'm proposing leaving that as the default, but adding an option (such as > strideOrder) in JDBCOptions. Since it will default to the current behavior, > this will keep people's current code working as expected. However, people who > may have data skew closer to the upper bound might appreciate being able to > have the strides in descending order, thus filling up the first partition > with the last stride and so forth. Also, people with nondeterministic data > skew or sporadic data density might be able to benefit from a random ordering > of the strides. > I have the code created to implement this, and it creates a pattern that can > be used to add other algorithms that people may want to add (such as counting > the rows and ranking each stride, and then ordering from most dense to > least). The two options I have coded so far are 'descending' and 'random.' 
> The original idea was to create something closer to Spark's hash partitioner, > but for JDBC and pushed down to the database engine for efficiency. However, > that would require adding hashing algorithms for each dialect, and the > performance from those algorithms may outweigh the benefit. The method I'm > proposing in this ticket avoids those complexities while still giving some of > the benefit (in the case of random ordering). > I'll put a PR in if others feel this is a good idea. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders
[ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312702#comment-17312702 ] Jason Yarbrough commented on SPARK-34910: - Hi [~srowen], thanks for asking. The idea here is that if a user's data is heavily skewed to the "right" (or upper bound), the last strides (and therefore partitions/tasks) will have the highest density of data. Say a stage has 32 tasks and the first 30 finish at the same time; the last two tasks, which are the heaviest, would then start. The issue with this is that 2 cores would be running for a long time while the other cores are sitting idle (since there are no more tasks left). The hope is that, if we have the option to reverse that, the first 2 tasks will be the heaviest, and while 2 cores work on those, the other cores will be working on the next tasks. As we increase the partition count, we try to flatten out this issue, but having the option to do descending could still be of help. As for a random stride order, it's just another option that may help flatten the distribution (even more so with a higher partition count). This one is a little more of an edge case, but could help break apart some hot spots, and is easy to implement within the pattern I've developed. On somewhat of a side note (although I may try to implement this in the current code, but may save it for another branch), I think something that would be pretty nice to add is a ranking of the density of partitions. Depending on whether the column is indexed (I would recommend that for columns people are partitioning on), the performance impact of doing the extra queries/count may not be that bad. Once a count of records per partition is created, we could order the partition array so that the most dense partitions are towards the head. Implementing what I'm proposing here gives people more options on how their data is processed, and can also be extended for other algorithms if they make sense. 
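The reordering step proposed in this thread is easy to picture with a small sketch. The following is a hypothetical Python model of the stride reordering only, not Spark's actual JDBCRelation.columnPartition code; the function and option names (`order_strides`, `stride_order`) are invented for illustration.

```python
import random

def order_strides(lower, upper, num_partitions, stride_order="ascending", seed=None):
    """Build per-partition [start, end) stride bounds, then reorder them.

    A toy model of what a hypothetical `strideOrder` JDBC option might do;
    the real Spark logic lives in JDBCRelation.columnPartition and differs
    in detail (e.g. handling of nulls and non-even strides).
    """
    stride = (upper - lower) // num_partitions
    bounds = []
    start = lower
    for i in range(num_partitions):
        # Last partition absorbs any remainder so the full range is covered.
        end = upper if i == num_partitions - 1 else start + stride
        bounds.append((start, end))
        start = end
    if stride_order == "descending":
        bounds.reverse()  # heaviest (last) stride runs first when data skews high
    elif stride_order == "random":
        random.Random(seed).shuffle(bounds)  # spread hot spots across the schedule
    return bounds

print(order_strides(0, 100, 4, "descending"))
# → [(75, 100), (50, 75), (25, 50), (0, 25)]
```

With data skewed toward the upper bound, 'descending' schedules the heaviest stride first, so the remaining cores stay busy on lighter strides while it runs instead of idling at the end of the stage.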
[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312696#comment-17312696 ] Shane Knapp edited comment on SPARK-34738 at 3/31/21, 8:38 PM: --- managed to snag the logs from the pod when it errored out: {code:java} ++ id -u + myuid=185 ++ id -g + mygid=0 + set +e ++ getent passwd 185 + uidentry= + set -e + '[' -z '' ']' + '[' -w /etc/passwd ']' + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' + SPARK_CLASSPATH=':/opt/spark/jars/*' + env + grep SPARK_JAVA_OPT_ + sort -t_ -k4 -n + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_EXECUTOR_JAVA_OPTS + '[' -n '' ']' + '[' -z ']' + '[' -z ']' + '[' -n '' ']' + '[' -z ']' + '[' -z x ']' + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*' + case "$1" in + shift 1 + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@") + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.DFSReadWriteTest local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar /opt/spark/pv-tests/tmp4595937990978494271.txt /opt/spark/pv-tests 21/03/31 20:26:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist DFS Read-Write Test Usage: localFile dfsDir localFile - (string) local file to use in test dfsDir - (string) DFS directory for read/write tests log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager). log4j:WARN Please initialize the log4j system properly. 
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.{code} this def caught my eye: Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist i sshed in to the cluster, was able (again) to confirm that mk was able to mount the PVC test dir on that worker in /tmp, and that the file tmp4595937990978494271.txt was visible and readable from within mk... however /opt/spark/pv-tests/ wasn't visible within the mk cluster. was (Author: shaneknapp): managed to snag the logs from the pod when it errored out: {code:java} ++ id -u + myuid=185 ++ id -g + mygid=0 + set +e ++ getent passwd 185 + uidentry= + set -e + '[' -z '' ']' + '[' -w /etc/passwd ']' + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' + SPARK_CLASSPATH=':/opt/spark/jars/*' + env + grep SPARK_JAVA_OPT_ + sort -t_ -k4 -n + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_EXECUTOR_JAVA_OPTS + '[' -n '' ']' + '[' -z ']' + '[' -z ']' + '[' -n '' ']' + '[' -z ']' + '[' -z x ']' + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*' + case "$1" in + shift 1 + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@") + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.DFSReadWriteTest local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar /opt/spark/pv-tests/tmp4595937990978494271.txt /opt/spark/pv-tests 21/03/31 20:26:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist DFS Read-Write Test Usage: localFile dfsDir localFile - (string) local file to use in test dfsDir - (string) DFS directory for read/write tests log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.{code} this def caught my eye: Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist i sshed in to the cluster, was able (again) to confirm that mk was able to mount the PVC test dir on that worker, and that the file tmp4595937990978494271.txt was visible and readable from within mk... > Upgrade Minikube and kubernetes cluster version on Jenkins > -- > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > > [~shaneknapp] as we discussed [on the mailing >
[jira] [Created] (SPARK-34924) Add nested column scan in DataSourceReadBenchmark
Cheng Su created SPARK-34924: Summary: Add nested column scan in DataSourceReadBenchmark Key: SPARK-34924 URL: https://issues.apache.org/jira/browse/SPARK-34924 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Cheng Su For the table scan benchmark `DataSourceReadBenchmark.scala`, we are only testing atomic data types but not nested column data types (array/struct/map) - [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala#L544-L567] . We should add a scan benchmark for nested column data types as well.
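The benchmark itself is Scala code inside Spark, but the coverage gap being described is easy to sketch: rows whose columns mix an atomic type with the three nested types. The Python below is only a stand-in illustration of the kind of scan such a benchmark would time, not the DataSourceReadBenchmark code; all names are invented.

```python
import time

def make_rows(n):
    # Each row mixes an atomic column with nested array/struct/map-like
    # columns, mirroring the data type coverage the benchmark is missing.
    return [
        {
            "id": i,                          # atomic
            "arr": [i, i + 1, i + 2],         # array
            "struct": {"a": i, "b": str(i)},  # struct
            "map": {str(i): i},               # map
        }
        for i in range(n)
    ]

def scan(rows, get):
    """Time a full scan that extracts one field per row and sums it."""
    start = time.perf_counter()
    total = sum(get(r) for r in rows)
    return total, time.perf_counter() - start

rows = make_rows(100_000)
flat_total, flat_t = scan(rows, lambda r: r["id"])
nested_total, nested_t = scan(rows, lambda r: r["struct"]["a"])
print(f"flat: {flat_t:.4f}s  nested: {nested_t:.4f}s")
```

In the real benchmark the interesting axis is the columnar readers (Parquet/ORC vectorized vs. non-vectorized) rather than Python dict access, but the shape of the comparison is the same: identical totals, different per-type scan costs.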
[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312696#comment-17312696 ] Shane Knapp commented on SPARK-34738: - managed to snag the logs from the pod when it errored out: {code:java} ++ id -u + myuid=185 ++ id -g + mygid=0 + set +e ++ getent passwd 185 + uidentry= + set -e + '[' -z '' ']' + '[' -w /etc/passwd ']' + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' + SPARK_CLASSPATH=':/opt/spark/jars/*' + env + grep SPARK_JAVA_OPT_ + sort -t_ -k4 -n + sed 's/[^=]*=\(.*\)/\1/g' + readarray -t SPARK_EXECUTOR_JAVA_OPTS + '[' -n '' ']' + '[' -z ']' + '[' -z ']' + '[' -n '' ']' + '[' -z ']' + '[' -z x ']' + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*' + case "$1" in + shift 1 + CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@") + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.DFSReadWriteTest local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar /opt/spark/pv-tests/tmp4595937990978494271.txt /opt/spark/pv-tests 21/03/31 20:26:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist DFS Read-Write Test Usage: localFile dfsDir localFile - (string) local file to use in test dfsDir - (string) DFS directory for read/write tests log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager). log4j:WARN Please initialize the log4j system properly. 
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.{code} this def caught my eye: Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist i sshed in to the cluster, was able (again) to confirm that mk was able to mount the PVC test dir on that worker, and that the file tmp4595937990978494271.txt was visible and readable from within mk... > Upgrade Minikube and kubernetes cluster version on Jenkins > -- > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33638) Full support of V2 table creation in Structured Streaming writer path
[ https://issues.apache.org/jira/browse/SPARK-33638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-33638: - Priority: Major (was: Blocker) > Full support of V2 table creation in Structured Streaming writer path > - > > Key: SPARK-33638 > URL: https://issues.apache.org/jira/browse/SPARK-33638 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > Currently, we want to add support of creating if not exists in > DataStreamWriter.toTable API. Since the file format in streaming doesn't > support DSv2 for now, the current implementation mainly focuses on V1 > support. We need more work to do for the full support of V2 table creation. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33694) Auditing the API changes and behavior changes in Spark 3.1
[ https://issues.apache.org/jira/browse/SPARK-33694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-33694. -- Resolution: Done > Auditing the API changes and behavior changes in Spark 3.1 > -- > > Key: SPARK-33694 > URL: https://issues.apache.org/jira/browse/SPARK-33694 > Project: Spark > Issue Type: Epic > Components: PySpark, Spark Core, SparkR, SQL, Structured Streaming >Affects Versions: 3.1.0 >Reporter: Xiao Li >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34425) Improve error message in default spark session catalog
[ https://issues.apache.org/jira/browse/SPARK-34425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-34425: - Issue Type: Improvement (was: Bug) Priority: Minor (was: Blocker) Feel free to revert the status if I'm missing why it needs to be a Blocker > Improve error message in default spark session catalog > -- > > Key: SPARK-34425 > URL: https://issues.apache.org/jira/browse/SPARK-34425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > See [https://github.com/apache/spark/pull/31541/files] > We should make sure that "unsupported function name 'default.ns1.ns2.fun'" > contains information about why the function name is unsupported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34923) Metadata output should not always be propagated
Karen Feng created SPARK-34923: -- Summary: Metadata output should not always be propagated Key: SPARK-34923 URL: https://issues.apache.org/jira/browse/SPARK-34923 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Karen Feng Today, the vast majority of expressions uncritically propagate metadata output from their children. As a general rule, it seems reasonable that only expressions that propagate their children's output should also propagate their children's metadata output.
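The proposed rule can be modeled in a few lines: a plan node exposes its children's metadata output only if it also passes their regular output through. This is a hypothetical sketch, not Catalyst's actual LogicalPlan API; the `Node` class and its fields are invented for illustration.

```python
class Node:
    """Toy logical-plan node; names here are illustrative, not Catalyst's."""

    def __init__(self, name, children=(), output=None, metadata=()):
        self.name = name
        self.children = list(children)
        # A node constructed with output=None simply passes its children's
        # output through (like a Filter); otherwise it defines its own.
        self._output = output
        self._metadata = list(metadata)

    @property
    def output(self):
        if self._output is not None:
            return self._output
        return [c for child in self.children for c in child.output]

    @property
    def metadata_output(self):
        # Proposed rule: only propagate the children's metadata output when
        # the node also propagates the children's regular output.
        if self._output is not None:
            return self._metadata
        return [m for child in self.children for m in child.metadata_output]

scan = Node("scan", output=["col1"], metadata=["_file"])
filt = Node("filter", children=[scan])                       # passes output through
proj = Node("project", children=[filt], output=["renamed"])  # defines its own output

print(filt.metadata_output)  # ['_file'] -- filter propagates output, so metadata too
print(proj.metadata_output)  # []        -- project defines its own output
```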
[jira] [Resolved] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala
[ https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34904. -- Resolution: Not A Problem > Old import of LZ4 package inside CompressionCodec.scala > > > Key: SPARK-34904 > URL: https://issues.apache.org/jira/browse/SPARK-34904 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.0, 3.1.1 >Reporter: Michal Zeman >Priority: Minor > > This commit should upgrade the version of the LZ4 package: > [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f] > > The dependency was changed. However, inside the file [CompressionCodec.scala > |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala] > the old import referencing net.jpountz.lz4 (where versions up to 1.3. are > held) remains. > Because of probably backward compatibility, the newer version of org.lz4 > package still contains net.jpountz.lz4 > ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. > Therefore this import does not cause problems at a first sight. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34903) Return day-time interval from timestamps subtraction
[ https://issues.apache.org/jira/browse/SPARK-34903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34903: Assignee: Max Gekk (was: Apache Spark) > Return day-time interval from timestamps subtraction > > > Key: SPARK-34903 > URL: https://issues.apache.org/jira/browse/SPARK-34903 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Modify SubtractTimestamps to return DayTimeIntervalType (under a flag). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34903) Return day-time interval from timestamps subtraction
[ https://issues.apache.org/jira/browse/SPARK-34903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312568#comment-17312568 ] Apache Spark commented on SPARK-34903: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/32016 > Return day-time interval from timestamps subtraction > > > Key: SPARK-34903 > URL: https://issues.apache.org/jira/browse/SPARK-34903 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Modify SubtractTimestamps to return DayTimeIntervalType (under a flag). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34903) Return day-time interval from timestamps subtraction
[ https://issues.apache.org/jira/browse/SPARK-34903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34903: Assignee: Apache Spark (was: Max Gekk) > Return day-time interval from timestamps subtraction > > > Key: SPARK-34903 > URL: https://issues.apache.org/jira/browse/SPARK-34903 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Modify SubtractTimestamps to return DayTimeIntervalType (under a flag). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
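The semantics being added here can be illustrated outside Spark: subtracting two timestamps yields a day-time interval, i.e. an exact count of days/hours/minutes/seconds, rather than a calendar interval. A Python sketch of that decomposition follows; it is a stand-in for the semantics, not Spark's SubtractTimestamps implementation, and the function name is invented.

```python
from datetime import datetime

def day_time_interval(end: datetime, start: datetime):
    """Decompose end - start into (days, hours, minutes, seconds) --
    the components a day-time interval carries. Sub-second precision
    is dropped here for brevity."""
    total = int((end - start).total_seconds())
    sign = -1 if total < 0 else 1
    total = abs(total)
    days, rem = divmod(total, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, seconds = divmod(rem, 60)
    return (sign * days, sign * hours, sign * minutes, sign * seconds)

print(day_time_interval(datetime(2021, 4, 2, 3, 30, 15), datetime(2021, 3, 31)))
# → (2, 3, 30, 15)
```

The key contrast with a calendar (year-month) interval is that every component is derived from an exact number of elapsed seconds, so the result is unambiguous regardless of month lengths.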
[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312548#comment-17312548 ] Shane Knapp commented on SPARK-34738: - fwiw, the pod is mounting the local FS correctly: {code:java} jenkins@research-jenkins-worker-08:~$ minikube ssh docker@minikube:~$ cd /tmp docker@minikube:/tmp$ ls gvisor h.829 h.912 hostpath-provisioner hostpath_pv tmp.iSJp3I8otl docker@minikube:/tmp$ cd tmp.iSJp3I8otl/ docker@minikube:/tmp/tmp.iSJp3I8otl$ touch ASDF docker@minikube:/tmp/tmp.iSJp3I8otl$ logout jenkins@research-jenkins-worker-08:~$ cd /tmp/tmp.iSJp3I8otl/ jenkins@research-jenkins-worker-08:/tmp/tmp.iSJp3I8otl$ ls ASDF tmp3533667297442374713.txt jenkins@research-jenkins-worker-08:/tmp/tmp.iSJp3I8otl$ cat tmp3533667297442374713.txt test PVs{code} > Upgrade Minikube and kubernetes cluster version on Jenkins > -- > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala
[ https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312547#comment-17312547 ] Michal Zeman commented on SPARK-34904: -- [~srowen] Nope, sorry embarrassing. I have mistaken the location of artifact and namespace of actual code... > Old import of LZ4 package inside CompressionCodec.scala > > > Key: SPARK-34904 > URL: https://issues.apache.org/jira/browse/SPARK-34904 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.0, 3.1.1 >Reporter: Michal Zeman >Priority: Minor > > This commit should upgrade the version of the LZ4 package: > [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f] > > The dependency was changed. However, inside the file [CompressionCodec.scala > |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala] > the old import referencing net.jpountz.lz4 (where versions up to 1.3. are > held) remains. > Because of probably backward compatibility, the newer version of org.lz4 > package still contains net.jpountz.lz4 > ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. > Therefore this import does not cause problems at a first sight. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala
[ https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michal Zeman updated SPARK-34904: - Issue Type: Question (was: Bug) > Old import of LZ4 package inside CompressionCodec.scala > > > Key: SPARK-34904 > URL: https://issues.apache.org/jira/browse/SPARK-34904 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.0, 3.1.1 >Reporter: Michal Zeman >Priority: Minor > > This commit should upgrade the version of the LZ4 package: > [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f] > > The dependency was changed. However, inside the file [CompressionCodec.scala > |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala] > the old import referencing net.jpountz.lz4 (where versions up to 1.3. are > held) remains. > Because of probably backward compatibility, the newer version of org.lz4 > package still contains net.jpountz.lz4 > ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. > Therefore this import does not cause problems at a first sight. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312517#comment-17312517 ] Shane Knapp edited comment on SPARK-34738 at 3/31/21, 4:43 PM: --- alright, sometimes these things go smoothly, sometimes not. this is firmly in the 'not' camp. after upgrading minikube and k8s, i was unable to mount a persistent volume when using the kvm2 driver. much debugging ensued. no progress was made and the error reported was that the minikube pod was unable to connect to the localhost and mount (Connection refused). so, i decided to randomly try the docker minikube driver. voila! i'm now able to happily mount persistent volumes. however, when running the k8s integration test, everything passes *except* the PVs w/local storage. from [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone:] {code:java} - PVs with local storage *** FAILED *** The code passed to eventually never returned normally. Attempted 179 times over 3.00242447046 minutes. Last failure message: container not found ("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code} i've never seen this error before, and apparently there aren't many things here's how we launch minikube and create the mount: {code:java} minikube --vm-driver=docker start --memory 6000 --cpus 8 minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L & {code} we're using ZFS on the bare metal, and minikube is complaining: {code:java} ! docker is currently using the zfs storage driver, consider switching to overlay2 for better performance{code} i'll continue to dig in to this today, but i'm currently blocked... was (Author: shaneknapp): alright, sometimes these things go smoothly, sometimes not. this is firmly in the 'not' camp. after upgrading minikube and k8s, i was unable to mount a persistent volume when using the kvm2 driver. much debugging ensued. no progress was made. so, i decided to randomly try the docker minikube driver. 
voila! i'm now able to happily mount persistent volumes. however, when running the k8s integration test, everything passes *except* the PVs w/local storage. from [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone:] {code:java} - PVs with local storage *** FAILED *** The code passed to eventually never returned normally. Attempted 179 times over 3.00242447046 minutes. Last failure message: container not found ("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code} i've never seen this error before, and apparently there aren't many things here's how we launch minikube and create the mount: {code:java} minikube --vm-driver=docker start --memory 6000 --cpus 8 minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L & {code} we're using ZFS on the bare metal, and minikube is complaining: {code:java} ! docker is currently using the zfs storage driver, consider switching to overlay2 for better performance{code} i'll continue to dig in to this today, but i'm currently blocked... > Upgrade Minikube and kubernetes cluster version on Jenkins > -- > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312517#comment-17312517 ] Shane Knapp edited comment on SPARK-34738 at 3/31/21, 4:20 PM: --- alright, sometimes these things go smoothly, sometimes not. this is firmly in the 'not' camp. after upgrading minikube and k8s, i was unable to mount a persistent volume when using the kvm2 driver. much debugging ensued. no progress was made. so, i decided to randomly try the docker minikube driver. voila! i'm now able to happily mount persistent volumes. however, when running the k8s integration test, everything passes *except* the PVs w/local storage. from [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone:] {code:java} - PVs with local storage *** FAILED *** The code passed to eventually never returned normally. Attempted 179 times over 3.00242447046 minutes. Last failure message: container not found ("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code} i've never seen this error before, and apparently there aren't many things here's how we launch minikube and create the mount: {code:java} minikube --vm-driver=docker start --memory 6000 --cpus 8 minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L & {code} we're using ZFS on the bare metal, and minikube is complaining: {code:java} ! docker is currently using the zfs storage driver, consider switching to overlay2 for better performance{code} i'll continue to dig in to this today, but i'm currently blocked... was (Author: shaneknapp): alright, sometimes these things go smoothly, sometimes not. this is firmly in the 'not' camp. after upgrading minikube and k8s, i was unable to mount a persistent volume when using the kvm2 driver. much debugging ensued. no progress was made. so, i decided to randomly try the docker minikube driver. voila! i'm now able to happily mount persistent volumes. 
however, when running the k8s integration test, everything passes *except* the PVs w/local storage. from https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone: {code:java} - PVs with local storage *** FAILED *** The code passed to eventually never returned normally. Attempted 179 times over 3.00242447046 minutes. Last failure message: container not found ("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code} i've never seen this error before, and apparently there aren't many things here's how we launch minikube and create the mount: {code:java} minikube --vm-driver=docker start --memory 6000 --cpus 8 minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L & {code} i'll continue to dig in to this today, but i'm currently blocked... > Upgrade Minikube and kubernetes cluster version on Jenkins > -- > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312518#comment-17312518 ] Shane Knapp commented on SPARK-34738: - [~skonto] any insights? > Upgrade Minikube and kubernetes cluster version on Jenkins > -- > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). > [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312517#comment-17312517 ] Shane Knapp commented on SPARK-34738: - alright, sometimes these things go smoothly, sometimes not. this is firmly in the 'not' camp. after upgrading minikube and k8s, i was unable to mount a persistent volume when using the kvm2 driver. much debugging ensued. no progress was made. so, i decided to randomly try the docker minikube driver. voila! i'm now able to happily mount persistent volumes. however, when running the k8s integration test, everything passes *except* the PVs w/local storage. from https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone: {code:java} - PVs with local storage *** FAILED *** The code passed to eventually never returned normally. Attempted 179 times over 3.00242447046 minutes. Last failure message: container not found ("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code} i've never seen this error before, and apparently there aren't many things here's how we launch minikube and create the mount: {code:java} minikube --vm-driver=docker start --memory 6000 --cpus 8 minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} --9p-version=9p2000.L & {code} i'll continue to dig in to this today, but i'm currently blocked... > Upgrade Minikube and kubernetes cluster version on Jenkins > -- > > Key: SPARK-34738 > URL: https://issues.apache.org/jira/browse/SPARK-34738 > Project: Spark > Issue Type: Task > Components: jenkins, Kubernetes >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Assignee: Shane Knapp >Priority: Major > > [~shaneknapp] as we discussed [on the mailing > list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html] > Minikube can be upgraded to the latest (v1.18.1) and kubernetes version > should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`). 
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new > method to configure the kubernetes client. Thanks in advance to use it for > testing on the Jenkins after the Minikube version is updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path
[ https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312504#comment-17312504 ] Brady Tello commented on SPARK-34883: - Sure thing. This Scala code results in the JVM stack trace below. Looks like an issue with hadoop's Globber class {code:java} var csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" var tempDF = (spark.read .option("sep", "\t") .option("multiLine", "True") .csv(csvFile) ) {code} {code:java} IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:dir Caused by: URISyntaxException: Relative path in absolute URI: test:dir at org.apache.hadoop.fs.Path.initialize(Path.java:205) at org.apache.hadoop.fs.Path.<init>(Path.java:171) at org.apache.hadoop.fs.Path.<init>(Path.java:93) at org.apache.hadoop.fs.Globber.glob(Globber.java:211) at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51) at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:51) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:303) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:303) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.rdd.RDD.partitions(RDD.scala:303) at 
org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1435) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:419) at org.apache.spark.rdd.RDD.take(RDD.scala:1429) at org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:225) at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:70) at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:62) at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:230) at scala.Option.orElse(Option.scala:447) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:223) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:460) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:408) at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:390) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:390) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:912) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705) at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-8965899:6) at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-8965899:50) at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw$$iw$$iw.<init>(command-8965899:52) at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw$$iw.<init>(command-8965899:54) at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw.<init>(command-8965899:56) at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw.<init>(command-8965899:58) at 
$lined238ff10de5c41758f3b1477d427b85e25.$read.<init>(command-8965899:60) at $lined238ff10de5c41758f3b1477d427b85e25.$read$.<init>(command-8965899:64) at $lined238ff10de5c41758f3b1477d427b85e25.$read$.<clinit>(command-8965899) at $lined238ff10de5c41758f3b1477d427b85e25.$eval$.$print$lzycompute(<console>:7) at $lined238ff10de5c41758f3b1477d427b85e25.$eval$.$print(<console>:6) at $lined238ff10de5c41758f3b1477d427b85e25.$eval.$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at
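The failure mode in the trace above, a relative path whose first segment contains a colon, is a general URI ambiguity rather than anything Spark-specific: everything before the first ':' in a relative reference parses as a scheme. A minimal Python sketch (illustrative only; Hadoop's Globber is Java) shows the same parse:

```python
from urllib.parse import urlparse

# "test:dir" is ambiguous as a URI reference: the text before the first ':'
# is taken as a scheme, which is what Hadoop's Path constructor trips over.
rel = urlparse("test:dir/pageviews_by_second.tsv")
print(rel.scheme)  # → 'test'
print(rel.path)    # → 'dir/pageviews_by_second.tsv'

# With a leading '/', the colon can no longer start a scheme, so the whole
# string parses as a plain path.
absolute = urlparse("/FileStore/myDir/test:dir/pageviews_by_second.tsv")
print(absolute.scheme)  # → ''
```

This is why the error only mentions the `test:dir` component: the glob code builds a relative child path from that segment alone, and the colon makes it look like a malformed URI.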
[jira] [Commented] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs
[ https://issues.apache.org/jira/browse/SPARK-34779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312498#comment-17312498 ] Baohe Zhang commented on SPARK-34779: - Thanks for pointing it out! I wasn't aware that task peak metrics contribute to the executor peak metrics. > ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat > occurs > --- > > Key: SPARK-34779 > URL: https://issues.apache.org/jira/browse/SPARK-34779 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: Baohe Zhang >Priority: Major > > The current implementation of ExecutorMetricsPoller uses the task count in each > stage to decide whether to keep a stage entry or not. In the case where the > executor has only 1 core, it may have these issues: > # Peak metrics missing (due to stage entry being removed within a heartbeat > interval) > # Unnecessary and frequent hashmap entry removal and insertion. > Assuming an executor with 1 core has 2 tasks (task1 and task2, both belong to > stage (0,0)) to execute in a heartbeat interval, the workflow in the current > ExecutorMetricsPoller implementation would be: > 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count > incremented to 1 > 2. 1st poll() -> update peak metrics of stage (0, 0) > 3. task1 end -> stage (0, 0) task count decremented to 0, stage (0, 0) entry > removed, peak metrics lost. > 4. task2 start -> stage (0, 0) entry created in stageTCMP, task count > incremented to 1 > 5. 2nd poll() -> update peak metrics of stage (0, 0) > 6. task2 end -> stage (0, 0) task count decremented to 0, stage (0, 0) entry > removed, peak metrics lost. > 7. heartbeat() -> empty or inaccurate peak metrics for stage (0,0) reported. > We can fix the issue by keeping entries with task count = 0 in the stageTCMP map > until a heartbeat occurs. 
At the heartbeat, after reporting the peak metrics > for each stage, we scan each stage in stageTCMP and remove entries with task > count = 0. > After the fix, the workflow would be: > 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count > incremented to 1 > 2. 1st poll() -> update peak metrics of stage (0, 0) > 3. task1 end -> stage (0, 0) task count decremented to 0, but the entry (0,0) > still remains. > 4. task2 start -> task count of stage (0,0) incremented to 1 > 5. 2nd poll() -> update peak metrics of stage (0, 0) > 6. task2 end -> stage (0, 0) task count decremented to 0, but the entry (0,0) > still remains. > 7. heartbeat() -> accurate peak metrics for stage (0, 0) reported. Remove > entry for stage (0,0) in stageTCMP because its task count is 0. > > How to verify the behavior? > Submit a job with a custom polling interval (e.g., 2s) and > spark.executor.cores=1 and check the debug logs of ExecutorMetricsPoller. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
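The before/after workflows above can be modeled as a toy state machine. The following is a hypothetical Python sketch of the proposed bookkeeping, not Spark's actual Scala implementation; the dictionary loosely mirrors the stageTCMP map, and the key point is that task end only decrements the count while pruning happens at heartbeat time:

```python
# Hypothetical sketch of the proposed stageTCMP bookkeeping (Spark's real
# code is Scala, in ExecutorMetricsPoller). Entries with task count 0
# survive until the next heartbeat, so peaks recorded by poll() between
# tasks are reported instead of being lost.
class StageMetricsSketch:
    def __init__(self):
        # (stage_id, attempt) -> {"count": running tasks, "peak": peak metric}
        self.stage_tcmp = {}

    def task_start(self, stage):
        entry = self.stage_tcmp.setdefault(stage, {"count": 0, "peak": 0.0})
        entry["count"] += 1

    def task_end(self, stage):
        # Decrement only; the entry survives even at count == 0.
        self.stage_tcmp[stage]["count"] -= 1

    def poll(self, stage, value):
        entry = self.stage_tcmp.get(stage)
        if entry is not None:
            entry["peak"] = max(entry["peak"], value)

    def heartbeat(self):
        # Report peaks first, then drop only the idle entries.
        report = {s: e["peak"] for s, e in self.stage_tcmp.items()}
        self.stage_tcmp = {s: e for s, e in self.stage_tcmp.items()
                           if e["count"] > 0}
        return report
```

Running the two-task scenario from the description through this sketch, the heartbeat reports the peak observed during task1 (step 2's poll) rather than an empty value, and the stage entry is removed only after reporting.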
[jira] [Resolved] (SPARK-34684) Hadoop config could not be successfully serialized from driver pods to executor pods
[ https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34684. -- Resolution: Not A Problem > Hadoop config could not be successfully serialized from driver pods to > executor pods > --- > > Key: SPARK-34684 > URL: https://issues.apache.org/jira/browse/SPARK-34684 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1, 3.0.2 >Reporter: Yue Peng >Priority: Major > > I have set HADOOP_CONF_DIR correctly. And I have verified that hadoop configs > have been stored into a configmap and mounted to the driver. However, the spark pi > example job keeps failing because executors do not know how to talk to hdfs. I > highly suspect that there is a bug causing it, as I manually created a > configmap storing hadoop configs and mounted it to the executor in the template file, > which fixed the error. > > Spark submit command: > /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi > --deploy-mode cluster --master k8s://https://10.***.18.96:6443 > --num-executors 1 --conf spark.kubernetes.namespace=test --conf > spark.kubernetes.container.image= --conf > spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template > --conf > spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template > --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars > hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000 > > > Error log: > > 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created > connection to > org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078 > after 608 ms (392 ms spent in bootstraps) > 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root > 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root > 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to: > 21/03/10 06:59:58 INFO SecurityManager: Changing modify 
acls groups to: > 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication > enabled; ui acls disabled; users with view permissions: Set(root); groups > with view permissions: Set(); users with modify permissions: Set(root); > groups with modify permissions: Set() > 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created > connection to > org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078 > after 130 ms (104 ms spent in bootstraps) > 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at > /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d > 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 > MiB > 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: > spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078 > 21/03/10 06:59:59 INFO ResourceUtils: > == > 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor: > 21/03/10 06:59:59 INFO ResourceUtils: > == > 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered > with driver > 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192 > 21/03/10 07:00:00 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956. 
> 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on > 100.64.0.192:37956 > 21/03/10 07:00:00 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(1, 100.64.0.192, 37956, None) > 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(1, 100.64.0.192, 37956, None) > 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: > BlockManagerId(1, 100.64.0.192, 37956, None) > 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0 > 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 1 > 21/03/10 07:00:01 INFO Executor: Running task 1.0 in stage 0.0 (TID 1) > 21/03/10 07:00:01 INFO Executor: Running task 0.0 in stage 0.0 (TID 0) > 21/03/10 07:00:01 INFO Executor: Fetching > spark://org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078/jars/spark-examples_2.12-3.0.125067.jar > with timestamp 1615359587432 > 21/03/10 07:00:01 INFO TransportClientFactory: Successfully created > connection to >
[jira] [Resolved] (SPARK-34751) Parquet with invalid chars on column name reads double as null when a clean schema is applied
[ https://issues.apache.org/jira/browse/SPARK-34751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34751. -- Resolution: Duplicate > Parquet with invalid chars on column name reads double as null when a clean > schema is applied > - > > Key: SPARK-34751 > URL: https://issues.apache.org/jira/browse/SPARK-34751 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.3, 3.1.1 > Environment: Pyspark 2.4.3 > AWS Glue Dev Endpoint EMR >Reporter: Nivas Umapathy >Priority: Major > Attachments: invalid_columns_double.parquet > > > I have a parquet file that has data with invalid column names on it. > [#Reference](https://issues.apache.org/jira/browse/SPARK-27442) Here is the > file attached with this ticket. > I tried to load this file with > {{df = glue_context.read.parquet('invalid_columns_double.parquet')}} > {{df = df.withColumnRenamed('COL 1', 'COL_1')}} > {{df = df.withColumnRenamed('COL,2', 'COL_2')}} > {{df = df.withColumnRenamed('COL;3', 'COL_3') }} > and so on. > Now if i call > {{df.show()}} > it throws this exception that is still pointing to the old column name. > {{pyspark.sql.utils.AnalysisException: 'Attribute name "COL 1" contains > invalid character(s) among " ,;{}()}} > {{\n}} > {{\t=". Please use alias to rename it.;'}} > > When i read about it in some blogs, there was a suggestion to re-read the same > parquet with the new schema applied. So i did > {{df = > glue_context.read.schema(df.schema).parquet(}}{{'invalid_columns_double.parquet')}} > > and it works, but all the data in the dataframe are null. The same works for > String datatypes. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
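The AnalysisException quoted above enumerates the exact characters Spark rejects in Parquet column names: space, comma, semicolon, braces, parentheses, newline, tab, and equals. A hypothetical helper (not part of Spark or Glue) sketches the usual rename-first workaround by mapping those characters to underscores; note it only produces valid names and does not address the null-on-read behavior that this ticket reports:

```python
import re

# Characters rejected by Spark's Parquet column-name check, per the
# AnalysisException in the report: " ,;{}()\n\t="
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")

def sanitize_column(name: str) -> str:
    """Replace every character Spark rejects with an underscore."""
    return INVALID_CHARS.sub("_", name)

# e.g. for the columns in the attached file:
# ["COL 1", "COL,2", "COL;3"] → ["COL_1", "COL_2", "COL_3"]
```

In PySpark this would typically be applied as `df.toDF(*[sanitize_column(c) for c in df.columns])` before any action runs, though the re-read-with-clean-schema nulls described here stem from the Parquet reader matching columns by name, which a rename alone cannot fix.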
[jira] [Resolved] (SPARK-34865) cannot resolve methods when & col in spark java
[ https://issues.apache.org/jira/browse/SPARK-34865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34865. -- Resolution: Duplicate > cannot resolve methods when & col in spark java > --- > > Key: SPARK-34865 > URL: https://issues.apache.org/jira/browse/SPARK-34865 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.1.1 >Reporter: unical1988 >Priority: Major > > Hello, > > I am using Spark 3.1.1 with JAVA and the method when of Column and also col > but when i compile it says cannot resolve method when & cannot resolve method > col. > > What is the matter ? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34864) cannot resolve methods when & col in spark java
[ https://issues.apache.org/jira/browse/SPARK-34864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34864. -- Resolution: Invalid > cannot resolve methods when & col in spark java > --- > > Key: SPARK-34864 > URL: https://issues.apache.org/jira/browse/SPARK-34864 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.1.1 >Reporter: unical1988 >Priority: Major > > Hello, > > I am using Spark 3.1.1 with JAVA and the method when of Column and also col > but when i compile it says cannot resolve method when & cannot resolve method > col. > > What is the matter ? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34866) cannot resolve method when of Column() spark java
[ https://issues.apache.org/jira/browse/SPARK-34866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34866. -- Resolution: Invalid > cannot resolve method when of Column() spark java > - > > Key: SPARK-34866 > URL: https://issues.apache.org/jira/browse/SPARK-34866 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.1.1 >Reporter: unical1988 >Priority: Major > > cannot resolve method when of Column, spark java and also method col, what's > the problem ? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34867) cannot resolve method when of Column() spark java
[ https://issues.apache.org/jira/browse/SPARK-34867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34867. -- Resolution: Duplicate > cannot resolve method when of Column() spark java > - > > Key: SPARK-34867 > URL: https://issues.apache.org/jira/browse/SPARK-34867 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.1.1 >Reporter: unical1988 >Priority: Major > > The error says cannot resolve method main (of Column()) and col in Spark 3.1.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path
[ https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312473#comment-17312473 ] Sean R. Owen commented on SPARK-34883: -- I'm guessing it's somehow that a different code path takes over for multiline parsing with univocity, but not sure. Do you have the JVM side stack trace? > Setting CSV reader option "multiLine" to "true" causes URISyntaxException > when colon is in file path > > > Key: SPARK-34883 > URL: https://issues.apache.org/jira/browse/SPARK-34883 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.1 >Reporter: Brady Tello >Priority: Major > > Setting the CSV reader's "multiLine" option to "True" throws the following > exception when a ':' character is in the file path. > > {code:java} > java.net.URISyntaxException: Relative path in absolute URI: test:dir > {code} > I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error > whether I use Scala, Python, or SQL. 
> The following code works fine: > > {code:java} > csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" > tempDF = (spark.read.option("sep", "\t").csv(csvFile) > {code} > While the following code fails: > > {code:java} > csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" > tempDF = (spark.read.option("sep", "\t").option("multiLine", > "True").csv(csvFile) > {code} > Full Stack Trace from Python: > > {code:java} > --- > IllegalArgumentException Traceback (most recent call last) > in > 3 csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" > 4 > > 5 tempDF = (spark.read.option("sep", "\t").option("multiLine", "True") > /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, > sep, encoding, quote, escape, comment, header, inferSchema, > ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, > positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, > maxCharsPerColumn, maxMalformedLogPerPartition, mode, > columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, > samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, > recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling) > 735 path = [path] > 736 if type(path) == list: > --> 737 return > self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) > 738 elif isinstance(path, RDD): > 739 def func(iterator): > /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in > __call__(self, *args) > 1302 > 1303 answer = self.gateway_client.send_command(command) > -> 1304 return_value = get_return_value( > 1305 answer, self.gateway_client, self.target_id, self.name) > 1306 > /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) > 114 # Hide where the exception came from that shows a non-Pythonic > 115 # JVM exception message. 
> --> 116 raise converted from None > 117 else: > 118 raise IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: test:dir > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala
[ https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312471#comment-17312471 ] Sean R. Owen commented on SPARK-34904: -- [~langustav] I don't see any code in the org.lz4 namespace in the project? what am I missing? https://github.com/lz4/lz4-java/tree/master/src/java > Old import of LZ4 package inside CompressionCodec.scala > > > Key: SPARK-34904 > URL: https://issues.apache.org/jira/browse/SPARK-34904 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.1.1 >Reporter: Michal Zeman >Priority: Minor > > This commit should upgrade the version of the LZ4 package: > [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f] > > The dependency was changed. However, inside the file [CompressionCodec.scala > |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala] > the old import referencing net.jpountz.lz4 (where versions up to 1.3. are > held) remains. > Probably for backward compatibility, the newer version of the org.lz4 > package still contains net.jpountz.lz4 > ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. > Therefore this import does not cause problems at first sight. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala
[ https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-34904: - Priority: Minor (was: Major) > Old import of LZ4 package inside CompressionCodec.scala > > > Key: SPARK-34904 > URL: https://issues.apache.org/jira/browse/SPARK-34904 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.1.1 >Reporter: Michal Zeman >Priority: Minor > > This commit should upgrade the version of the LZ4 package: > [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f] > > The dependency was changed. However, inside the file [CompressionCodec.scala > |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala] > the old import referencing net.jpountz.lz4 (where versions up to 1.3. are > held) remains. > Probably for backward compatibility, the newer version of the org.lz4 > package still contains net.jpountz.lz4 > ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. > Therefore this import does not cause problems at first sight. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders
[ https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312466#comment-17312466 ] Sean R. Owen commented on SPARK-34910: -- Out of curiosity, when is that advantageous? > JDBC - Add an option for different stride orders > > > Key: SPARK-34910 > URL: https://issues.apache.org/jira/browse/SPARK-34910 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jason Yarbrough >Priority: Trivial > > Currently, the JDBCRelation columnPartition function orders the strides in > ascending order, starting from the lower bound and working its way towards > the upper bound. > I'm proposing leaving that as the default, but adding an option (such as > strideOrder) in JDBCOptions. Since it will default to the current behavior, > this will keep people's current code working as expected. However, people who > may have data skew closer to the upper bound might appreciate being able to > have the strides in descending order, thus filling up the first partition > with the last stride and so forth. Also, people with nondeterministic data > skew or sporadic data density might be able to benefit from a random ordering > of the strides. > I have the code written to implement this, and it creates a pattern that can > be used to add other algorithms that people may want to add (such as counting > the rows and ranking each stride, and then ordering from most dense to > least). The current two options I have coded are 'descending' and 'random.' > The original idea was to create something closer to Spark's hash partitioner, > but for JDBC and pushed down to the database engine for efficiency. However, > that would require adding hashing algorithms for each dialect, and the > performance cost of those algorithms may outweigh the benefit. The method I'm > proposing in this ticket avoids those complexities while still giving some of > the benefit (in the case of random ordering). 
> I'll put a PR in if others feel this is a good idea. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
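The stride reordering proposed in SPARK-34910 can be sketched outside Spark: a hypothetical `strideOrder` option would only change the order of the per-partition ranges that `columnPartition` already generates, not their contents. The function names below (`make_strides`, `order_strides`) are illustrative, not Spark APIs.

```python
import random

def make_strides(lower, upper, num_partitions):
    """Split [lower, upper) into contiguous strides in ascending order,
    loosely mirroring how JDBCRelation.columnPartition lays out partitions."""
    size = (upper - lower) // num_partitions
    bounds = [lower + i * size for i in range(num_partitions)] + [upper]
    return [(bounds[i], bounds[i + 1]) for i in range(num_partitions)]

def order_strides(strides, stride_order="ascending", seed=None):
    """Reorder strides; 'ascending' keeps today's default behavior."""
    if stride_order == "ascending":
        return list(strides)
    if stride_order == "descending":
        # Fills the first partition with the last (upper-bound) stride,
        # which helps when data is skewed toward the upper bound.
        return list(reversed(strides))
    if stride_order == "random":
        shuffled = list(strides)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown strideOrder: {stride_order}")

strides = make_strides(0, 100, 4)
print(order_strides(strides, "descending")[0])  # the upper-bound stride comes first
```

Reordering leaves each stride's WHERE-clause range untouched, which is why existing code keeps working when the option defaults to ascending.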
[jira] [Resolved] (SPARK-34912) Start spark shell to read file and report an error
[ https://issues.apache.org/jira/browse/SPARK-34912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34912. -- Resolution: Invalid Very old Spark + something wrong with your config of AWS libs > Start spark shell to read file and report an error > -- > > Key: SPARK-34912 > URL: https://issues.apache.org/jira/browse/SPARK-34912 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.0 >Reporter: czh >Priority: Major > > Start spark shell to read external file and report an error > > Class org.apache.hadoop .fs.s3a.S3AFileSystem not found > > Spark version 1.6 and Hadoop version 2.6.0 have copied aws-java-sdk-1.7.4.jar > and hadoop-aws-2.6.0.jar to the Lib package under spark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34913) Start spark shell to read file and report an error
[ https://issues.apache.org/jira/browse/SPARK-34913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34913. -- Resolution: Duplicate > Start spark shell to read file and report an error > -- > > Key: SPARK-34913 > URL: https://issues.apache.org/jira/browse/SPARK-34913 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.6.0 >Reporter: czh >Priority: Major > Attachments: 微信截图_20210331110125.png > > > Start spark shell to read external file and report an error > Class org.apache.hadoop .fs.s3a.S3AFileSystem not found > Spark version 1.6 and Hadoop version 2.6.0 have copied aws-java-sdk-1.7.4.jar > and hadoop-aws-2.6.0.jar to the Lib package under spark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34921) Error in spark accessing file
[ https://issues.apache.org/jira/browse/SPARK-34921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34921. -- Resolution: Invalid Not nearly enough info > Error in spark accessing file > - > > Key: SPARK-34921 > URL: https://issues.apache.org/jira/browse/SPARK-34921 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: czh >Priority: Major > Attachments: 微信截图_20210331183234.png > > > Error in spark accessing file java.lang.NumberFormatException : For input > string: "100M" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34510) .foreachPartition command hangs when run inside Python package but works when run from Python file outside the package on EMR
[ https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312460#comment-17312460 ] Sean R. Owen commented on SPARK-34510: -- Got it. Without more detail or errors, or a repro against OSS Spark, hard to say or do much here. It does seem like something to do with the structure of the code or its execution, not Spark. > .foreachPartition command hangs when run inside Python package but works when > run from Python file outside the package on EMR > - > > Key: SPARK-34510 > URL: https://issues.apache.org/jira/browse/SPARK-34510 > Project: Spark > Issue Type: Bug > Components: EC2, PySpark >Affects Versions: 3.0.0 >Reporter: Yuriy >Priority: Minor > Attachments: Code.zip > > > I'm running PySpark 3.0.0 on EMR with the project structure below; process.py is > what controls the flow of the application and calls code inside the > _file_processor_ package. The command hangs when the .foreachPartition code > that is located inside _s3_repo.py_ is called by _process.py_. When the same > .foreachPartition code is moved from _s3_repo.py_ and placed inside > _process.py_, it runs just fine. 
> {code:java} > process.py > file_processor > config > spark.py > repository > s3_repo.py > structure > table_creator.py > {code} > *process.py* > {code:java} > from file_processor.structure import table_creator > from file_processor.repository import s3_repo > def process(): > table_creator.create_table() > s3_repo.save_to_s3() > if __name__ == '__main__': > process() > {code} > *spark.py* > {code:java} > from pyspark.sql import SparkSession > spark_session = SparkSession.builder.appName("Test").getOrCreate() > {code} > *s3_repo.py* > {code:java} > from file_processor.config.spark import spark_session > def save_to_s3(): > spark_session.sql('SELECT * FROM > rawFileData').toJSON().foreachPartition(_save_to_s3) > def _save_to_s3(iterator): > for record in iterator: > print(record) > {code} > *table_creator.py* > {code:java} > from file_processor.config.spark import spark_session > from pyspark.sql import Row > def create_table(): > file_contents = [ > {'line_num': 1, 'contents': 'line 1'}, > {'line_num': 2, 'contents': 'line 2'}, > {'line_num': 3, 'contents': 'line 3'} > ] > spark_session.createDataFrame(Row(**row) for row in > file_contents).cache().createOrReplaceTempView("rawFileData") > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34257) Improve performance for last_value over unbounded window frame
[ https://issues.apache.org/jira/browse/SPARK-34257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34257. -- Resolution: Won't Fix > Improve performance for last_value over unbounded window frame > -- > > Key: SPARK-34257 > URL: https://issues.apache.org/jira/browse/SPARK-34257 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: jiaan.geng >Priority: Major > > The current implement of `last_value` over unbounded window frame will > execute `updateExpressions` multiple times. > In fact, `last_value` only execute `updateExpressions` once on the last row. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
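The optimization SPARK-34257 describes can be illustrated with a toy model: over an unbounded window frame, `last_value` only depends on the frame's final row, so running the update expression once per row does redundant work. The functions below are a sketch of that idea, not Spark's window-aggregate implementation.

```python
def last_value_naive(rows):
    """Current behavior sketched: the update expression runs once per row,
    even though only the last row matters for an unbounded frame."""
    result, updates = None, 0
    for value in rows:
        result = value   # update expression
        updates += 1
    return result, updates

def last_value_optimized(rows):
    """Proposed idea: evaluate the update expression only on the last row."""
    if not rows:
        return None, 0
    return rows[-1], 1

print(last_value_naive([3, 1, 2]))      # (2, 3)
print(last_value_optimized([3, 1, 2]))  # (2, 1)
```

Both return the same value; only the number of update evaluations differs, which is the performance gap the ticket targets.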
[jira] [Commented] (SPARK-34806) Helper class for batch Dataset.observe()
[ https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312403#comment-17312403 ] Enrico Minack commented on SPARK-34806: --- The interaction with {{Observation}} can be split into three steps: # create an {{Observation}} instance # observe a {{Dataset}} # retrieve the metrics Each of these could be designed in different ways: *1. create an {{Observation}} instance* {code:scala} // without column expressions val observation = Observation("name") val observation = spark.observation("name") // with column expressions val observation = Observation("name", count(lit(1)), sum($"id"), mean($"id")) val observation = spark.observation("name", avg($"id").cast("int").as("avg_val")) {code} *2. observe a {{Dataset}}* {code:scala} val observed = df.observe(observation) val observed = df.observe(observation, count(lit(1)), sum($"id"), mean($"id")) val observed = observation.on(ds) val observed = ds.transform(observation.on) {code} *3. retrieve the metrics* {code:scala} // ways to retrieve the metrics val metrics: Row = observation.get {code} So we have these design decisions: - Are column expressions constant to an observation? -- I would prefer this: increases immutability of the observation instance - Create an {{Observation}} through a session method or by simply instantiating class {{Observation}}? -- I would prefer this: created instance can register right away with the session listeners, not on "2. observe a {{Dataset}}" - Extend the {{Dataset}} API by overloading `observe`? -- I would prefer: {{df.observe(observation)}} as it mimics existing {{df.observe(name, col, cols)}}, we could still have {{Observation.on(Dataset)}} and users can pick their favourite - Move the logic that rejects stream datasets from `Dataset.observe` to {{Observation.on}}. As it is not the {{Dataset}} that rejects this construct, it is the {{Observation}}. - Make {{Observation}} thread-safe? 
-- I'd argue, the user of {{Observation}} has to make it thread-safe by calling the actions on the observed {{Dataset}} in the same thread as calling {{Observation.get}}, or synchronizing the threads that do these bits. I would not expect it to be a general use case to have these two in different threads. - Make the observation name optional -- What is the use of giving it a name? When optional, it could generate a random UUID name internally. > Helper class for batch Dataset.observe() > > > Key: SPARK-34806 > URL: https://issues.apache.org/jira/browse/SPARK-34806 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Enrico Minack >Priority: Minor > > The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It > allows collecting aggregate metrics over data of a Dataset while they are > being processed during an action. > These metrics are collected in a separate thread after registering > {{QueryExecutionListener}} for batch datasets and {{StreamingQueryListener}} > for stream datasets, respectively. While in a streaming context it makes > perfect sense to process incremental metrics in an event-based fashion, for > simple batch dataset processing, a single result should be retrievable > without the need to register listeners or handle threading. > Introducing an {{Observation}} helper class can hide that complexity for > simple use-cases in batch processing. > Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. > {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method > to create a new {{Observation}} instance and register it with the session. > Alternatively, an {{Observation}} instance could be instantiated on its own > which on calling {{Observation.on(Dataset)}} registers with > {{Dataset.sparkSession}}. This "registration" registers a listener with the > session that retrieves the metrics. > The {{Observation}} class provides methods to retrieve the metrics. 
This > retrieval has to wait for the listener to be called in a separate thread. So > all methods will wait for this, optionally with a timeout: > - {{Observation.get}} waits without timeout and returns the metric. > - {{Observation.option(time, unit)}} waits at most {{time}}, returns the > metric as an {{Option}}, or {{None}} when the timeout occurs. > - {{Observation.waitCompleted(time, unit)}} waits for the metrics and > indicates timeout by returning {{false}}. > Obviously, an action has to be called on the observed dataset before any of > these methods are called, otherwise a timeout will occur. > With {{Observation.reset}}, another action can be observed. Finally, > {{Observation.close}} unregisters the listener from the session. -- This message was sent by Atlassian Jira (v8.3.4#803005)
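The blocking semantics described for {{Observation.get}}, {{Observation.option}}, and {{Observation.waitCompleted}} boil down to waiting on an event that a listener sets from another thread. Here is a minimal Python sketch of those semantics under that assumption; it is not the Spark API, and `_on_metrics` stands in for the listener callback.

```python
import threading

class Observation:
    """Sketch of the proposed helper; names mirror the Jira discussion."""
    def __init__(self, name):
        self.name = name
        self._done = threading.Event()
        self._row = None

    def _on_metrics(self, row):
        # In Spark this would be driven by a QueryExecutionListener
        # firing after the action completes.
        self._row = row
        self._done.set()

    def wait_completed(self, timeout=None):
        """Wait for the metrics; False indicates a timeout."""
        return self._done.wait(timeout)

    def get(self):
        """Wait without timeout and return the metrics row."""
        self._done.wait()
        return self._row

    def option(self, timeout):
        """Wait at most `timeout` seconds; None on timeout."""
        return self._row if self._done.wait(timeout) else None

    def reset(self):
        """Allow another action to be observed."""
        self._done.clear()
        self._row = None

obs = Observation("name")
assert obs.option(0.01) is None          # no action has run yet -> timeout
obs._on_metrics({"count": 3, "sum": 6})  # listener fires after the action
print(obs.get())  # {'count': 3, 'sum': 6}
```

This also shows why calling `get` before any action on the observed dataset must time out or block, as the description notes.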
[jira] [Assigned] (SPARK-34821) Set up a workflow for developers to run benchmark in their fork
[ https://issues.apache.org/jira/browse/SPARK-34821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34821: Assignee: (was: Apache Spark) > Set up a workflow for developers to run benchmark in their fork > --- > > Key: SPARK-34821 > URL: https://issues.apache.org/jira/browse/SPARK-34821 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > Please refer to the discussion in https://github.com/apache/spark/pull/31917. > Currently we don't have a pinned environment to run the benchmark, and it's > difficult to track and follow the performance differences. > It would be great if we had an easy way to get the benchmark results like > "Running tests in your forked repository using GitHub Actions" in > https://spark.apache.org/developer-tools.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34821) Set up a workflow for developers to run benchmark in their fork
[ https://issues.apache.org/jira/browse/SPARK-34821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312374#comment-17312374 ] Apache Spark commented on SPARK-34821: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32015 > Set up a workflow for developers to run benchmark in their fork > --- > > Key: SPARK-34821 > URL: https://issues.apache.org/jira/browse/SPARK-34821 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > Please refer to the discussion in https://github.com/apache/spark/pull/31917. > Currently we don't have a pinned environment to run the benchmark, and it's > difficult to track and follow the performance differences. > It would be great if we had an easy way to get the benchmark results like > "Running tests in your forked repository using GitHub Actions" in > https://spark.apache.org/developer-tools.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34821) Set up a workflow for developers to run benchmark in their fork
[ https://issues.apache.org/jira/browse/SPARK-34821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34821: Assignee: Apache Spark > Set up a workflow for developers to run benchmark in their fork > --- > > Key: SPARK-34821 > URL: https://issues.apache.org/jira/browse/SPARK-34821 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Please refer to the discussion in https://github.com/apache/spark/pull/31917. > Currently we don't have a pinned environment to run the benchmark, and it's > difficult to track and follow the performance differences. > It would be great if we had an easy way to get the benchmark results like > "Running tests in your forked repository using GitHub Actions" in > https://spark.apache.org/developer-tools.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34881) New SQL Function: TRY_CAST
[ https://issues.apache.org/jira/browse/SPARK-34881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-34881. Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31982 [https://github.com/apache/spark/pull/31982] > New SQL Function: TRY_CAST > -- > > Key: SPARK-34881 > URL: https://issues.apache.org/jira/browse/SPARK-34881 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.2.0 > > > Add a new SQL function try_cast. try_cast is identical to CAST with > `spark.sql.ansi.enabled` as true, except it returns NULL instead of raising > an error. This expression has one major difference from `cast` with > `spark.sql.ansi.enabled` as true: when the source value can't be stored in > the target integral(Byte/Short/Int/Long) type, `try_cast` returns null > instead of returning the low order bytes of the source value. > This is learned from Google BigQuery and Snowflake: > https://docs.snowflake.com/en/sql-reference/functions/try_cast.html > https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_casting -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34922) Use better CBO cost function
[ https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34922: Assignee: Apache Spark > Use better CBO cost function > > > Key: SPARK-34922 > URL: https://issues.apache.org/jira/browse/SPARK-34922 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Assignee: Apache Spark >Priority: Major > > In SPARK-33935 we changed the CBO cost function such that it would be > symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could > have been true. > That change introduced a performance regression in some queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34922) Use better CBO cost function
[ https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312316#comment-17312316 ] Apache Spark commented on SPARK-34922: -- User 'tanelk' has created a pull request for this issue: https://github.com/apache/spark/pull/32014 > Use better CBO cost function > > > Key: SPARK-34922 > URL: https://issues.apache.org/jira/browse/SPARK-34922 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Major > > In SPARK-33935 we changed the CBO cost function such that it would be > symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could > have been true. > That change introduced a performance regression in some queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34922) Use better CBO cost function
[ https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34922: Assignee: (was: Apache Spark) > Use better CBO cost function > > > Key: SPARK-34922 > URL: https://issues.apache.org/jira/browse/SPARK-34922 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Tanel Kiis >Priority: Major > > In SPARK-33935 we changed the CBO cost function such that it would be > symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could > have been true. > That change introduced a performance regression in some queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34922) Use better CBO cost function
Tanel Kiis created SPARK-34922: -- Summary: Use better CBO cost function Key: SPARK-34922 URL: https://issues.apache.org/jira/browse/SPARK-34922 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Tanel Kiis In SPARK-33935 we changed the CBO cost function such that it would be symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could have been true. That change introduced a performance regression in some queries. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
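The symmetry property SPARK-34922 refers to can be shown with toy cost functions; the formulas below are illustrative only, not Spark's actual CBO cost model. A comparison that can prefer A over B on one dimension and B over A on another violates `A.betterThan(B) implies !B.betterThan(A)`; collapsing both dimensions into one scalar cost restores the property.

```python
def better_than_asymmetric(a, b):
    """Non-symmetric comparison (illustrative): preferring a plan if it
    wins on either row count or size means A can 'beat' B while B also
    'beats' A."""
    rows_a, size_a = a
    rows_b, size_b = b
    return rows_a < rows_b or size_a < size_b

def better_than_symmetric(a, b, weight=0.7):
    """Symmetric comparison: each plan gets one scalar cost, so at most
    one of betterThan(A, B), betterThan(B, A) can hold."""
    cost = lambda rows, size: rows * weight + size * (1 - weight)
    return cost(*a) < cost(*b)

a, b = (1, 10), (10, 1)  # (row count, size)
print(better_than_asymmetric(a, b), better_than_asymmetric(b, a))  # True True
print(better_than_symmetric(a, b), better_than_symmetric(b, a))    # True False
```

The ticket's point is that making the comparison symmetric, while cleaner, changed which join orders win and regressed some queries.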
[jira] [Updated] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them
[ https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34915: - Fix Version/s: 3.0.3 3.1.2 > Cache Maven, SBT and Scala in all jobs that use them > > > Key: SPARK-34915 > URL: https://issues.apache.org/jira/browse/SPARK-34915 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.2.0, 3.1.2, 3.0.3 > > > We should cache SBT, Maven and Scala for all jobs that use them. This is > currently missing in some jobs such as > https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34921) Error in spark accessing file
[ https://issues.apache.org/jira/browse/SPARK-34921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] czh updated SPARK-34921: Attachment: 微信截图_20210331183234.png > Error in spark accessing file > - > > Key: SPARK-34921 > URL: https://issues.apache.org/jira/browse/SPARK-34921 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 3.0.0 >Reporter: czh >Priority: Major > Attachments: 微信截图_20210331183234.png > > > Error in spark accessing file java.lang.NumberFormatException : For input > string: "100M" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34921) Error in spark accessing file
czh created SPARK-34921: --- Summary: Error in spark accessing file Key: SPARK-34921 URL: https://issues.apache.org/jira/browse/SPARK-34921 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 3.0.0 Reporter: czh Attachments: 微信截图_20210331183234.png Error in spark accessing file java.lang.NumberFormatException : For input string: "100M" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception
[ https://issues.apache.org/jira/browse/SPARK-34920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312258#comment-17312258 ] Apache Spark commented on SPARK-34920: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/32013 > Introduce SQLSTATE and ERRORCODE to SQL Exception > - > > Key: SPARK-34920 > URL: https://issues.apache.org/jira/browse/SPARK-34920 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > SQLSTATE is SQL standard state. Please see -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception
[ https://issues.apache.org/jira/browse/SPARK-34920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34920: Assignee: Apache Spark > Introduce SQLSTATE and ERRORCODE to SQL Exception > - > > Key: SPARK-34920 > URL: https://issues.apache.org/jira/browse/SPARK-34920 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > SQLSTATE is SQL standard state. Please see -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception
[ https://issues.apache.org/jira/browse/SPARK-34920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34920: Assignee: (was: Apache Spark) > Introduce SQLSTATE and ERRORCODE to SQL Exception > - > > Key: SPARK-34920 > URL: https://issues.apache.org/jira/browse/SPARK-34920 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > SQLSTATE is SQL standard state. Please see -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception
[ https://issues.apache.org/jira/browse/SPARK-34920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-34920: Description: SQLSTATE is SQL standard state. Please see was: SQLSTATE is SQL standard state > Introduce SQLSTATE and ERRORCODE to SQL Exception > - > > Key: SPARK-34920 > URL: https://issues.apache.org/jira/browse/SPARK-34920 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Major > > SQLSTATE is SQL standard state. Please see -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception
Yuming Wang created SPARK-34920: --- Summary: Introduce SQLSTATE and ERRORCODE to SQL Exception Key: SPARK-34920 URL: https://issues.apache.org/jira/browse/SPARK-34920 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Yuming Wang SQLSTATE is SQL standard state -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
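The idea in SPARK-34920 is to attach machine-readable error metadata to SQL exceptions. A minimal sketch follows; the class and field names are hypothetical, not what the eventual PR uses. SQLSTATE values are standard 5-character codes (a 2-character class plus a 3-character subclass), e.g. "22012" for division by zero; the numeric error code below is a made-up placeholder.

```python
class SparkSQLException(Exception):
    """Sketch: carry the standard SQLSTATE and a vendor error code
    alongside the human-readable message."""
    def __init__(self, message, sql_state, error_code):
        super().__init__(message)
        self.sql_state = sql_state    # 5-char SQL-standard state, e.g. "22012"
        self.error_code = error_code  # vendor-specific numeric code (placeholder)

err = SparkSQLException("divide by zero", sql_state="22012", error_code=1000)
print(err.sql_state, err.error_code)  # 22012 1000
```

Clients can then branch on `sql_state` portably instead of parsing message strings.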
[jira] [Resolved] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them
[ https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-34915. Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32011 [https://github.com/apache/spark/pull/32011] > Cache Maven, SBT and Scala in all jobs that use them > > > Key: SPARK-34915 > URL: https://issues.apache.org/jira/browse/SPARK-34915 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.2.0 > > > We should cache SBT, Maven and Scala for all jobs that use them. This is > currently missing in some jobs such as > https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them
[ https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-34915: -- Assignee: Hyukjin Kwon > Cache Maven, SBT and Scala in all jobs that use them > > > Key: SPARK-34915 > URL: https://issues.apache.org/jira/browse/SPARK-34915 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > We should cache SBT, Maven and Scala for all jobs that use them. This is > currently missing in some jobs such as > https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1
[ https://issues.apache.org/jira/browse/SPARK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312249#comment-17312249 ] Apache Spark commented on SPARK-34919: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/32012 > Change partitioning to SinglePartition if partition number is 1 > --- > > Key: SPARK-34919 > URL: https://issues.apache.org/jira/browse/SPARK-34919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Priority: Minor > > For node `Repartition` and `RepartitionByExpression`, if partition number is > 1 we can use `SinglePartition` instead of other `Partitioning`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1
[ https://issues.apache.org/jira/browse/SPARK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34919:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Assigned] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1
[ https://issues.apache.org/jira/browse/SPARK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34919:
------------------------------------

    Assignee: Apache Spark
[jira] [Created] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1
ulysses you created SPARK-34919:
--------------------------------

             Summary: Change partitioning to SinglePartition if partition number is 1
                 Key: SPARK-34919
                 URL: https://issues.apache.org/jira/browse/SPARK-34919
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: ulysses you


For the nodes `Repartition` and `RepartitionByExpression`, if the partition number is 1 we can use `SinglePartition` instead of any other `Partitioning`.
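The proposed rule can be sketched in plain Python. The class names below are illustrative stand-ins, not Spark's actual Catalyst types: the point is only that any partitioning scheme requesting exactly one partition is equivalent to `SinglePartition`.

```python
# Minimal sketch of the SPARK-34919 idea: a 1-partition repartition
# collapses to SinglePartition regardless of the requested scheme.
# Class names are illustrative, not Spark's Catalyst API.

class SinglePartition:
    num_partitions = 1

class RoundRobinPartitioning:
    def __init__(self, num_partitions):
        self.num_partitions = num_partitions

class HashPartitioning:
    def __init__(self, expressions, num_partitions):
        self.expressions = expressions
        self.num_partitions = num_partitions

def simplify_partitioning(partitioning):
    """Replace any partitioning that produces one partition with SinglePartition."""
    if partitioning.num_partitions == 1:
        return SinglePartition()
    return partitioning
```

With a single output partition there is nothing to distribute, so downstream operators can rely on the stronger `SinglePartition` guarantee (e.g. to elide shuffles) that a generic `HashPartitioning(..., 1)` would not convey.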
[jira] [Updated] (SPARK-34771) Support UDT for Pandas/Spark convertion with Arrow support Enabled
[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Darcy Shen updated SPARK-34771:
-------------------------------
    Summary: Support UDT for Pandas/Spark convertion with Arrow support Enabled  (was: Support UDT for Pandas/Spark convertion with Arrow Enabled)

> Support UDT for Pandas/Spark convertion with Arrow support Enabled
> ------------------------------------------------------------------
>
>                 Key: SPARK-34771
>                 URL: https://issues.apache.org/jira/browse/SPARK-34771
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.0.2, 3.1.1
>            Reporter: Darcy Shen
>            Priority: Major
>
> {code:python}
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> from pyspark.testing.sqlutils import ExamplePoint
> import pandas as pd
> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
> df = spark.createDataFrame(pdf)
> df.toPandas()
> {code}
> With `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without warnings.
> With `spark.sql.execution.arrow.enabled` = true, the above snippet works, but with warnings: because of an unsupported type in the conversion, the Arrow optimization is actually turned off.
> Detailed steps to reproduce:
> {code:python}
> $ bin/pyspark
> Python 3.8.8 (default, Feb 24 2021, 13:46:16)
> [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>       /_/
>
> Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
> Spark context Web UI available at http://172.30.0.226:4040
> Spark context available as 'sc' (master = local[*], app id = local-1615994008526).
> SparkSession available as 'spark'.
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> 21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
> >>> from pyspark.testing.sqlutils import ExamplePoint
> >>> import pandas as pd
> >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
> >>> df = spark.createDataFrame(pdf)
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
>   Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type
> Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> >>> df.show()
> +----------+
> |     point|
> +----------+
> |(0.0, 0.0)|
> |(0.0, 0.0)|
> +----------+
> >>> df.schema
> StructType(List(StructField(point,ExamplePointUDT,true)))
> >>> df.toPandas()
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
>   Unsupported type in conversion to Arrow: ExamplePointUDT
> Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
>        point
> 0  (0.0,0.0)
> 1  (0.0,0.0)
> {code}
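The fallback behaviour visible in the warnings above can be sketched in plain Python. The helper and its callable arguments are hypothetical (the real logic lives in pyspark/sql/pandas/conversion.py); the sketch only mirrors the try-Arrow-then-warn-then-fall-back shape controlled by `spark.sql.execution.arrow.pyspark.fallback.enabled`.

```python
import warnings

def to_pandas_with_arrow_fallback(df, arrow_convert, plain_convert):
    """Try the Arrow-optimized conversion first; if it rejects the schema
    (e.g. a UDT such as ExamplePointUDT), emit a warning and fall back to
    the slower non-Arrow conversion instead of failing."""
    try:
        return arrow_convert(df)
    except TypeError as exc:
        warnings.warn(
            "toPandas attempted Arrow optimization; however, it failed by "
            f"the reason below:\n  {exc}\n"
            "Attempting non-optimized conversion as fallback.")
        return plain_convert(df)
```

This is why the snippet in the report "works fine with WARNINGS": the Arrow path raises on the unsupported `ExamplePointUDT`, and the result the user sees comes from the non-optimized path.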
[jira] [Updated] (SPARK-34771) Support UDT for Pandas/Spark conversion with Arrow support Enabled
[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Darcy Shen updated SPARK-34771:
-------------------------------
    Summary: Support UDT for Pandas/Spark conversion with Arrow support Enabled  (was: Support UDT for Pandas/Spark convertion with Arrow support Enabled)
[jira] [Updated] (SPARK-34771) Support UDT for Pandas/Spark convertion with Arrow Enabled
[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Darcy Shen updated SPARK-34771:
-------------------------------
    Summary: Support UDT for Pandas/Spark convertion with Arrow Enabled  (was: Support UDT for Pandas/Spark convertion with Arrow Optimization)
[jira] [Updated] (SPARK-34771) Support UDT for Pandas/Spark convertion with Arrow Optimization
[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Darcy Shen updated SPARK-34771:
-------------------------------
    Summary: Support UDT for Pandas/Spark convertion with Arrow Optimization  (was: Support UDT for Pandas with Arrow Optimization)
[jira] [Assigned] (SPARK-34911) Fix code close issue in monitoring.md
[ https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen reassigned SPARK-34911:
------------------------------------

    Assignee: angerszhu

> Fix code close issue in monitoring.md
> -------------------------------------
>
>                 Key: SPARK-34911
>                 URL: https://issues.apache.org/jira/browse/SPARK-34911
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: angerszhu
>            Assignee: angerszhu
>            Priority: Trivial
>
> Fix code close issue in monitoring.md
[jira] [Resolved] (SPARK-34911) Fix code close issue in monitoring.md
[ https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-34911.
----------------------------------
    Fix Version/s: 3.2.0
       Resolution: Fixed

Issue resolved by pull request 32008
[https://github.com/apache/spark/pull/32008]