[jira] [Commented] (SPARK-34881) New SQL Function: TRY_CAST

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312853#comment-17312853
 ] 

Apache Spark commented on SPARK-34881:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32019

> New SQL Function: TRY_CAST
> --
>
> Key: SPARK-34881
> URL: https://issues.apache.org/jira/browse/SPARK-34881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Add a new SQL function try_cast. try_cast is identical to CAST with 
> `spark.sql.ansi.enabled` set to true, except that it returns NULL instead of 
> raising an error. This expression has one major difference from `cast` with 
> `spark.sql.ansi.enabled` as true: when the source value can't be stored in 
> the target integral (Byte/Short/Int/Long) type, `try_cast` returns null 
> instead of returning the low-order bytes of the source value.
> This follows Google BigQuery and Snowflake:
> https://docs.snowflake.com/en/sql-reference/functions/try_cast.html
> https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_casting
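For illustration, a minimal sketch of the NULL-instead-of-error behaviour described 
above, assuming a Spark build (3.2.0+) where the function is available and a local 
SparkSession:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("try-cast-demo")
  .getOrCreate()

// Under ANSI mode, a plain CAST raises runtime errors for malformed or overflowing input:
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT CAST('abc' AS INT)").show()   // would throw a runtime error

// try_cast returns NULL instead, both for malformed input and for integral overflow:
spark.sql("SELECT try_cast('abc' AS INT), try_cast(2147483648L AS TINYINT)").show()
// both columns come back as NULL
{code}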



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34881) New SQL Function: TRY_CAST

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312852#comment-17312852
 ] 

Apache Spark commented on SPARK-34881:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32019

> New SQL Function: TRY_CAST
> --
>
> Key: SPARK-34881
> URL: https://issues.apache.org/jira/browse/SPARK-34881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Add a new SQL function try_cast. try_cast is identical to CAST with 
> `spark.sql.ansi.enabled` set to true, except that it returns NULL instead of 
> raising an error. This expression has one major difference from `cast` with 
> `spark.sql.ansi.enabled` as true: when the source value can't be stored in 
> the target integral (Byte/Short/Int/Long) type, `try_cast` returns null 
> instead of returning the low-order bytes of the source value.
> This follows Google BigQuery and Snowflake:
> https://docs.snowflake.com/en/sql-reference/functions/try_cast.html
> https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_casting



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312851#comment-17312851
 ] 

Attila Zsolt Piros commented on SPARK-34738:


Shane, as Docker is available on every platform, I could check it locally 
(though I can only promise to do that next week). If I manage to reproduce this 
locally, I can take a look into the details. 

After this I think we should extend our documentation and suggest only those 
drivers the tests were run with (or at least mention that Docker is used for 
testing). 

Thanks for your effort! 

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR, which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34930) Install PyArrow and pandas on Jenkins

2021-03-31 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-34930:


 Summary: Install PyArrow and pandas on Jenkins
 Key: SPARK-34930
 URL: https://issues.apache.org/jira/browse/SPARK-34930
 Project: Spark
  Issue Type: Test
  Components: Project Infra
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon
Assignee: Shane Knapp


Looks like the Jenkins machines don't have pandas and PyArrow (ever since it got 
upgraded?), which results in skipping the related tests in PySpark; see also 
https://github.com/apache/spark/pull/31470#issuecomment-811618571

It would be great if we can install both in Python 3.6 on Jenkins.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34354) CostBasedJoinReorder can fail on self-join

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34354:


Assignee: (was: Apache Spark)

> CostBasedJoinReorder can fail on self-join
> --
>
> Key: SPARK-34354
> URL: https://issues.apache.org/jira/browse/SPARK-34354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
>  
> For example:
> {code:java}
> test("join reorder with self-join") {
>   val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === 
> nameToAttr("t2.k-1-5")))
> .select(nameToAttr("t1.v-1-10"))
> .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))
>   // this can fail
>   Optimize.execute(plan.analyze)
> }
> {code}
> error:
> {code:java}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35)
> {code}
>  
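For context, a rough sketch of how one might exercise the cost-based join-reorder 
path from the SQL side (assumed table names, local mode, and not verified to 
trigger the assertion above). CostBasedJoinReorder only applies when 
spark.sql.cbo.enabled and spark.sql.cbo.joinReorder.enabled are set and column 
statistics exist:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cbo-self-join")
  .config("spark.sql.cbo.enabled", "true")
  .config("spark.sql.cbo.joinReorder.enabled", "true")
  .getOrCreate()

// Hypothetical tables standing in for t1/t2 in the test snippet above.
spark.range(0, 1000).selectExpr("id AS k", "id % 10 AS v").write.saveAsTable("t1")
spark.range(0, 1000).selectExpr("id AS k", "id % 10 AS v").write.saveAsTable("t2")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS k, v")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR COLUMNS k, v")

// Self-join of t2 through a projection of t1's v, mirroring the shape of the failing plan.
spark.sql(
  """SELECT b.v
    |FROM t2 a
    |JOIN t1 b ON b.k = a.k
    |JOIN t2 c ON b.v = c.k""".stripMargin).collect()
{code}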



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34354) CostBasedJoinReorder can fail on self-join

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34354:


Assignee: Apache Spark

> CostBasedJoinReorder can fail on self-join
> --
>
> Key: SPARK-34354
> URL: https://issues.apache.org/jira/browse/SPARK-34354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
>  
> For example:
> {code:java}
> test("join reorder with self-join") {
>   val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === 
> nameToAttr("t2.k-1-5")))
> .select(nameToAttr("t1.v-1-10"))
> .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))
>   // this can fail
>   Optimize.execute(plan.analyze)
> }
> {code}
> error:
> {code:java}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-34354) CostBasedJoinReorder can fail on self-join

2021-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-34354:
--
  Assignee: (was: wuyi)

Reverted in 
https://github.com/apache/spark/commit/cc451c16a3d28bcdba226fa05f1b786ff2f02612

> CostBasedJoinReorder can fail on self-join
> --
>
> Key: SPARK-34354
> URL: https://issues.apache.org/jira/browse/SPARK-34354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
> Fix For: 3.2.0
>
>
>  
> For example:
> {code:java}
> test("join reorder with self-join") {
>   val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === 
> nameToAttr("t2.k-1-5")))
> .select(nameToAttr("t1.v-1-10"))
> .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))
>   // this can fail
>   Optimize.execute(plan.analyze)
> }
> {code}
> error:
> {code:java}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34354) CostBasedJoinReorder can fail on self-join

2021-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34354:
-
Fix Version/s: (was: 3.2.0)

> CostBasedJoinReorder can fail on self-join
> --
>
> Key: SPARK-34354
> URL: https://issues.apache.org/jira/browse/SPARK-34354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
>  
> For example:
> {code:java}
> test("join reorder with self-join") {
>   val plan = t2.join(t1, Inner, Some(nameToAttr("t1.k-1-2") === 
> nameToAttr("t2.k-1-5")))
> .select(nameToAttr("t1.v-1-10"))
> .join(t2, Inner, Some(nameToAttr("t1.v-1-10") === nameToAttr("t2.k-1-5")))
>   // this can fail
>   Optimize.execute(plan.analyze)
> }
> {code}
> error:
> {code:java}
> [info]   java.lang.AssertionError: assertion failed
> [info]   at scala.Predef$.assert(Predef.scala:208)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.JoinReorderDP$.search(CostBasedJoinReorder.scala:178)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.org$apache$spark$sql$catalyst$optimizer$CostBasedJoinReorder$$reorder(CostBasedJoinReorder.scala:64)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:45)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$1.applyOrElse(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
> [info]   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:41)
> [info]   at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$.apply(CostBasedJoinReorder.scala:35)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34929) MapStatusesSerDeserBenchmark causes a side effect to other benchmarks with tasks being too big (JDK 11)

2021-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34929:
-
Description: 
In JDK 11, MapStatusesSerDeserBenchmark (which fails when started) seems to affect 
other benchmark cases by growing the size of the tasks:

{code}
2021-03-31T16:46:43.1179145Z 21/03/31 16:46:43 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:47.3079315Z 21/03/31 16:46:47 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:51.5920733Z 21/03/31 16:46:51 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:55.9175194Z 21/03/31 16:46:55 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:57.6874541Z   Stopped after 3 iterations, 12928 ms
2021-03-31T16:46:57.6875644Z 
2021-03-31T16:46:57.6877153Z OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1041-azure
2021-03-31T16:46:57.7095280Z Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
2021-03-31T16:46:57.7097654Z from_json as subExpr in Project:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
2021-03-31T16:46:57.7099059Z ------------------------------------------------------------------------------------------------------------------------
2021-03-31T16:46:57.7100274Z subExprElimination false, codegen: true           38880          41246        1389          0.0   388800445.2       1.0X
2021-03-31T16:46:57.7101134Z subExprElimination false, codegen: false          35819          38141        1234          0.0   358188088.6       1.1X
2021-03-31T16:46:57.7106264Z subExprElimination true, codegen: true             3947           4157         364          0.0    39465629.1       9.9X
2021-03-31T16:46:57.7106982Z subExprElimination true, codegen: false            4191           4309         112          0.0    41908945.5       9.3X
2021-03-31T16:46:57.7107595Z 
2021-03-31T16:46:57.7135178Z Preparing data for benchmarking ...
2021-03-31T16:46:58.5630584Z Running benchmark: from_json as subExpr in Filter
2021-03-31T16:46:58.5633083Z   Running case: subExprElimination false, codegen: true
2021-03-31T16:48:25.5619312Z 21/03/31 16:48:25 WARN DAGScheduler: Broadcasting large task binary with size 43.0 MiB
{code}

It only happens when the benchmarks run sequentially via Benchmarks.scala. 

  was:
In JDK 11, MapStatusesSerDeserBenchmark (which fails when started) seems to affect 
other benchmark cases by growing the size of the tasks:

```
2021-03-31T16:46:43.1179145Z 21/03/31 16:46:43 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:47.3079315Z 21/03/31 16:46:47 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:51.5920733Z 21/03/31 16:46:51 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:55.9175194Z 21/03/31 16:46:55 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:57.6874541Z   Stopped after 3 iterations, 12928 ms
2021-03-31T16:46:57.6875644Z 
2021-03-31T16:46:57.6877153Z OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1041-azure
2021-03-31T16:46:57.7095280Z Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
2021-03-31T16:46:57.7097654Z from_json as subExpr in Project:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
2021-03-31T16:46:57.7099059Z ------------------------------------------------------------------------------------------------------------------------
2021-03-31T16:46:57.7100274Z subExprElimination false, codegen: true           38880          41246        1389          0.0   388800445.2       1.0X
2021-03-31T16:46:57.7101134Z subExprElimination false, codegen: false          35819          38141        1234          0.0   358188088.6       1.1X
2021-03-31T16:46:57.7106264Z subExprElimination true, codegen: true             3947           4157         364          0.0    39465629.1       9.9X
2021-03-31T16:46:57.7106982Z subExprElimination true, codegen: false            4191           4309         112          0.0    41908945.5       9.3X
2021-03-31T16:46:57.7107595Z 
2021-03-31T16:46:57.7135178Z Preparing data for benchmarking ...
2021-03-31T16:46:58.5630584Z Running benchmark: from_json as subExpr in Filter
2021-03-31T16:46:58.5633083Z   Running case: subExprElimination false, codegen: true
2021-03-31T16:48:25.5619312Z 21/03/31 16:48:25 WARN DAGScheduler: Broadcasting large task binary with size 43.0 MiB
```

It only happens when the benchmarks run sequentially via Benchmarks.scala. 


> MapStatusesSerDeserBenchmark causes a side effect to other benchmarks with 
> tasks being too big (JDK 11)
> ---
>
> Key: SPARK-34929
> URL: 

[jira] [Created] (SPARK-34929) MapStatusesSerDeserBenchmark causes a side effect to other benchmarks with tasks being too big (JDK 11)

2021-03-31 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-34929:


 Summary: MapStatusesSerDeserBenchmark causes a side effect to 
other benchmarks with tasks being too big (JDK 11)
 Key: SPARK-34929
 URL: https://issues.apache.org/jira/browse/SPARK-34929
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


In JDK 11, MapStatusesSerDeserBenchmark (which fails when started) seems to affect 
other benchmark cases by growing the size of the tasks:

```
2021-03-31T16:46:43.1179145Z 21/03/31 16:46:43 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:47.3079315Z 21/03/31 16:46:47 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:51.5920733Z 21/03/31 16:46:51 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:55.9175194Z 21/03/31 16:46:55 WARN DAGScheduler: Broadcasting large task binary with size 42.2 MiB
2021-03-31T16:46:57.6874541Z   Stopped after 3 iterations, 12928 ms
2021-03-31T16:46:57.6875644Z 
2021-03-31T16:46:57.6877153Z OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Linux 5.4.0-1041-azure
2021-03-31T16:46:57.7095280Z Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
2021-03-31T16:46:57.7097654Z from_json as subExpr in Project:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
2021-03-31T16:46:57.7099059Z ------------------------------------------------------------------------------------------------------------------------
2021-03-31T16:46:57.7100274Z subExprElimination false, codegen: true           38880          41246        1389          0.0   388800445.2       1.0X
2021-03-31T16:46:57.7101134Z subExprElimination false, codegen: false          35819          38141        1234          0.0   358188088.6       1.1X
2021-03-31T16:46:57.7106264Z subExprElimination true, codegen: true             3947           4157         364          0.0    39465629.1       9.9X
2021-03-31T16:46:57.7106982Z subExprElimination true, codegen: false            4191           4309         112          0.0    41908945.5       9.3X
2021-03-31T16:46:57.7107595Z 
2021-03-31T16:46:57.7135178Z Preparing data for benchmarking ...
2021-03-31T16:46:58.5630584Z Running benchmark: from_json as subExpr in Filter
2021-03-31T16:46:58.5633083Z   Running case: subExprElimination false, codegen: true
2021-03-31T16:48:25.5619312Z 21/03/31 16:48:25 WARN DAGScheduler: Broadcasting large task binary with size 43.0 MiB
```

It only happens when the benchmarks run sequentially via Benchmarks.scala. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34928) CTE Execution fails for Sql Server

2021-03-31 Thread Supun De Silva (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Supun De Silva updated SPARK-34928:
---
Description: 
h2. Issue

We have a simple Sql statement that we intend to execute on SQL Server. This 
has a CTE component.

Execution of this yields an error that looks like the following
{code:java}
java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* (version 1.3.1)

This is a particularly annoying issue; because of it we are having to write 
inner queries that are a fair bit inefficient.
h2. SQL statement

(not the actual one but a simplified version with renamed parameters)

 
{code:sql}
WITH OldChanges as (
   SELECT distinct 
SomeDate,
Name
   FROM [dbo].[DateNameFoo] (nolock)
   WHERE SomeDate != '2021-03-30'
   AND convert(date, UpdateDateTime) = '2021-03-31'
)
SELECT * from OldChanges {code}

  was:
h2. Issue

We have a simple Sql statement that we intend to execute on SQL Server. This 
has a CTE component.

Execution of this yields to an error that looks like follows
{code:java}
java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*

This is a particularly annoying issue and due to this we are having to write 
inner queries that are fair bit inefficient.
h2. SQL statement

(not the actual one but a simplified version with renamed parameters)

 
{code:sql}
WITH OldChanges as (
   SELECT distinct 
SomeDate,
Name
   FROM [dbo].[DateNameFoo] (nolock)
   WHERE SomeDate!= '2021-03-30'
   AND convert(date, UpdateDateTime) = '2021-03-31'

SELECT * from OldChanges {code}


> CTE Execution fails for Sql Server
> --
>
> Key: SPARK-34928
> URL: https://issues.apache.org/jira/browse/SPARK-34928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Supun De Silva
>Priority: Minor
>
> h2. Issue
> We have a simple Sql statement that we intend to execute on SQL Server. This 
> has a CTE component.
> Execution of this yields an error that looks like the following
> {code:java}
> java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
> We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver* (version 
> 1.3.1)
> This is a particularly annoying issue; because of it we are having to write 
> inner queries that are a fair bit inefficient.
> h2. SQL statement
> (not the actual one but a simplified version with renamed parameters)
>  
> {code:sql}
> WITH OldChanges as (
>SELECT distinct 
> SomeDate,
> Name
>FROM [dbo].[DateNameFoo] (nolock)
>WHERE SomeDate != '2021-03-30'
>AND convert(date, UpdateDateTime) = '2021-03-31'
> )
> SELECT * from OldChanges {code}
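For reference, a sketch of how such a statement is typically submitted through the 
Spark JDBC source (hypothetical connection details; assumes an existing SparkSession 
{{spark}}). Spark wraps the supplied query/dbtable in a derived table, roughly 
SELECT ... FROM (<query>) <generated alias>, so a leading WITH ends up inside a 
subquery, which is consistent with the error above:

{code:scala}
// Connection details below are placeholders; `spark` is an existing SparkSession.
val cte =
  """WITH OldChanges AS (
    |  SELECT DISTINCT SomeDate, Name
    |  FROM [dbo].[DateNameFoo] (nolock)
    |  WHERE SomeDate != '2021-03-30'
    |    AND convert(date, UpdateDateTime) = '2021-03-31'
    |)
    |SELECT * FROM OldChanges""".stripMargin

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:jtds:sqlserver://example-host:1433/SomeDb")
  .option("driver", "net.sourceforge.jtds.jdbc.Driver")
  .option("query", cte)   // sent to SQL Server wrapped in an outer SELECT ... FROM (...) alias
  .load()
{code}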



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34928) CTE Execution fails for Sql Server

2021-03-31 Thread Supun De Silva (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Supun De Silva updated SPARK-34928:
---
Priority: Minor  (was: Major)

> CTE Execution fails for Sql Server
> --
>
> Key: SPARK-34928
> URL: https://issues.apache.org/jira/browse/SPARK-34928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Supun De Silva
>Priority: Minor
>
> h2. Issue
> We have a simple Sql statement that we intend to execute on SQL Server. This 
> has a CTE component.
> Execution of this yields to an error that looks like follows
> {code:java}
> java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
> We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*
> This is a particularly annoying issue and due to this we are having to write 
> inner queries that are fair bit inefficient.
> h2. SQL statement
> (not the actual one but a simplified version with renamed parameters)
>  
> {code:sql}
> WITH OldChanges as (
>SELECT distinct 
> SomeDate,
> Name
>FROM [dbo].[DateNameFoo] (nolock)
>WHERE SomeDate!= '2021-03-30'
>AND convert(date, UpdateDateTime) = '2021-03-31'
> SELECT * from OldChanges {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34928) CTE Execution fails for Sql Server

2021-03-31 Thread Supun De Silva (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Supun De Silva updated SPARK-34928:
---
Description: 
h2. Issue

We have a simple Sql statement that we intend to execute on SQL Server. This 
has a CTE component.

Execution of this yields to an error that looks like follows
{code:java}
java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*

This is a particularly annoying issue and due to this we are having to write 
inner queries that are fair bit inefficient.
h2. SQL statement

(not the actual one but a simplified version with renamed parameters)

 
{code:sql}
WITH OldChanges as (
   SELECT distinct 
SomeDate,
Name
   FROM [dbo].[DateNameFoo] (nolock)
   WHERE SomeDate!= '2021-03-30'
   AND convert(date, UpdateDateTime) = '2021-03-31'

SELECT * from OldChanges {code}

  was:
h2. Issue

We have a simple Sql statement that we intend to execute on SQL Server. This 
has a CTE component.

Execution of this yields to an error that looks like follows
{code:java}
java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*

This is a particularly annoying issue and due to this we are having to write 
inner queries that are fair bit inefficient.
h2. SQL statement

(not the actual one but simplified renamed parameters)

 
{code:sql}
WITH OldChanges as (
   SELECT distinct 
SomeDate,
Name
   FROM [dbo].[DateNameFoo] (nolock)
   WHERE SomeDate!= '2021-03-30'
   AND convert(date, UpdateDateTime) = '2021-03-31'

SELECT * from OldChanges {code}


> CTE Execution fails for Sql Server
> --
>
> Key: SPARK-34928
> URL: https://issues.apache.org/jira/browse/SPARK-34928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Supun De Silva
>Priority: Major
>
> h2. Issue
> We have a simple Sql statement that we intend to execute on SQL Server. This 
> has a CTE component.
> Execution of this yields to an error that looks like follows
> {code:java}
> java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
> We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*
> This is a particularly annoying issue and due to this we are having to write 
> inner queries that are fair bit inefficient.
> h2. SQL statement
> (not the actual one but a simplified version with renamed parameters)
>  
> {code:sql}
> WITH OldChanges as (
>SELECT distinct 
> SomeDate,
> Name
>FROM [dbo].[DateNameFoo] (nolock)
>WHERE SomeDate!= '2021-03-30'
>AND convert(date, UpdateDateTime) = '2021-03-31'
> SELECT * from OldChanges {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34928) CTE Execution fails for Sql Server

2021-03-31 Thread Supun De Silva (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Supun De Silva updated SPARK-34928:
---
Description: 
h2. Issue

We have a simple Sql statement that we intend to execute on SQL Server. This 
has a CTE component.

Execution of this yields to an error that looks like follows
{code:java}
java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*

This is a particularly annoying issue and due to this we are having to write 
inner queries that are fair bit inefficient.
h2. SQL statement

(not the actual one but simplified renamed parameters)

 
{code:sql}
WITH OldChanges as (
   SELECT distinct 
SomeDate,
Name
   FROM [dbo].[DateNameFoo] (nolock)
   WHERE SomeDate!= '2021-03-30'
   AND convert(date, UpdateDateTime) = '2021-03-31'

SELECT * from OldChanges {code}

  was:
h2. Issue

We have a simple Sql statement that we intend to execute on SQL Server. This 
has a CTE component.

Execution of this yields to an error that looks like follows
{code:java}
java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*

This is a particularly annoying issue and due to this we are having to write 
inner queries that are fair bit inefficient.
h2. SQL statement

(not the actual one but simplified renamed parameters)

 
{code:sql}
WITH OldChanges as (
   SELECT distinct 
Date,
Name
   FROM [dbo].[DateNameFoo] (nolock)
   WHERE Date != '2021-03-30'
   AND convert(date, UpdateDateTime) = '2021-03-31'

SELECT * from OldChanges {code}


> CTE Execution fails for Sql Server
> --
>
> Key: SPARK-34928
> URL: https://issues.apache.org/jira/browse/SPARK-34928
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Supun De Silva
>Priority: Major
>
> h2. Issue
> We have a simple Sql statement that we intend to execute on SQL Server. This 
> has a CTE component.
> Execution of this yields to an error that looks like follows
> {code:java}
> java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
> We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*
> This is a particularly annoying issue and due to this we are having to write 
> inner queries that are fair bit inefficient.
> h2. SQL statement
> (not the actual one but simplified renamed parameters)
>  
> {code:sql}
> WITH OldChanges as (
>SELECT distinct 
> SomeDate,
> Name
>FROM [dbo].[DateNameFoo] (nolock)
>WHERE SomeDate!= '2021-03-30'
>AND convert(date, UpdateDateTime) = '2021-03-31'
> SELECT * from OldChanges {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34928) CTE Execution fails for Sql Server

2021-03-31 Thread Supun De Silva (Jira)
Supun De Silva created SPARK-34928:
--

 Summary: CTE Execution fails for Sql Server
 Key: SPARK-34928
 URL: https://issues.apache.org/jira/browse/SPARK-34928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: Supun De Silva


h2. Issue

We have a simple Sql statement that we intend to execute on SQL Server. This 
has a CTE component.

Execution of this yields an error that looks like the following
{code:java}
java.sql.SQLException: Incorrect syntax near the keyword 'WITH'.{code}
We are using the jdbc driver *net.sourceforge.jtds.jdbc.Driver*

This is a particularly annoying issue; because of it we are having to write 
inner queries that are a fair bit inefficient.
h2. SQL statement

(not the actual one, but a simplified version with renamed parameters)

 
{code:sql}
WITH OldChanges as (
   SELECT distinct 
Date,
Name
   FROM [dbo].[DateNameFoo] (nolock)
   WHERE Date != '2021-03-30'
   AND convert(date, UpdateDateTime) = '2021-03-31'
)
SELECT * from OldChanges {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks

2021-03-31 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-34927:


 Summary: Support TPCDSQueryBenchmark in Benchmarks
 Key: SPARK-34927
 URL: https://issues.apache.org/jira/browse/SPARK-34927
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


Benchmarks.scala currently does not support TPCDSQueryBenchmark. We should add 
support for it. See also 
https://github.com/apache/spark/pull/32015#issuecomment-89046



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34927) Support TPCDSQueryBenchmark in Benchmarks

2021-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34927:
-
Priority: Minor  (was: Major)

> Support TPCDSQueryBenchmark in Benchmarks
> -
>
> Key: SPARK-34927
> URL: https://issues.apache.org/jira/browse/SPARK-34927
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Benchmarks.scala currently does not support TPCDSQueryBenchmark. We should 
> add support for it. See also 
> https://github.com/apache/spark/pull/32015#issuecomment-89046



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34926) PartitionUtils.getPathFragment should handle null value

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312816#comment-17312816
 ] 

Apache Spark commented on SPARK-34926:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32018

> PartitionUtils.getPathFragment should handle null value
> ---
>
> Key: SPARK-34926
> URL: https://issues.apache.org/jira/browse/SPARK-34926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> We support null as a partition value, so 
> PartitionUtils.getPathFragment should support it too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34926) PartitionUtils.getPathFragment should handle null value

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34926:


Assignee: Apache Spark

> PartitionUtils.getPathFragment should handle null value
> ---
>
> Key: SPARK-34926
> URL: https://issues.apache.org/jira/browse/SPARK-34926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> We support null as a partition value, so 
> PartitionUtils.getPathFragment should support it too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34926) PartitionUtils.getPathFragment should handle null value

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34926:


Assignee: (was: Apache Spark)

> PartitionUtils.getPathFragment should handle null value
> ---
>
> Key: SPARK-34926
> URL: https://issues.apache.org/jira/browse/SPARK-34926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> We support null as a partition value, so 
> PartitionUtils.getPathFragment should support it too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34926) PartitionUtils.getPathFragment should handle null value

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312815#comment-17312815
 ] 

Apache Spark commented on SPARK-34926:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32018

> PartitionUtils.getPathFragment should handle null value
> ---
>
> Key: SPARK-34926
> URL: https://issues.apache.org/jira/browse/SPARK-34926
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> We support null as a partition value, so 
> PartitionUtils.getPathFragment should support it too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34926) PartitionUtils.getPathFragment should handle null value

2021-03-31 Thread angerszhu (Jira)
angerszhu created SPARK-34926:
-

 Summary: PartitionUtils.getPathFragment should handle null value
 Key: SPARK-34926
 URL: https://issues.apache.org/jira/browse/SPARK-34926
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


We support null as a partition value, so 
PartitionUtils.getPathFragment should support it too.
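A hedged illustration of the requested behaviour (not Spark's actual implementation; 
escaping is omitted and the Hive default partition name is assumed): a null partition 
value should map to a placeholder instead of failing when the path fragment is built.

{code:scala}
// Simplified stand-in for PartitionUtils.getPathFragment; the real code also escapes values.
val defaultPartitionName = "__HIVE_DEFAULT_PARTITION__"   // assumed placeholder name

def getPathFragment(spec: Map[String, String]): String =
  spec.map { case (col, value) =>
    val v = if (value == null) defaultPartitionName else value
    s"$col=$v"
  }.mkString("/")

// getPathFragment(Map("dt" -> null, "country" -> "US"))
//   => "dt=__HIVE_DEFAULT_PARTITION__/country=US"
{code}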



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34925) Spark shell failed to access external file to read content

2021-03-31 Thread czh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

czh updated SPARK-34925:

Attachment: 微信截图_20210401093951.png

> Spark shell failed to access external file to read content
> --
>
> Key: SPARK-34925
> URL: https://issues.apache.org/jira/browse/SPARK-34925
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 3.0.0
>Reporter: czh
>Priority: Major
> Attachments: 微信截图_20210401093951.png
>
>
> Spark shell failed to access external file to read content
> Spark version 3.0, Hadoop version 3.2
> Start Spark normally and enter the spark shell. Entering 
> {{val b1 = sc.textFile("s3a://spark test/a.txt")}} works, but when entering 
> {{b1.collect().foreach(println)}} to traverse and read the contents of the 
> file, an error is reported.
> The error message is: java.lang.NumberFormatException: For input string: 
> "100M"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34925) Spark shell failed to access external file to read content

2021-03-31 Thread czh (Jira)
czh created SPARK-34925:
---

 Summary: Spark shell failed to access external file to read content
 Key: SPARK-34925
 URL: https://issues.apache.org/jira/browse/SPARK-34925
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 3.0.0
Reporter: czh


Spark shell failed to access external file to read content

Spark version 3.0, Hadoop version 3.2

Start Spark normally and enter the spark shell. Entering 
{{val b1 = sc.textFile("s3a://spark test/a.txt")}} works, but when entering 
{{b1.collect().foreach(println)}} to traverse and read the contents of the file, 
an error is reported.

The error message is: java.lang.NumberFormatException: For input string: "100M"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-03-31 Thread Daniel Zhi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311818#comment-17311818
 ] 

Daniel Zhi edited comment on SPARK-23977 at 3/31/21, 11:59 PM:
---

[~ste...@apache.org] Thanks for the info. Below are the related (key, value) we 
used:
 # spark.hadoop.fs.s3a.committer.name — partitioned
 # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a — 
org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
 # spark.sql.sources.commitProtocolClass — 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
 # spark.sql.parquet.output.committer.class — 
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

3 & 4 appear to be necessary to ensure the S3A committers are used by Spark for 
parquet outputs, except that "INSERT OVERWRITE" is blocked by the 
dynamicPartitionOverwrite exception. It would be helpful and appreciated if you 
could elaborate on the proper way to "use the partitioned committer and 
configure it to do the right thing ..." in Spark. For example:
 * PathOutputCommitProtocol appears to be needed but its constructor fails with 
the exception.
 * S3A committers honor "fs.s3a.committer.staging.conflict-mode" which needs to 
be "replace" for "INSERT OVERWRITE" but "append" for "INSERT INTO". So it is 
spark.sql query specific. How to make spark.sql automatically set the right 
value?
 * Does the above require a code change in Spark, or is there a configuration-only way?
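For reference, a sketch of one way to pass the four settings listed above, plus the 
conflict-mode discussed in the second bullet, through a SparkSession builder (values 
taken from this comment; not a recommendation on which conflict-mode is right):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-committer-config")
  .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
  .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
    "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  // The query-specific piece: "replace" matches INSERT OVERWRITE semantics,
  // "append" matches INSERT INTO; shown here with a single fixed value.
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
  .getOrCreate()
{code}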


was (Author: danzhi):
[~ste...@apache.org] Thanks for the info. Below are the related (key, value) we 
used:
 # spark.hadoop.fs.s3a.committer.name — partitioned
 # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a — 
org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
 # spark.sql.sources.commitProtocolClass — 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
 # spark.sql.parquet.output.committer.class — 
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

3 & 4 appear to be necessary to ensure S3A committers being used by Spark for 
parquet outputs, except that "INSERT OVERWRITE" is blocked by the 
dynamicPartitionOverwrite exception. It will be helpful and appreciated if you 
can patiently elaborate on the proper way to "use the partitioned committer and 
configure it to do the right thing ..." in Spark. For example:
 * PathOutputCommitProtocol appears to be needed but its constructor fails with 
the exception.
 * S3A committers honor "fs.s3a.committer.staging.conflict-mode" which needs to 
be "replace" for "INSERT OVERWRITE" but "append" for "INSERT INTO". So it is 
spark.sql query specific. How to make it such way?
 * Does above require code change in Spark or there is a configuration-only way?

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at any stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to 

[jira] [Comment Edited] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

2021-03-31 Thread Daniel Zhi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311818#comment-17311818
 ] 

Daniel Zhi edited comment on SPARK-23977 at 3/31/21, 11:57 PM:
---

[~ste...@apache.org] Thanks for the info. Below are the related (key, value) we 
used:
 # spark.hadoop.fs.s3a.committer.name — partitioned
 # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a — 
org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
 # spark.sql.sources.commitProtocolClass — 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
 # spark.sql.parquet.output.committer.class — 
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

3 & 4 appear to be necessary to ensure S3A committers being used by Spark for 
parquet outputs, except that "INSERT OVERWRITE" is blocked by the 
dynamicPartitionOverwrite exception. It will be helpful and appreciated if you 
can patiently elaborate on the proper way to "use the partitioned committer and 
configure it to do the right thing ..." in Spark. For example:
 * PathOutputCommitProtocol appears to be needed but its constructor fails with 
the exception.
 * S3A committers honor "fs.s3a.committer.staging.conflict-mode" which needs to 
be "replace" for "INSERT OVERWRITE" but "append" for "INSERT INTO". So it is 
spark.sql query specific. How to make it such way?
 * Does above require code change in Spark or there is a configuration-only way?


was (Author: danzhi):
[~ste...@apache.org] Thanks for the info. Below are the related (key, value) we 
used:
 # spark.hadoop.fs.s3a.committer.name --- partitioned
 # spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a --- 
org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
 # spark.sql.sources.commitProtocolClass --- 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
 # spark.sql.parquet.output.committer.class --- 
org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter

3 & 4 appear to be necessary to ensure S3A committers being used by Spark for 
parquet outputs, except that "INSERT OVERWRITE" is blocked by the 
dynamicPartitionOverwrite exception. It will be helpful and appreciated if you 
can patiently elaborate on the proper way to "use the partitioned committer and 
configure it to do the right thing ..." in Spark.

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> ---
>
> Key: SPARK-23977
> URL: https://issues.apache.org/jira/browse/SPARK-23977
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Minor
> Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers 
> (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation, S3A committers, 
> HADOOP-13786
> These committers deliver high-performance output of MR and spark jobs to S3, 
> and offer the key semantics which Spark depends on: no visible output until 
> job commit, a failure of a task at any stage, including partway through task 
> commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task commit failure recovery semantics as the 
> (v1) expectation: "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail 
> entirely (SPARK-15849)
> Note also that FileOutputFormat "v2" commit algorithm doesn't offer any of 
> the commit semantics w.r.t observability of or recovery from task commit 
> failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the 
> destination through multipart uploads, uploads which are only completed in 
> job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work 
> with the S3A committers and any other, by adding a plugin mechanism into the 
> MRv2 FileOutputFormat class, where job config and filesystem configuration 
> options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 
> classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a 
> subclass of {{ParquetOutputCommitter}}
> This patch builds on SPARK-23807 for setting up the dependencies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To 

[jira] [Comment Edited] (SPARK-34910) JDBC - Add an option for different stride orders

2021-03-31 Thread Jason Yarbrough (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312702#comment-17312702
 ] 

Jason Yarbrough edited comment on SPARK-34910 at 3/31/21, 11:51 PM:


Hi [~srowen], thanks for asking. The idea here is that if a user's data is 
heavily skewed to the "right" (or upper bound), the last strides (and 
therefore partitions/tasks) will have the highest density of data. Say a stage 
has 32 tasks and the first 30 finish at the same time; the last two tasks, 
which are the heaviest, would then start. The issue with this is that 2 cores 
would be running for a long time while the other cores sit idle (since there 
are no more tasks left). The hope is that if we have the option to reverse 
that, the first 2 tasks will be the heaviest, and while 2 cores work on those, 
the other cores will be working on the next tasks. As we increase the partition 
count, we try to flatten out this issue, but having the option to use 
descending order could still be of help.

As for a random stride order, it's just another option that may help flatten 
the distribution (even more so with a higher partition count). This one is a 
little more of an edge usage, but could help break apart some hot spots, and is 
easy to implement within the pattern I've developed.

On somewhat of a side-note (although I may try to implement this in the current 
code, but may save it for another branch), I think something that would be 
pretty nice to add is a ranking of the density of partitions. Depending on if 
the column is indexed (I would recommend that for columns people are 
partitioning on), the performance impact of doing the extra queries/count may 
not be that bad. Once a count of records per partition is created, we could 
order the partition array in a way where the most dense partitions are towards 
the head.

Implementing what I'm proposing here gives people more options on how their 
data is processed, and can also be extended for other algorithms if they make 
sense.
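
A standalone sketch (made-up helper, not Spark's JDBCRelation code) of what an 
optional descending stride order could look like: the same strides as today, just 
handed out in reverse so the densest upper-bound strides are scheduled first:

{code:scala}
object StrideOrderSketch {
  // Build one WHERE-clause predicate per partition, then optionally hand them out in reverse.
  def stridePredicates(column: String, lower: Long, upper: Long,
      numPartitions: Int, descending: Boolean): Seq[String] = {
    require(numPartitions >= 2, "need at least two partitions for strides")
    val stride = (upper - lower) / numPartitions
    val bounds = (1 until numPartitions).map(i => lower + i * stride)
    val middle = bounds.zip(bounds.tail).map { case (lo, hi) =>
      s"$column >= $lo AND $column < $hi"
    }
    val ascending = (s"$column < ${bounds.head}" +: middle) :+ s"$column >= ${bounds.last}"
    if (descending) ascending.reverse else ascending
  }

  def main(args: Array[String]): Unit = {
    // With data skewed toward the upper bound, the ">= 75" stride is scheduled first.
    stridePredicates("id", 0L, 100L, numPartitions = 4, descending = true).foreach(println)
  }
}
{code}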


was (Author: hanover-fiste):
Hi [~srowen], thanks for asking. The idea here is that if a user's data is 
heavily skewed to the "right" (or upper bound), that the last strides (and 
therefore partitions/tasks) will have the most density of data. Say a stage has 
32 tasks, and the first 30 finish at the same time, the last two task that are 
the heaviest would then start. The issue with this is, 2 cores would be running 
for a long time while the other cores are sitting idle (since there are no more 
tasks left). The hope is, if we have the option to reverse that, the first 2 
task will be the heaviest, and while 2 cores work on those, the other cores 
will be working on the next tasks. As we increase the partition count, we try 
to flatten out this issue, but having the option to do descending could still 
be of help.

As for a random stride order, it's just another option that may help flatten 
the distribution (even more so with a higher partition count). This one is a 
little more of a edge usage, but could help break apart some hot spots, and is 
easy to implement within the pattern I've developed.

On somewhat of a side-note (although I may try to implement this in the current 
code, but may save it for another branch), I think something that would be 
pretty nice to add is a ranking of the density of partitions. Depending on if 
the column is indexed (I would recommend that for columns people are 
partitioning on), the performance impact of doing the extra queries/count may 
not be that bad. Once a count of records per partition is created, we could 
order the partition array in a way where the most dense partitions are towards 
the head.

Implementing what I'm proposing here gives people more options on how their 
data is processed, and can also be extended for other algorithms if they make 
sense.

> JDBC - Add an option for different stride orders
> 
>
> Key: SPARK-34910
> URL: https://issues.apache.org/jira/browse/SPARK-34910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jason Yarbrough
>Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in 
> ascending order, starting from the lower bound and working its way towards 
> the upper bound.
> I'm proposing leaving that as the default, but adding an option (such as 
> strideOrder) in JDBCOptions. Since it will default to the current behavior, 
> this will keep people's current code working as expected. However, people who 
> may have data skew closer to the upper bound might appreciate being able to 
> have the strides in descending order, thus filling up the first partition 
> with the last stride and so forth. Also, people with nondeterministic data 
> skew or sporadic data density might be able to benefit from a random ordering 
> of the strides.

[jira] [Resolved] (SPARK-34882) RewriteDistinctAggregates can cause a bug if the aggregator does not ignore NULLs

2021-03-31 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34882.
--
Fix Version/s: 3.2.0
 Assignee: Tanel Kiis
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31983

> RewriteDistinctAggregates can cause a bug if the aggregator does not ignore 
> NULLs
> -
>
> Key: SPARK-34882
> URL: https://issues.apache.org/jira/browse/SPARK-34882
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.2.0, 3.1.2, 3.0.3
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
>  Labels: correctness
> Fix For: 3.2.0
>
>
> {code:title=group-by.sql}
> SELECT
> first(DISTINCT a), last(DISTINCT a),
> first(a), last(a),
> first(DISTINCT b), last(DISTINCT b),
> first(b), last(b)
> FROM testData WHERE a IS NOT NULL AND b IS NOT NULL;{code}
> {code:title=group-by.sql.out}
> -- !query schema
> struct<first(DISTINCT a):int,last(DISTINCT a):int,first(a):int,last(a):int,first(DISTINCT 
> b):int,last(DISTINCT b):int,first(b):int,last(b):int>
> -- !query output
> NULL  1   1   3   1   NULL  1   2
> {code}
> The results should not be NULL, because NULL inputs are filtered out.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders

2021-03-31 Thread Jason Yarbrough (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312749#comment-17312749
 ] 

Jason Yarbrough commented on SPARK-34910:
-

It's not that the records/rows themselves are skewed; it's that the number of 
records within a stride can be skewed (e.g. you're partitioning on year and 
later years have much more data).

 

Say we have 11 partitions, 5 cores total, and the last partition has triple the 
amount of data as the first 10 due to data skew. We'll say partitions 1 through 
10 take 10 minutes a core, so partition 11 would take 30 minutes on a core.

 

Current stride order of *ascending*:

_Group 1_: 5 tasks x 10 minutes, all running concurrently on the 5 cores = 10 minutes

_Group 2_: 5 tasks x 10 minutes, all running concurrently = 10 minutes

_Group 3_: 1 task x 30 minutes, with 4 cores sitting idle = 30 minutes

Group 1 (10) + Group 2 (10) + Group 3 (30)
*Total of 50 minutes*

 

Stride order of *descending*:

_Group 1_: 1 task x 30 minutes on one core = 30 minutes
The remaining 10 tasks (4 x 10, then 4 x 10, then 2 x 10) run on the other 4 cores, 
concurrently with the first core = 30 minutes

*Total of 30 minutes*

 

Hopefully that's a good example. However, let me know if I've gone off the deep 
end :)
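
To make the arithmetic concrete, here is a small, purely illustrative greedy-scheduling 
sketch (not Spark code) that assigns each task to the earliest-free core and reports the 
wall-clock time for both orderings:
{code:scala}
// 10 strides of 10 minutes plus 1 stride of 30 minutes on 5 cores; greedy assignment
// to the earliest-free core approximates how tasks are handed out as cores free up.
def makespan(taskMinutes: Seq[Int], cores: Int): Int = {
  val coreBusyUntil = Array.fill(cores)(0)
  taskMinutes.foreach { t =>
    val i = coreBusyUntil.indexOf(coreBusyUntil.min) // earliest-free core
    coreBusyUntil(i) += t
  }
  coreBusyUntil.max
}

val ascending  = Seq.fill(10)(10) :+ 30   // heavy stride scheduled last
val descending = 30 +: Seq.fill(10)(10)   // heavy stride scheduled first

println(makespan(ascending, 5))   // 50 minutes
println(makespan(descending, 5))  // 30 minutes
{code}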

> JDBC - Add an option for different stride orders
> 
>
> Key: SPARK-34910
> URL: https://issues.apache.org/jira/browse/SPARK-34910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jason Yarbrough
>Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in 
> ascending order, starting from the lower bound and working its way towards 
> the upper bound.
> I'm proposing leaving that as the default, but adding an option (such as 
> strideOrder) in JDBCOptions. Since it will default to the current behavior, 
> this will keep people's current code working as expected. However, people who 
> may have data skew closer to the upper bound might appreciate being able to 
> have the strides in descending order, thus filling up the first partition 
> with the last stride and so forth. Also, people with nondeterministic data 
> skew or sporadic data density might be able to benefit from a random ordering 
> of the strides.
> I have the code created to implement this, and it creates a pattern that can 
> be used to add other algorithms that people may want to add (such as counting 
> the rows and ranking each stride, and then ordering from most dense to 
> least). The two options I have coded so far are 'descending' and 'random.'
> The original idea was to create something closer to Spark's hash partitioner, 
> but for JDBC and pushed down to the database engine for efficiency. However, 
> that would require adding hashing algorithms for each dialect, and the 
> performance from those algorithms may outweigh the benefit. The method I'm 
> proposing in this ticket avoids those complexities while still giving some of 
> the benefit (in the case of random ordering).
> I'll put a PR in if others feel this is a good idea.
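
Purely as an illustration of the proposal (not the actual JDBCRelation code; the 
option name strideOrder is the one suggested in the description), a sketch of how 
stride start points could be generated and then reordered:
{code:scala}
import scala.util.Random

// Illustrative sketch: evenly sized strides between a lower and upper bound, then
// reordered by a hypothetical "strideOrder" setting ("ascending" | "descending" | "random").
def strideStarts(lowerBound: Long, upperBound: Long, numPartitions: Int): Seq[Long] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map(i => lowerBound + i * stride)
}

def reorder(starts: Seq[Long], strideOrder: String): Seq[Long] = strideOrder match {
  case "descending" => starts.reverse          // heaviest (upper-bound) stride first
  case "random"     => Random.shuffle(starts)  // spread hot spots across the schedule
  case _            => starts                  // default: current ascending behavior
}

// Example: a year column from 2000 (inclusive) to 2022 (exclusive) in 11 strides,
// processed heaviest-first if later years hold the most data.
println(reorder(strideStarts(2000L, 2022L, 11), "descending"))
{code}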



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34923) Metadata output should not always be propagated

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312743#comment-17312743
 ] 

Apache Spark commented on SPARK-34923:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32017

> Metadata output should not always be propagated
> ---
>
> Key: SPARK-34923
> URL: https://issues.apache.org/jira/browse/SPARK-34923
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karen Feng
>Priority: Major
>
> Today, the vast majority of expressions uncritically propagate metadata 
> output from their children. As a general rule, it seems reasonable that only 
> expressions that propagate their children's output should also propagate 
> their children's metadata output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34923) Metadata output should not always be propagated

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34923:


Assignee: Apache Spark

> Metadata output should not always be propagated
> ---
>
> Key: SPARK-34923
> URL: https://issues.apache.org/jira/browse/SPARK-34923
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karen Feng
>Assignee: Apache Spark
>Priority: Major
>
> Today, the vast majority of expressions uncritically propagate metadata 
> output from their children. As a general rule, it seems reasonable that only 
> expressions that propagate their children's output should also propagate 
> their children's metadata output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34923) Metadata output should not always be propagated

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34923:


Assignee: (was: Apache Spark)

> Metadata output should not always be propagated
> ---
>
> Key: SPARK-34923
> URL: https://issues.apache.org/jira/browse/SPARK-34923
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karen Feng
>Priority: Major
>
> Today, the vast majority of expressions uncritically propagate metadata 
> output from their children. As a general rule, it seems reasonable that only 
> expressions that propagate their children's output should also propagate 
> their children's metadata output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders

2021-03-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312708#comment-17312708
 ] 

Sean R. Owen commented on SPARK-34910:
--

In this example, how are the rows 'skewed'? If there are 1M rows, there are just 
1M rows to process in any event. Is this related to the result of an 
aggregation, or are the operations on certain rows unnaturally long-running? If 
so, why would it tend to be the first or last partition that's slow - isn't it 
smaller, if anything? And it would just be 1 partition in any event.


> JDBC - Add an option for different stride orders
> 
>
> Key: SPARK-34910
> URL: https://issues.apache.org/jira/browse/SPARK-34910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jason Yarbrough
>Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in 
> ascending order, starting from the lower bound and working its way towards 
> the upper bound.
> I'm proposing leaving that as the default, but adding an option (such as 
> strideOrder) in JDBCOptions. Since it will default to the current behavior, 
> this will keep people's current code working as expected. However, people who 
> may have data skew closer to the upper bound might appreciate being able to 
> have the strides in descending order, thus filling up the first partition 
> with the last stride and so forth. Also, people with nondeterministic data 
> skew or sporadic data density might be able to benefit from a random ordering 
> of the strides.
> I have the code created to implement this, and it creates a pattern that can 
> be used to add other algorithms that people may want to add (such as counting 
> the rows and ranking each stride, and then ordering from most dense to 
> least). The two options I have coded so far are 'descending' and 'random.'
> The original idea was to create something closer to Spark's hash partitioner, 
> but for JDBC and pushed down to the database engine for efficiency. However, 
> that would require adding hashing algorithms for each dialect, and the 
> performance from those algorithms may outweigh the benefit. The method I'm 
> proposing in this ticket avoids those complexities while still giving some of 
> the benefit (in the case of random ordering).
> I'll put a PR in if others feel this is a good idea.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders

2021-03-31 Thread Jason Yarbrough (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312702#comment-17312702
 ] 

Jason Yarbrough commented on SPARK-34910:
-

Hi [~srowen], thanks for asking. The idea here is that if a user's data is 
heavily skewed to the "right" (or upper bound), the last strides (and 
therefore partitions/tasks) will have the highest density of data. Say a stage has 
32 tasks, and the first 30 finish at the same time; the last two tasks, which are 
the heaviest, would then start. The issue with this is that 2 cores would be running 
for a long time while the other cores sit idle (since there are no more 
tasks left). The hope is that, if we have the option to reverse that, the first 2 
tasks will be the heaviest, and while 2 cores work on those, the other cores 
will be working on the next tasks. As we increase the partition count, we try 
to flatten out this issue, but having the option to do descending could still 
be of help.

As for a random stride order, it's just another option that may help flatten 
the distribution (even more so with a higher partition count). This one is more 
of an edge case, but it could help break apart some hot spots, and it is 
easy to implement within the pattern I've developed.

On somewhat of a side note (I may try to implement this in the current 
code, or save it for another branch), I think something that would be 
pretty nice to add is a ranking of partition density. Depending on whether 
the column is indexed (which I would recommend for columns people are 
partitioning on), the performance impact of the extra count queries may 
not be that bad. Once a count of records per partition is created, we could 
order the partition array so that the densest partitions are toward 
the head.

Implementing what I'm proposing here gives people more options on how their 
data is processed, and can also be extended for other algorithms if they make 
sense.

> JDBC - Add an option for different stride orders
> 
>
> Key: SPARK-34910
> URL: https://issues.apache.org/jira/browse/SPARK-34910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jason Yarbrough
>Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in 
> ascending order, starting from the lower bound and working its way towards 
> the upper bound.
> I'm proposing leaving that as the default, but adding an option (such as 
> strideOrder) in JDBCOptions. Since it will default to the current behavior, 
> this will keep people's current code working as expected. However, people who 
> may have data skew closer to the upper bound might appreciate being able to 
> have the strides in descending order, thus filling up the first partition 
> with the last stride and so forth. Also, people with nondeterministic data 
> skew or sporadic data density might be able to benefit from a random ordering 
> of the strides.
> I have the code created to implement this, and it creates a pattern that can 
> be used to add other algorithms that people may want to add (such as counting 
> the rows and ranking each stride, and then ordering from most dense to 
> least). The two options I have coded so far are 'descending' and 'random.'
> The original idea was to create something closer to Spark's hash partitioner, 
> but for JDBC and pushed down to the database engine for efficiency. However, 
> that would require adding hashing algorithms for each dialect, and the 
> performance from those algorithms may outweigh the benefit. The method I'm 
> proposing in this ticket avoids those complexities while still giving some of 
> the benefit (in the case of random ordering).
> I'll put a PR in if others feel this is a good idea.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312696#comment-17312696
 ] 

Shane Knapp edited comment on SPARK-34738 at 3/31/21, 8:38 PM:
---

managed to snag the logs from the pod when it errored out:
{code:java}
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class 
org.apache.spark.examples.DFSReadWriteTest 
local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar 
/opt/spark/pv-tests/tmp4595937990978494271.txt /opt/spark/pv-tests
21/03/31 20:26:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist
DFS Read-Write Test
Usage: localFile dfsDir
localFile - (string) local file to use in test
dfsDir - (string) DFS directory for read/write tests
log4j:WARN No appenders could be found for logger 
(org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.{code}
this def caught my eye:  Given path 
(/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist

i sshed in to the cluster, was able (again) to confirm that mk was able to 
mount the PVC test dir on that worker in /tmp, and that the file 
tmp4595937990978494271.txt was visible and readable from within mk...  however 
/opt/spark/pv-tests/ wasn't visible within the mk cluster.


was (Author: shaneknapp):
managed to snag the logs from the pod when it errored out:
{code:java}
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class 
org.apache.spark.examples.DFSReadWriteTest 
local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar 
/opt/spark/pv-tests/tmp4595937990978494271.txt /opt/spark/pv-tests
21/03/31 20:26:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist
DFS Read-Write Test
Usage: localFile dfsDir
localFile - (string) local file to use in test
dfsDir - (string) DFS directory for read/write tests
log4j:WARN No appenders could be found for logger 
(org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.{code}
this def caught my eye:  Given path 
(/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist

i sshed in to the cluster, was able (again) to confirm that mk was able to 
mount the PVC test dir on that worker, and that the file 
tmp4595937990978494271.txt was visible and readable from within mk...

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> 

[jira] [Created] (SPARK-34924) Add nested column scan in DataSourceReadBenchmark

2021-03-31 Thread Cheng Su (Jira)
Cheng Su created SPARK-34924:


 Summary: Add nested column scan in DataSourceReadBenchmark
 Key: SPARK-34924
 URL: https://issues.apache.org/jira/browse/SPARK-34924
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


For the table scan benchmark `DataSourceReadBenchmark.scala`, we are only 
testing atomic data types but not nested column data types (array/struct/map) - 
[https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala#L544-L567]
 . We should add scan benchmarks for nested column data types as well.
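
A hedged sketch of the kind of nested-column case that could be added (paths, sizes, 
and column names here are made up; the real addition would follow the existing 
structure of DataSourceReadBenchmark):
{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: write a Parquet table with struct/array/map columns and time a
// scan of a nested field, analogous to the existing atomic-type cases.
val spark = SparkSession.builder().appName("nested-scan-sketch").master("local[*]").getOrCreate()

val df = spark.range(1000000L).selectExpr(
  "named_struct('a', id, 'b', cast(id as string)) AS s",
  "array(id, id + 1, id + 2) AS arr",
  "map(cast(id as string), id) AS m")
df.write.mode("overwrite").parquet("/tmp/nested_scan_bench")

val start = System.nanoTime()
val n = spark.read.parquet("/tmp/nested_scan_bench").selectExpr("s.a", "arr[0]").count()
println(s"scanned $n rows of nested columns in ${(System.nanoTime() - start) / 1e6} ms")
{code}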



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312696#comment-17312696
 ] 

Shane Knapp commented on SPARK-34738:
-

managed to snag the logs from the pod when it errored out:
{code:java}
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf 
"spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file 
/opt/spark/conf/spark.properties --class 
org.apache.spark.examples.DFSReadWriteTest 
local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar 
/opt/spark/pv-tests/tmp4595937990978494271.txt /opt/spark/pv-tests
21/03/31 20:26:24 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Given path (/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist
DFS Read-Write Test
Usage: localFile dfsDir
localFile - (string) local file to use in test
dfsDir - (string) DFS directory for read/write tests
log4j:WARN No appenders could be found for logger 
(org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.{code}
this def caught my eye:  Given path 
(/opt/spark/pv-tests/tmp4595937990978494271.txt) does not exist

i sshed in to the cluster, was able (again) to confirm that mk was able to 
mount the PVC test dir on that worker, and that the file 
tmp4595937990978494271.txt was visible and readable from within mk...

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33638) Full support of V2 table creation in Structured Streaming writer path

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-33638:
-
Priority: Major  (was: Blocker)

> Full support of V2 table creation in Structured Streaming writer path
> -
>
> Key: SPARK-33638
> URL: https://issues.apache.org/jira/browse/SPARK-33638
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Currently, we want to add support for "create if not exists" in the 
> DataStreamWriter.toTable API. Since the file format in streaming doesn't 
> support DSv2 for now, the current implementation mainly focuses on V1 
> support. More work is needed for full support of V2 table creation.
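
A hedged sketch of the API in question (illustrative only; the V1 path is what works 
today, and full DSv2 table creation is what this ticket tracks):
{code:scala}
import org.apache.spark.sql.streaming.Trigger

// DataStreamWriter.toTable (added in 3.1) writes the stream into a table, creating it
// if it does not exist. The checkpoint path and table name below are made up.
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .option("checkpointLocation", "/tmp/ckpt_totable_sketch")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .toTable("events_sketch")
{code}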



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33694) Auditing the API changes and behavior changes in Spark 3.1

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-33694.
--
Resolution: Done

> Auditing the API changes and behavior changes in Spark 3.1
> --
>
> Key: SPARK-33694
> URL: https://issues.apache.org/jira/browse/SPARK-33694
> Project: Spark
>  Issue Type: Epic
>  Components: PySpark, Spark Core, SparkR, SQL, Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34425) Improve error message in default spark session catalog

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34425:
-
Issue Type: Improvement  (was: Bug)
  Priority: Minor  (was: Blocker)

Feel free to revert the status if I'm missing why it needs to be a Blocker

> Improve error message in default spark session catalog
> --
>
> Key: SPARK-34425
> URL: https://issues.apache.org/jira/browse/SPARK-34425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> See [https://github.com/apache/spark/pull/31541/files]
> We should make sure that "unsupported function name 'default.ns1.ns2.fun'" 
> contains information about why the function name is unsupported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34923) Metadata output should not always be propagated

2021-03-31 Thread Karen Feng (Jira)
Karen Feng created SPARK-34923:
--

 Summary: Metadata output should not always be propagated
 Key: SPARK-34923
 URL: https://issues.apache.org/jira/browse/SPARK-34923
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Karen Feng


Today, the vast majority of expressions uncritically propagate metadata output 
from their children. As a general rule, it seems reasonable that only 
expressions that propagate their children's output should also propagate their 
children's metadata output.
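
As a hedged illustration of the rule being proposed (simplified stand-in types, not 
the actual Catalyst API), the idea is roughly:
{code:scala}
// Illustrative sketch: a node exposes its children's metadata output only when it
// also passes their regular output through unchanged.
case class Attr(name: String)

trait Node {
  def children: Seq[Node]
  def output: Seq[Attr]
  def metadataOutput: Seq[Attr] =
    if (output == children.flatMap(_.output)) children.flatMap(_.metadataOutput)
    else Nil
}

case class Relation(output: Seq[Attr], meta: Seq[Attr]) extends Node {
  val children: Seq[Node] = Nil
  override def metadataOutput: Seq[Attr] = meta
}
case class Filter(child: Node) extends Node {            // passes output through
  val children: Seq[Node] = Seq(child)
  val output: Seq[Attr] = child.output
}
case class Project(projectList: Seq[Attr], child: Node) extends Node { // reshapes output
  val children: Seq[Node] = Seq(child)
  val output: Seq[Attr] = projectList
}

val rel = Relation(Seq(Attr("a"), Attr("b")), Seq(Attr("_file")))
println(Filter(rel).metadataOutput)                  // List(Attr(_file)): still visible
println(Project(Seq(Attr("a")), rel).metadataOutput) // List(): not propagated
{code}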



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34904.
--
Resolution: Not A Problem

> Old import of LZ4 package inside CompressionCodec.scala 
> 
>
> Key: SPARK-34904
> URL: https://issues.apache.org/jira/browse/SPARK-34904
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.1.1
>Reporter: Michal Zeman
>Priority: Minor
>
> This commit should upgrade the version of the LZ4 package: 
> [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f]
>  
> The dependency was changed. However, inside the file [CompressionCodec.scala 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala]
>  the old import referencing net.jpountz.lz4 (where versions up to 1.3 are 
> held) remains. 
> Probably for backward compatibility, the newer version of the org.lz4 
> package still contains net.jpountz.lz4 
> ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. 
> Therefore this import does not cause problems at first sight. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34903) Return day-time interval from timestamps subtraction

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34903:


Assignee: Max Gekk  (was: Apache Spark)

> Return day-time interval from timestamps subtraction
> 
>
> Key: SPARK-34903
> URL: https://issues.apache.org/jira/browse/SPARK-34903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Modify SubtractTimestamps to return DayTimeIntervalType (under a flag).
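
For context, a small spark-shell sketch of the behavior this changes (illustrative 
only; the exact flag name is part of the ongoing ANSI interval work and is deliberately 
not shown here):
{code:scala}
// Subtracting two timestamps. Today the result is a CalendarIntervalType value; with
// the proposed change (under the flag) it would be a DayTimeIntervalType value.
val df = spark.sql(
  "SELECT timestamp'2021-03-31 12:30:00' - timestamp'2021-03-30 00:00:00' AS diff")
df.printSchema()   // with the new behavior: diff of day-time interval type
df.show(false)     // roughly 1 day, 12 hours, 30 minutes, formatted per the result type
{code}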



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34903) Return day-time interval from timestamps subtraction

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312568#comment-17312568
 ] 

Apache Spark commented on SPARK-34903:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32016

> Return day-time interval from timestamps subtraction
> 
>
> Key: SPARK-34903
> URL: https://issues.apache.org/jira/browse/SPARK-34903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Modify SubtractTimestamps to return DayTimeIntervalType (under a flag).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34903) Return day-time interval from timestamps subtraction

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312569#comment-17312569
 ] 

Apache Spark commented on SPARK-34903:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32016

> Return day-time interval from timestamps subtraction
> 
>
> Key: SPARK-34903
> URL: https://issues.apache.org/jira/browse/SPARK-34903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Modify SubtractTimestamps to return DayTimeIntervalType (under a flag).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34903) Return day-time interval from timestamps subtraction

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34903:


Assignee: Apache Spark  (was: Max Gekk)

> Return day-time interval from timestamps subtraction
> 
>
> Key: SPARK-34903
> URL: https://issues.apache.org/jira/browse/SPARK-34903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Modify SubtractTimestamps to return DayTimeIntervalType (under a flag).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312548#comment-17312548
 ] 

Shane Knapp commented on SPARK-34738:
-

fwiw, the pod is mounting the local FS correctly:
{code:java}
jenkins@research-jenkins-worker-08:~$ minikube ssh
docker@minikube:~$ cd /tmp
docker@minikube:/tmp$ ls
gvisor h.829 h.912 hostpath-provisioner hostpath_pv tmp.iSJp3I8otl
docker@minikube:/tmp$ cd tmp.iSJp3I8otl/
docker@minikube:/tmp/tmp.iSJp3I8otl$ touch ASDF
docker@minikube:/tmp/tmp.iSJp3I8otl$ logout
jenkins@research-jenkins-worker-08:~$ cd /tmp/tmp.iSJp3I8otl/
jenkins@research-jenkins-worker-08:/tmp/tmp.iSJp3I8otl$ ls
ASDF tmp3533667297442374713.txt
jenkins@research-jenkins-worker-08:/tmp/tmp.iSJp3I8otl$ cat 
tmp3533667297442374713.txt
test PVs{code}

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala

2021-03-31 Thread Michal Zeman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312547#comment-17312547
 ] 

Michal Zeman commented on SPARK-34904:
--

[~srowen] Nope, sorry, embarrassing. I confused the location of the artifact 
with the namespace of the actual code... 

> Old import of LZ4 package inside CompressionCodec.scala 
> 
>
> Key: SPARK-34904
> URL: https://issues.apache.org/jira/browse/SPARK-34904
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.1.1
>Reporter: Michal Zeman
>Priority: Minor
>
> This commit should upgrade the version of the LZ4 package: 
> [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f]
>  
> The dependency was changed. However, inside the file [CompressionCodec.scala 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala]
>  the old import referencing net.jpountz.lz4 (where versions up to 1.3 are 
> held) remains. 
> Probably for backward compatibility, the newer version of the org.lz4 
> package still contains net.jpountz.lz4 
> ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. 
> Therefore this import does not cause problems at first sight. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala

2021-03-31 Thread Michal Zeman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michal Zeman updated SPARK-34904:
-
Issue Type: Question  (was: Bug)

> Old import of LZ4 package inside CompressionCodec.scala 
> 
>
> Key: SPARK-34904
> URL: https://issues.apache.org/jira/browse/SPARK-34904
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.1.1
>Reporter: Michal Zeman
>Priority: Minor
>
> This commit should upgrade the version of the LZ4 package: 
> [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f]
>  
> The dependency was changed. However, inside the file [CompressionCodec.scala 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala]
>  the old import referencing net.jpountz.lz4 (where versions up to 1.3 are 
> held) remains. 
> Probably for backward compatibility, the newer version of the org.lz4 
> package still contains net.jpountz.lz4 
> ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. 
> Therefore this import does not cause problems at first sight. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312517#comment-17312517
 ] 

Shane Knapp edited comment on SPARK-34738 at 3/31/21, 4:43 PM:
---

alright, sometimes these things go smoothly, sometimes not.

this is firmly in the 'not' camp.

after upgrading minikube and k8s, i was unable to mount a persistent volume 
when using the kvm2 driver.  much debugging ensued.  no progress was made and 
the error reported was that the minikube pod was unable to connect to the 
localhost and mount (Connection refused).

so, i decided to randomly try the docker minikube driver.  voila!  i'm now able 
to happily mount persistent volumes.

however, when running the k8s integration test, everything passes *except* the 
PVs w/local storage.

from [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone:]
{code:java}
- PVs with local storage *** FAILED ***
 The code passed to eventually never returned normally. Attempted 179 times 
over 3.00242447046 minutes. Last failure message: container not found 
("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code}
i've never seen this error before, and apparently there aren't many things 

here's how we launch minikube and create the mount:
{code:java}
minikube --vm-driver=docker start --memory 6000 --cpus 8
minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} 
--9p-version=9p2000.L &
{code}
we're using ZFS on the bare metal, and minikube is complaining:
{code:java}
! docker is currently using the zfs storage driver, consider switching to 
overlay2 for better performance{code}
i'll continue to dig in to this today, but i'm currently blocked...


was (Author: shaneknapp):
alright, sometimes these things go smoothly, sometimes not.

this is firmly in the 'not' camp.

after upgrading minikube and k8s, i was unable to mount a persistent volume 
when using the kvm2 driver.  much debugging ensued.  no progress was made.

so, i decided to randomly try the docker minikube driver.  voila!  i'm now able 
to happily mount persistent volumes.

however, when running the k8s integration test, everything passes *except* the 
PVs w/local storage.

from [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone:]
{code:java}
- PVs with local storage *** FAILED ***
 The code passed to eventually never returned normally. Attempted 179 times 
over 3.00242447046 minutes. Last failure message: container not found 
("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code}
i've never seen this error before, and apparently there aren't many things 

here's how we launch minikube and create the mount:
{code:java}
minikube --vm-driver=docker start --memory 6000 --cpus 8
minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} 
--9p-version=9p2000.L &
{code}
we're using ZFS on the bare metal, and minikube is complaining:
{code:java}
! docker is currently using the zfs storage driver, consider switching to 
overlay2 for better performance{code}
i'll continue to dig in to this today, but i'm currently blocked...

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312517#comment-17312517
 ] 

Shane Knapp edited comment on SPARK-34738 at 3/31/21, 4:20 PM:
---

alright, sometimes these things go smoothly, sometimes not.

this is firmly in the 'not' camp.

after upgrading minikube and k8s, i was unable to mount a persistent volume 
when using the kvm2 driver.  much debugging ensued.  no progress was made.

so, i decided to randomly try the docker minikube driver.  voila!  i'm now able 
to happily mount persistent volumes.

however, when running the k8s integration test, everything passes *except* the 
PVs w/local storage.

from [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone:]
{code:java}
- PVs with local storage *** FAILED ***
 The code passed to eventually never returned normally. Attempted 179 times 
over 3.00242447046 minutes. Last failure message: container not found 
("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code}
i've never seen this error before, and apparently there aren't many things 

here's how we launch minikube and create the mount:
{code:java}
minikube --vm-driver=docker start --memory 6000 --cpus 8
minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} 
--9p-version=9p2000.L &
{code}
we're using ZFS on the bare metal, and minikube is complaining:
{code:java}
! docker is currently using the zfs storage driver, consider switching to 
overlay2 for better performance{code}
i'll continue to dig in to this today, but i'm currently blocked...


was (Author: shaneknapp):
alright, sometimes these things go smoothly, sometimes not.

this is firmly in the 'not' camp.

after upgrading minikube and k8s, i was unable to mount a persistent volume 
when using the kvm2 driver.  much debugging ensued.  no progress was made.

so, i decided to randomly try the docker minikube driver.  voila!  i'm now able 
to happily mount persistent volumes.

however, when running the k8s integration test, everything passes *except* the 
PVs w/local storage.

from https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone:
{code:java}
- PVs with local storage *** FAILED ***
 The code passed to eventually never returned normally. Attempted 179 times 
over 3.00242447046 minutes. Last failure message: container not found 
("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code}
i've never seen this error before, and apparently there aren't many things 

here's how we launch minikube and create the mount:

 
{code:java}
minikube --vm-driver=docker start --memory 6000 --cpus 8
minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} 
--9p-version=9p2000.L &
{code}
i'll continue to dig in to this today, but i'm currently blocked...

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312518#comment-17312518
 ] 

Shane Knapp commented on SPARK-34738:
-

[~skonto] any insights?

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312517#comment-17312517
 ] 

Shane Knapp commented on SPARK-34738:
-

alright, sometimes these things go smoothly, sometimes not.

this is firmly in the 'not' camp.

after upgrading minikube and k8s, i was unable to mount a persistent volume 
when using the kvm2 driver.  much debugging ensued.  no progress was made.

so, i decided to randomly try the docker minikube driver.  voila!  i'm now able 
to happily mount persistent volumes.

however, when running the k8s integration test, everything passes *except* the 
PVs w/local storage.

from https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone:
{code:java}
- PVs with local storage *** FAILED ***
 The code passed to eventually never returned normally. Attempted 179 times 
over 3.00242447046 minutes. Last failure message: container not found 
("spark-kubernetes-driver"). (PVTestsSuite.scala:117){code}
i've never seen this error before, and apparently there aren't many things 

here's how we launch minikube and create the mount:

 
{code:java}
minikube --vm-driver=docker start --memory 6000 --cpus 8
minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} 
--9p-version=9p2000.L &
{code}
i'll continue to dig in to this today, but i'm currently blocked...

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path

2021-03-31 Thread Brady Tello (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312504#comment-17312504
 ] 

Brady Tello commented on SPARK-34883:
-

Sure thing. This Scala code results in the JVM stack trace below. Looks like 
an issue with Hadoop's Globber class.
{code:java}
var csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
var tempDF = spark.read
  .option("sep", "\t")
  .option("multiLine", "True")
  .csv(csvFile)
{code}
{code:java}
IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: test:dir
Caused by: URISyntaxException: Relative path in absolute URI: test:dir
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.hadoop.fs.Path.<init>(Path.java:93)
at org.apache.hadoop.fs.Globber.glob(Globber.java:211)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1676)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.spark.input.StreamFileInputFormat.setMinPartitions(PortableDataStream.scala:51)
at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:51)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:303)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:303)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:307)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:303)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1435)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:419)
at org.apache.spark.rdd.RDD.take(RDD.scala:1429)
at org.apache.spark.sql.execution.datasources.csv.MultiLineCSVDataSource$.infer(CSVDataSource.scala:225)
at org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:70)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:62)
at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:230)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:223)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:460)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:408)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:390)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:390)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:912)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-8965899:6)
at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-8965899:50)
at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw$$iw$$iw.<init>(command-8965899:52)
at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw$$iw.<init>(command-8965899:54)
at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw$$iw.<init>(command-8965899:56)
at $lined238ff10de5c41758f3b1477d427b85e25.$read$$iw.<init>(command-8965899:58)
at $lined238ff10de5c41758f3b1477d427b85e25.$read.<init>(command-8965899:60)
at $lined238ff10de5c41758f3b1477d427b85e25.$read$.<init>(command-8965899:64)
at $lined238ff10de5c41758f3b1477d427b85e25.$read$.<clinit>(command-8965899)
at $lined238ff10de5c41758f3b1477d427b85e25.$eval$.$print$lzycompute(:7)
at $lined238ff10de5c41758f3b1477d427b85e25.$eval$.$print(:6)
at $lined238ff10de5c41758f3b1477d427b85e25.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 

[jira] [Commented] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs

2021-03-31 Thread Baohe Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312498#comment-17312498
 ] 

Baohe Zhang commented on SPARK-34779:
-

Thanks for pointing it out! I wasn't aware that task peak metrics would 
contribute to the executor peak metrics.

> ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat 
> occurs
> ---
>
> Key: SPARK-34779
> URL: https://issues.apache.org/jira/browse/SPARK-34779
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Priority: Major
>
> The current implementation of ExecutorMetricsPoller uses the task count in each 
> stage to decide whether to keep a stage entry or not. In the case where the 
> executor has only 1 core, it may have these issues:
>  # Peak metrics missing (due to stage entry being removed within a heartbeat 
> interval)
>  # Unnecessary and frequent hashmap entry removal and insertion.
> Assuming an executor with 1 core has 2 tasks (task1 and task2, both belonging to 
> stage (0, 0)) to execute in a heartbeat interval, the workflow in the current 
> ExecutorMetricsPoller implementation would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count incremented to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decremented to 0, stage (0, 0) entry removed, peak metrics lost
> 4. task2 start -> stage (0, 0) entry created in stageTCMP, task count incremented to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decremented to 0, stage (0, 0) entry removed, peak metrics lost
> 7. heartbeat() -> empty or inaccurate peak metrics for stage (0, 0) reported
> We can fix the issue by keeping entries with task count = 0 in the stageTCMP map 
> until a heartbeat occurs. At the heartbeat, after reporting the peak metrics 
> for each stage, we scan each stage in stageTCMP and remove entries with task 
> count = 0.
> After the fix, the workflow would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count incremented to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decremented to 0, but the entry (0, 0) still remains
> 4. task2 start -> task count of stage (0, 0) incremented to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decremented to 0, but the entry (0, 0) still remains
> 7. heartbeat() -> accurate peak metrics for stage (0, 0) reported; remove the 
> entry for stage (0, 0) from stageTCMP because its task count is 0.
>  
> How to verify the behavior? 
> Submit a job with a custom polling interval (e.g., 2s) and 
> spark.executor.cores=1 and check the debug logs of ExecutorMetricsPoller.
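
For illustration, here is a minimal Scala sketch of the bookkeeping proposed above. The names (StageBookkeepingSketch, onTaskStart, onTaskCompletion, onHeartbeat) are hypothetical and this is not the actual ExecutorMetricsPoller code; it only shows the idea of deferring entry removal to the heartbeat.

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong
import scala.collection.JavaConverters._

// Hypothetical sketch: (stageId, attemptId) -> running task count.
// Entries are pruned only at heartbeat time, after their peak metrics are reported.
class StageBookkeepingSketch {
  private val stageTCMP = new ConcurrentHashMap[(Int, Int), AtomicLong]()

  def onTaskStart(stage: (Int, Int)): Unit =
    stageTCMP.computeIfAbsent(stage, _ => new AtomicLong(0L)).incrementAndGet()

  def onTaskCompletion(stage: (Int, Int)): Unit =
    // Decrement only; do NOT remove the entry here, so poll() keeps updating its peaks.
    stageTCMP.get(stage).decrementAndGet()

  def onHeartbeat(reportPeaks: ((Int, Int)) => Unit): Unit =
    stageTCMP.entrySet().asScala.foreach { e =>
      reportPeaks(e.getKey)                                    // report peak metrics first
      if (e.getValue.get() == 0L) stageTCMP.remove(e.getKey)   // then drop idle stage entries
    }
}
{code}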



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34684) Hadoop config could not be successfully serialized from driver pods to executor pods

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34684.
--
Resolution: Not A Problem

> Hadoop config could not be successfully serialized from driver pods to 
> executor pods
> ---
>
> Key: SPARK-34684
> URL: https://issues.apache.org/jira/browse/SPARK-34684
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Yue Peng
>Priority: Major
>
> I have set HADOOP_CONF_DIR correctly, and I have verified that the Hadoop configs 
> have been stored into a configmap and mounted to the driver. However, the SparkPi 
> example job keeps failing because the executor does not know how to talk to HDFS. I 
> highly suspect that there is a bug causing it, as manually creating a 
> configmap storing the Hadoop configs and mounting it to the executor in the template file 
> fixes the error. 
>  
> Spark submit command:
> /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --deploy-mode cluster --master k8s://https://10.***.18.96:6443 
> --num-executors 1 --conf spark.kubernetes.namespace=test --conf 
> spark.kubernetes.container.image= --conf 
> spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template
>  --conf 
> spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template
>   --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars 
> hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000
>  
>  
> Error log:
>  
> 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 608 ms (392 ms spent in bootstraps)
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users with view permissions: Set(root); groups 
> with view permissions: Set(); users with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 130 ms (104 ms spent in bootstraps)
> 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d
> 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 
> MiB
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
> spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor:
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered 
> with driver
> 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192
> 21/03/10 07:00:00 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956.
> 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on 
> 100.64.0.192:37956
> 21/03/10 07:00:00 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 1
> 21/03/10 07:00:01 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 21/03/10 07:00:01 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 21/03/10 07:00:01 INFO Executor: Fetching 
> spark://org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078/jars/spark-examples_2.12-3.0.125067.jar
>  with timestamp 1615359587432
> 21/03/10 07:00:01 INFO TransportClientFactory: Successfully created 
> connection to 
> 

[jira] [Resolved] (SPARK-34751) Parquet with invalid chars on column name reads double as null when a clean schema is applied

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34751.
--
Resolution: Duplicate

> Parquet with invalid chars on column name reads double as null when a clean 
> schema is applied
> -
>
> Key: SPARK-34751
> URL: https://issues.apache.org/jira/browse/SPARK-34751
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.3, 3.1.1
> Environment: Pyspark 2.4.3
> AWS Glue Dev Endpoint EMR
>Reporter: Nivas Umapathy
>Priority: Major
> Attachments: invalid_columns_double.parquet
>
>
> I have a parquet file that has data with invalid column names on it. 
> [#Reference](https://issues.apache.org/jira/browse/SPARK-27442)  Here is the 
> file attached with this ticket.
> I tried to load this file with 
> {{df = glue_context.read.parquet('invalid_columns_double.parquet')}}
> {{df = df.withColumnRenamed('COL 1', 'COL_1')}}
> {{df = df.withColumnRenamed('COL,2', 'COL_2')}}
> {{df = df.withColumnRenamed('COL;3', 'COL_3') }}
> and so on.
> Now if i call
> {{df.show()}}
> it throws this exception that is still pointing to the old column name.
>  {{pyspark.sql.utils.AnalysisException: 'Attribute name "COL 1" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;'}}
>  
> When I read about it in some blogs, there was a suggestion to re-read the same 
> parquet with the new schema applied. So I did 
> {{df = glue_context.read.schema(df.schema).parquet('invalid_columns_double.parquet')}}
>  
> and it works, but all the data in the dataframe is null. The same approach 
> works for String datatypes.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34865) cannot resolve methods when & col in spark java

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34865.
--
Resolution: Duplicate

> cannot resolve methods when & col in spark java
> ---
>
> Key: SPARK-34865
> URL: https://issues.apache.org/jira/browse/SPARK-34865
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.1.1
>Reporter: unical1988
>Priority: Major
>
> Hello,
>  
> I am using Spark 3.1.1 with Java and the method when of Column and also col, 
> but when I compile it says cannot resolve method when and cannot resolve method 
> col.
>  
> What is the matter?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34864) cannot resolve methods when & col in spark java

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34864.
--
Resolution: Invalid

> cannot resolve methods when & col in spark java
> ---
>
> Key: SPARK-34864
> URL: https://issues.apache.org/jira/browse/SPARK-34864
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.1.1
>Reporter: unical1988
>Priority: Major
>
> Hello,
>  
> I am using Spark 3.1.1 with Java and the method when of Column and also col, 
> but when I compile it says cannot resolve method when and cannot resolve method 
> col.
>  
> What is the matter?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34866) cannot resolve method when of Column() spark java

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34866.
--
Resolution: Invalid

> cannot resolve method when of Column() spark java
> -
>
> Key: SPARK-34866
> URL: https://issues.apache.org/jira/browse/SPARK-34866
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.1.1
>Reporter: unical1988
>Priority: Major
>
> Cannot resolve method when of Column in Spark Java, and also method col. What's 
> the problem?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34867) cannot resolve method when of Column() spark java

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34867.
--
Resolution: Duplicate

> cannot resolve method when of Column() spark java
> -
>
> Key: SPARK-34867
> URL: https://issues.apache.org/jira/browse/SPARK-34867
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.1.1
>Reporter: unical1988
>Priority: Major
>
> The error says cannot resolve method main (of Column()) and col in Spark 3.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path

2021-03-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312473#comment-17312473
 ] 

Sean R. Owen commented on SPARK-34883:
--

I'm guessing that somehow a different code path takes over for multiline 
parsing with univocity, but I'm not sure. Do you have the JVM-side stack trace?

> Setting CSV reader option "multiLine" to "true" causes URISyntaxException 
> when colon is in file path
> 
>
> Key: SPARK-34883
> URL: https://issues.apache.org/jira/browse/SPARK-34883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.1
>Reporter: Brady Tello
>Priority: Major
>
> Setting the CSV reader's "multiLine" option to "True" throws the following 
> exception when a ':' character is in the file path.
>  
> {code:java}
> java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}
> I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error 
> whether I use Scala, Python, or SQL.
> The following code works fine:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> tempDF = (spark.read.option("sep", "\t").csv(csvFile))
> {code}
> While the following code fails:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = (spark.read.option("sep", "\t").option("multiLine", "True").csv(csvFile))
> {code}
> Full Stack Trace from Python:
>  
> {code:java}
> --- 
> IllegalArgumentException Traceback (most recent call last)  
> in  
> 3 csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> 4 
> > 5  tempDF = (spark.read.option("sep", "\t").option("multiLine", "True") 
> /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, 
> sep, encoding, quote, escape, comment, header, inferSchema, 
> ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, 
> positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, 
> maxCharsPerColumn, maxMalformedLogPerPartition, mode, 
> columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, 
> samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, 
> recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling) 
> 735 path = [path] 
> 736 if type(path) == list: 
> --> 737 return 
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) 
> 738 elif isinstance(path, RDD): 
> 739 def func(iterator): 
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 
> 1302 
> 1303 answer = self.gateway_client.send_command(command) 
> -> 1304 return_value = get_return_value( 
> 1305 answer, self.gateway_client, self.target_id, self.name) 
> 1306 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 
> 114 # Hide where the exception came from that shows a non-Pythonic 
> 115 # JVM exception message. 
> --> 116 raise converted from None 
> 117 else: 
> 118 raise IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:dir
> {code}
>  
>  
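
As a side note on the root cause, the URISyntaxException above can be reproduced with Hadoop's Path alone; a small illustrative Scala snippet (the path value is made up):

{code:scala}
import org.apache.hadoop.fs.Path

// Hadoop parses "test:dir" as a URI whose scheme is "test", so the remaining
// relative path "dir" is rejected. Globbing over a directory whose name contains
// a colon appears to hit the same check, which matches the error quoted above.
new Path("test:dir")
// java.lang.IllegalArgumentException: java.net.URISyntaxException:
//   Relative path in absolute URI: test:dir
{code}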



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala

2021-03-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312471#comment-17312471
 ] 

Sean R. Owen commented on SPARK-34904:
--

[~langustav] I don't see any code in the org.lz4 namespace in the project. What 
am I missing?
https://github.com/lz4/lz4-java/tree/master/src/java

> Old import of LZ4 package inside CompressionCodec.scala 
> 
>
> Key: SPARK-34904
> URL: https://issues.apache.org/jira/browse/SPARK-34904
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.1.1
>Reporter: Michal Zeman
>Priority: Minor
>
> This commit should upgrade the version of the LZ4 package: 
> [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f]
>  
> The dependency was changed. However, inside the file [CompressionCodec.scala 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala]
>  the old import referencing net.jpountz.lz4 (where versions up to 1.3 are 
> held) remains. 
> Probably because of backward compatibility, the newer version of the org.lz4 
> package still contains net.jpountz.lz4 
> ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. 
> Therefore this import does not cause problems at first sight. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34904) Old import of LZ4 package inside CompressionCodec.scala

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34904:
-
Priority: Minor  (was: Major)

> Old import of LZ4 package inside CompressionCodec.scala 
> 
>
> Key: SPARK-34904
> URL: https://issues.apache.org/jira/browse/SPARK-34904
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.1.1
>Reporter: Michal Zeman
>Priority: Minor
>
> This commit should upgrade the version of the LZ4 package: 
> [https://github.com/apache/spark/commit/b78cf13bf05f0eadd7ae97df84b6e1505dc5ff9f]
>  
> The dependency was changed. However, inside the file [CompressionCodec.scala 
> |https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala]
>  the old import referencing net.jpountz.lz4 (where versions up to 1.3 are 
> held) remains. 
> Probably because of backward compatibility, the newer version of the org.lz4 
> package still contains net.jpountz.lz4 
> ([https://github.com/lz4/lz4-java/tree/master/src/java/net/jpountz/lz4)]. 
> Therefore this import does not cause problems at first sight. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34910) JDBC - Add an option for different stride orders

2021-03-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312466#comment-17312466
 ] 

Sean R. Owen commented on SPARK-34910:
--

Out of curiosity, when is that advantageous?

> JDBC - Add an option for different stride orders
> 
>
> Key: SPARK-34910
> URL: https://issues.apache.org/jira/browse/SPARK-34910
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Jason Yarbrough
>Priority: Trivial
>
> Currently, the JDBCRelation columnPartition function orders the strides in 
> ascending order, starting from the lower bound and working its way towards 
> the upper bound.
> I'm proposing leaving that as the default, but adding an option (such as 
> strideOrder) in JDBCOptions. Since it will default to the current behavior, 
> this will keep people's current code working as expected. However, people who 
> may have data skew closer to the upper bound might appreciate being able to 
> have the strides in descending order, thus filling up the first partition 
> with the last stride and so forth. Also, people with nondeterministic data 
> skew or sporadic data density might be able to benefit from a random ordering 
> of the strides.
> I have the code created to implement this, and it creates a pattern that can 
> be used to add other algorithms that people may want to add (such as counting 
> the rows and ranking each stride, and then ordering from most dense to 
> least). The current two options I have coded are 'descending' and 'random'.
> The original idea was to create something closer to Spark's hash partitioner, 
> but for JDBC and pushed down to the database engine for efficiency. However, 
> that would require adding hashing algorithms for each dialect, and the 
> performance cost of those algorithms may outweigh the benefit. The method I'm 
> proposing in this ticket avoids those complexities while still giving some of 
> the benefit (in the case of random ordering).
> I'll put a PR in if others feel this is a good idea.
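
For illustration, a sketch of how such an option could reorder the stride predicates that columnPartition computes today; the option name strideOrder and the helper reorderStrides are hypothetical, not existing Spark API:

{code:scala}
import scala.util.Random

// Hypothetical post-processing of the WHERE-clause strides that
// JDBCRelation.columnPartition currently emits in ascending order.
def reorderStrides(strides: Seq[String], strideOrder: String): Seq[String] =
  strideOrder match {
    case "descending" => strides.reverse          // first partitions get the upper-bound strides
    case "random"     => Random.shuffle(strides)  // spread nondeterministic or sporadic skew
    case _            => strides                  // default: keep the current ascending order
  }

// e.g. reorderStrides(Seq("id < 100", "id >= 100 AND id < 200", "id >= 200"), "descending")
{code}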



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34912) Start spark shell to read file and report an error

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34912.
--
Resolution: Invalid

Very old Spark + something wrong with your config of AWS libs

> Start spark shell to read file and report an error
> --
>
> Key: SPARK-34912
> URL: https://issues.apache.org/jira/browse/SPARK-34912
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: czh
>Priority: Major
>
> Start spark shell to read external file and report an error
>  
> Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
>  
> Spark version is 1.6 and Hadoop version is 2.6.0; I have copied aws-java-sdk-1.7.4.jar 
> and hadoop-aws-2.6.0.jar to the lib directory under Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34913) Start spark shell to read file and report an error

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34913.
--
Resolution: Duplicate

> Start spark shell to read file and report an error
> --
>
> Key: SPARK-34913
> URL: https://issues.apache.org/jira/browse/SPARK-34913
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.0
>Reporter: czh
>Priority: Major
> Attachments: 微信截图_20210331110125.png
>
>
> Start spark shell to read external file and report an error
> Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
> Spark version is 1.6 and Hadoop version is 2.6.0; I have copied aws-java-sdk-1.7.4.jar 
> and hadoop-aws-2.6.0.jar to the lib directory under Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34921) Error in spark accessing file

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34921.
--
Resolution: Invalid

Not nearly enough info

> Error in spark accessing file
> -
>
> Key: SPARK-34921
> URL: https://issues.apache.org/jira/browse/SPARK-34921
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: czh
>Priority: Major
> Attachments: 微信截图_20210331183234.png
>
>
> Error in Spark accessing file: java.lang.NumberFormatException: For input 
> string: "100M"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34510) .foreachPartition command hangs when ran inside Python package but works when ran from Python file outside the package on EMR

2021-03-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312460#comment-17312460
 ] 

Sean R. Owen commented on SPARK-34510:
--

Got it. Without more detail or errors, or a repro against OSS Spark, hard to 
say or do much here. It does seem like something to do with the structure of 
the code or its execution, not Spark.

> .foreachPartition command hangs when ran inside Python package but works when 
> ran from Python file outside the package on EMR
> -
>
> Key: SPARK-34510
> URL: https://issues.apache.org/jira/browse/SPARK-34510
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, PySpark
>Affects Versions: 3.0.0
>Reporter: Yuriy
>Priority: Minor
> Attachments: Code.zip
>
>
> I'm running on EMR Pyspark 3.0.0. with project structure below, process.py is 
> what controls the flow of the application and calls code inside the 
> _file_processor_ package. The command hangs when the .foreachPartition code 
> that is located inside _s3_repo.py_ is called by _process.py_. When the same 
> .foreachPartition code is moved from _s3_repo.py_ and placed inside the 
> _process.py_ it runs just fine.
> {code:java}
> process.py
> file_processor
>   config
> spark.py
>   repository
> s3_repo.py
>   structure
> table_creator.py
> {code}
> *process.py*
> {code:java}
> from file_processor.structure import table_creator
> from file_processor.repository import s3_repo
> def process():
> table_creator.create_table()
> s3_repo.save_to_s3()
> if __name__ == '__main__':
> process()
> {code}
> *spark.py*
> {code:java}
> from pyspark.sql import SparkSession
> spark_session = SparkSession.builder.appName("Test").getOrCreate()
> {code}
> *s3_repo.py* 
> {code:java}
> from file_processor.config.spark import spark_session
> def save_to_s3():
> spark_session.sql('SELECT * FROM 
> rawFileData').toJSON().foreachPartition(_save_to_s3)
> def _save_to_s3(iterator):   
> for record in iterator:
> print(record)
> {code}
>  *table_creator.py*
> {code:java}
> from file_processor.config.spark import spark_session
> from pyspark.sql import Row
> def create_table():
> file_contents = [
> {'line_num': 1, 'contents': 'line 1'},
> {'line_num': 2, 'contents': 'line 2'},
> {'line_num': 3, 'contents': 'line 3'}
> ]
> spark_session.createDataFrame(Row(**row) for row in 
> file_contents).cache().createOrReplaceTempView("rawFileData")
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34257) Improve performance for last_value over unbounded window frame

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34257.
--
Resolution: Won't Fix

> Improve performance for last_value over unbounded window frame
> --
>
> Key: SPARK-34257
> URL: https://issues.apache.org/jira/browse/SPARK-34257
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of `last_value` over an unbounded window frame will 
> execute `updateExpressions` multiple times.
> In fact, `last_value` only needs to execute `updateExpressions` once, on the last row.
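
For context, a query that exercises this frame (illustrative; assumes a SparkSession named spark):

{code:scala}
// Every row receives the value of the last row in the frame, so conceptually the
// aggregate only needs a single update on that last row.
spark.sql("""
  SELECT id,
         last_value(id) OVER (
           ORDER BY id
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_id
  FROM range(5)
""").show()
{code}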



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34806) Helper class for batch Dataset.observe()

2021-03-31 Thread Enrico Minack (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312403#comment-17312403
 ] 

Enrico Minack commented on SPARK-34806:
---

The interaction with {{Observation}} can be split into three steps:
 # create an {{Observation}} instance
 # observe a {{Dataset}}
 # retrieve the metrics

Each of these could be designed in different ways:

*1. create an {{Observation}} instance*
{code:scala}
 // without column expressions
 val observation = Observation("name")
 val observation = spark.observation("name")

// with column expressions
 val observation = Observation("name", count(lit(1)), sum($"id"), mean($"id"))
 val observation = spark.observation("name", avg($"id").cast("int").as("avg_val"))
{code}
*2. observe a {{Dataset}}*
{code:scala}
 val observed = df.observe(observation)
 val observed = df.observe(observation, count(lit(1)), sum($"id"), mean($"id"))
 val observed = observation.on(ds)
 val observed = ds.transform(observation.on)
{code}
*3. retrieve the metrics*
{code:scala}
 // ways to retrieve the metrics
 val metrics: Row = observation.get
{code}
So we have these design decisions:
- Are column expressions constant to an observation?
-- I would prefer this: increases immutability of the observation instance
- Create an {{Observation}} through a session method or by simply instantiating 
class {{Observation}}?
-- I would prefer this: created instance can register right away with the 
session listeners, not on "2. observe a {{Dataset}}"
- Extend the {{Dataset}} API by overloading `observe`?
-- I would prefer: {{df.observe(observation)}} as it mimics existing 
{{df.observe(name, col, cols)}}, we could still have 
{{Observation.on(Dataset)}} and users can pick their favourite
- Move the logic that rejects stream datasets from `Dataset.observe` to 
{{Observation.on}}. As it is not the {{Dataset}} that rejects this construct, 
it is the {{Observation}}.
- Make {{Observation}} thread-safe?
-- I'd argue, the user of {{Observation}} has to make it thread safe by calling 
the actions on the observed {{Dataset}} in the same thread as calling 
{{Observation.get}}, or synchronizing the threads that do these bits. I would 
not expect it to be a general use case to have these two in different threads.
- Make the observation name optional
-- What is the use of giving it a name? When optional, it could generate a random 
UUID name internally.

> Helper class for batch Dataset.observe()
> 
>
> Key: SPARK-34806
> URL: https://issues.apache.org/jira/browse/SPARK-34806
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Enrico Minack
>Priority: Minor
>
> The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
> allows to collect aggregate metrics over data of a Dataset while they are 
> being processed during an action.
> These metrics are collected in a separate thread after registering 
> {{QueryExecutionListener}} for batch datasets and {{StreamingQueryListener}} 
> for stream datasets, respectively. While in a streaming context it makes 
> perfect sense to process incremental metrics in an event-based fashion, for 
> simple batch dataset processing, a single result should be retrievable 
> without the need to register listeners or handling threading.
> Introducing an {{Observation}} helper class can hide that complexity for 
> simple use-cases in batch processing.
> Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
> {{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method 
> to create a new {{Observation}} instance and register it with the session.
> Alternatively, an {{Observation}} instance could be instantiated on its own 
> which on calling {{Observation.on(Dataset)}} registers with 
> {{Dataset.sparkSession}}. This "registration" registers a listener with the 
> session that retrieves the metrics.
> The {{Observation}} class provides methods to retrieve the metrics. This 
> retrieval has to wait for the listener to be called in a separate thread. So 
> all methods will wait for this, optionally with a timeout:
>  - {{Observation.get}} waits without timeout and returns the metric.
>  - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
> metric as an {{Option}}, or {{None}} when the timeout occurs.
>  - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
> indicates timeout by returning {{false}}.
> Obviously, an action has to be called on the observed dataset before any of 
> these methods are called, otherwise a timeout will occur.
> With {{Observation.reset}}, another action can be observed. Finally, 
> {{Observation.close}} unregisters the listener from the session.
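
Putting the pieces together, a usage sketch of the proposed batch flow; none of these methods exist in Spark yet, the names follow the options discussed in this ticket, and df stands for an arbitrary batch Dataset:

{code:scala}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._  // for $"id"; assumes a SparkSession named spark

// Proposed API, not an existing one.
val observation = Observation("stats", count(lit(1)).as("rows"), sum($"id").as("sum_id"))
val observed    = df.observe(observation)   // registers the listener with df.sparkSession
observed.write.parquet("/tmp/out")          // any action completes the observation
val metrics: Row = observation.get          // blocks until the listener has delivered metrics
val rowCount = metrics.getAs[Long]("rows")
{code}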



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (SPARK-34821) Set up a workflow for developers to run benchmark in their fork

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34821:


Assignee: (was: Apache Spark)

> Set up a workflow for developers to run benchmark in their fork
> ---
>
> Key: SPARK-34821
> URL: https://issues.apache.org/jira/browse/SPARK-34821
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Please refer to the discussion in https://github.com/apache/spark/pull/31917.
> Currently we don't have a pinned environment to run the benchmark, and it's 
> difficult to track and follow the performance differences.
> It would be great if we had an easy way to get the benchmark results like 
> "Running tests in your forked repository using GitHub Actions" in 
> https://spark.apache.org/developer-tools.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34821) Set up a workflow for developers to run benchmark in their fork

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312374#comment-17312374
 ] 

Apache Spark commented on SPARK-34821:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32015

> Set up a workflow for developers to run benchmark in their fork
> ---
>
> Key: SPARK-34821
> URL: https://issues.apache.org/jira/browse/SPARK-34821
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Please refer to the discussion in https://github.com/apache/spark/pull/31917.
> Currently we don't have a pinned environment to run the benchmark, and it's 
> difficult to track and follow the performance differences.
> It would be great if we had an easy way to get the benchmark results like 
> "Running tests in your forked repository using GitHub Actions" in 
> https://spark.apache.org/developer-tools.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34821) Set up a workflow for developers to run benchmark in their fork

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34821:


Assignee: Apache Spark

> Set up a workflow for developers to run benchmark in their fork
> ---
>
> Key: SPARK-34821
> URL: https://issues.apache.org/jira/browse/SPARK-34821
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Please refer to the discussion in https://github.com/apache/spark/pull/31917.
> Currently we don't have a pinned environment to run the benchmark, and it's 
> difficult to track and follow the performance differences.
> It would be great if we had an easy way to get the benchmark results like 
> "Running tests in your forked repository using GitHub Actions" in 
> https://spark.apache.org/developer-tools.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34881) New SQL Function: TRY_CAST

2021-03-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-34881.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31982
[https://github.com/apache/spark/pull/31982]

> New SQL Function: TRY_CAST
> --
>
> Key: SPARK-34881
> URL: https://issues.apache.org/jira/browse/SPARK-34881
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> Add a new SQL function try_cast. try_cast is identical to CAST with 
> `spark.sql.ansi.enabled` as true, except it returns NULL instead of raising 
> an error. This expression has one major difference from `cast` with 
> `spark.sql.ansi.enabled` as true: when the source value can't be stored in 
> the target integral(Byte/Short/Int/Long) type, `try_cast` returns null 
> instead of returning the low order bytes of the source value.
> This is learned from Google BigQuery and Snowflake:
> https://docs.snowflake.com/en/sql-reference/functions/try_cast.html
> https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#safe_casting
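
A minimal illustration of the semantics described above, assuming a SparkSession named spark (expected results shown as comments):

{code:scala}
// try_cast returns NULL on a failed conversion instead of raising an error.
spark.sql("SELECT try_cast('123' AS INT) AS ok, try_cast('abc' AS INT) AS bad").show()
// ok  = 123
// bad = null   (CAST with spark.sql.ansi.enabled=true would throw here instead)
{code}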



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34922) Use better CBO cost function

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34922:


Assignee: Apache Spark

> Use better CBO cost function
> 
>
> Key: SPARK-34922
> URL: https://issues.apache.org/jira/browse/SPARK-34922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-33935 we changed the CBO cost function such that it would be 
> symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could 
> have been true.
> That change introduced a performance regression in some queries. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34922) Use better CBO cost function

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312316#comment-17312316
 ] 

Apache Spark commented on SPARK-34922:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32014

> Use better CBO cost function
> 
>
> Key: SPARK-34922
> URL: https://issues.apache.org/jira/browse/SPARK-34922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> In SPARK-33935 we changed the CBO cost function such that it would be 
> symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could 
> have been true.
> That change introduced a performance regression in some queries. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34922) Use better CBO cost function

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34922:


Assignee: (was: Apache Spark)

> Use better CBO cost function
> 
>
> Key: SPARK-34922
> URL: https://issues.apache.org/jira/browse/SPARK-34922
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> In SPARK-33935 we changed the CBO cost function such that it would be 
> symmetric - A.betterThan(B) implies that !B.betterThan(A). Before, both could 
> have been true.
> That change introduced a performance regression in some queries. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34922) Use better CBO cost function

2021-03-31 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-34922:
--

 Summary: Use better CBO cost function
 Key: SPARK-34922
 URL: https://issues.apache.org/jira/browse/SPARK-34922
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Tanel Kiis


In SPARK-33935 we changed the CBO cost function such that it would be symmetric 
- A.betterThan(B) implies that !B.betterThan(A). Before, both could have been 
true.

That change introduced a performance regression in some queries. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them

2021-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34915:
-
Fix Version/s: 3.0.3
   3.1.2

> Cache Maven, SBT and Scala in all jobs that use them
> 
>
> Key: SPARK-34915
> URL: https://issues.apache.org/jira/browse/SPARK-34915
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> We should cache SBT, Maven and Scala for all jobs that use them. This is 
> currently missing in some jobs such as 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34921) Error in spark accessing file

2021-03-31 Thread czh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

czh updated SPARK-34921:

Attachment: 微信截图_20210331183234.png

> Error in spark accessing file
> -
>
> Key: SPARK-34921
> URL: https://issues.apache.org/jira/browse/SPARK-34921
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: czh
>Priority: Major
> Attachments: 微信截图_20210331183234.png
>
>
> Error in Spark accessing file: java.lang.NumberFormatException: For input 
> string: "100M"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34921) Error in spark accessing file

2021-03-31 Thread czh (Jira)
czh created SPARK-34921:
---

 Summary: Error in spark accessing file
 Key: SPARK-34921
 URL: https://issues.apache.org/jira/browse/SPARK-34921
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 3.0.0
Reporter: czh
 Attachments: 微信截图_20210331183234.png

Error in Spark accessing file: java.lang.NumberFormatException: For input 
string: "100M"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312258#comment-17312258
 ] 

Apache Spark commented on SPARK-34920:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/32013

> Introduce SQLSTATE and ERRORCODE to SQL Exception
> -
>
> Key: SPARK-34920
> URL: https://issues.apache.org/jira/browse/SPARK-34920
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> SQLSTATE is SQL standard state. Please see



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34920:


Assignee: Apache Spark

> Introduce SQLSTATE and ERRORCODE to SQL Exception
> -
>
> Key: SPARK-34920
> URL: https://issues.apache.org/jira/browse/SPARK-34920
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> SQLSTATE is SQL standard state. Please see



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34920:


Assignee: (was: Apache Spark)

> Introduce SQLSTATE and ERRORCODE to SQL Exception
> -
>
> Key: SPARK-34920
> URL: https://issues.apache.org/jira/browse/SPARK-34920
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> SQLSTATE is SQL standard state. Please see



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception

2021-03-31 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34920:

Description: 
SQLSTATE is SQL standard state. Please see


  was:
SQLSTATE is SQL standard state



> Introduce SQLSTATE and ERRORCODE to SQL Exception
> -
>
> Key: SPARK-34920
> URL: https://issues.apache.org/jira/browse/SPARK-34920
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> SQLSTATE is SQL standard state. Please see



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34920) Introduce SQLSTATE and ERRORCODE to SQL Exception

2021-03-31 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34920:
---

 Summary: Introduce SQLSTATE and ERRORCODE to SQL Exception
 Key: SPARK-34920
 URL: https://issues.apache.org/jira/browse/SPARK-34920
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


SQLSTATE is SQL standard state




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them

2021-03-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-34915.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32011
[https://github.com/apache/spark/pull/32011]

> Cache Maven, SBT and Scala in all jobs that use them
> 
>
> Key: SPARK-34915
> URL: https://issues.apache.org/jira/browse/SPARK-34915
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.2.0
>
>
> We should cache SBT, Maven and Scala for all jobs that use them. This is 
> currently missing in some jobs such as 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34915) Cache Maven, SBT and Scala in all jobs that use them

2021-03-31 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-34915:
--

Assignee: Hyukjin Kwon

> Cache Maven, SBT and Scala in all jobs that use them
> 
>
> Key: SPARK-34915
> URL: https://issues.apache.org/jira/browse/SPARK-34915
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> We should cache SBT, Maven and Scala for all jobs that use them. This is 
> currently missing in some jobs such as 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L411-L430



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312249#comment-17312249
 ] 

Apache Spark commented on SPARK-34919:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32012

> Change partitioning to SinglePartition if partition number is 1
> ---
>
> Key: SPARK-34919
> URL: https://issues.apache.org/jira/browse/SPARK-34919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> For node `Repartition` and `RepartitionByExpression`, if partition number is 
> 1 we can use `SinglePartition` instead of other `Partitioning`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1

2021-03-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17312250#comment-17312250
 ] 

Apache Spark commented on SPARK-34919:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/32012

> Change partitioning to SinglePartition if partition number is 1
> ---
>
> Key: SPARK-34919
> URL: https://issues.apache.org/jira/browse/SPARK-34919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> For node `Repartition` and `RepartitionByExpression`, if partition number is 
> 1 we can use `SinglePartition` instead of other `Partitioning`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34919:


Assignee: (was: Apache Spark)

> Change partitioning to SinglePartition if partition number is 1
> ---
>
> Key: SPARK-34919
> URL: https://issues.apache.org/jira/browse/SPARK-34919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> For node `Repartition` and `RepartitionByExpression`, if partition number is 
> 1 we can use `SinglePartition` instead of other `Partitioning`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1

2021-03-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34919:


Assignee: Apache Spark

> Change partitioning to SinglePartition if partition number is 1
> ---
>
> Key: SPARK-34919
> URL: https://issues.apache.org/jira/browse/SPARK-34919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> For node `Repartition` and `RepartitionByExpression`, if partition number is 
> 1 we can use `SinglePartition` instead of other `Partitioning`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34919) Change partitioning to SinglePartition if partition number is 1

2021-03-31 Thread ulysses you (Jira)
ulysses you created SPARK-34919:
---

 Summary: Change partitioning to SinglePartition if partition 
number is 1
 Key: SPARK-34919
 URL: https://issues.apache.org/jira/browse/SPARK-34919
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: ulysses you


For the `Repartition` and `RepartitionByExpression` nodes, if the partition number is 1, 
we can use `SinglePartition` instead of another `Partitioning`.
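
A minimal PySpark sketch of the behavior this targets (an illustration only, not the patch itself; the exact physical plans printed by `explain()` depend on the Spark version):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.range(100)

# Repartition to a single partition: today the output partitioning is still a
# generic Partitioning whose numPartitions happens to be 1; reporting it as
# SinglePartition lets the planner see that downstream single-partition
# requirements are already satisfied, so it can avoid an extra exchange.
df.repartition(1).explain()

# RepartitionByExpression with numPartitions = 1: hashpartitioning(id, 1)
# carries no more information than SinglePartition, so the same substitution applies.
df.repartition(1, "id").explain()
{code}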




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34771) Support UDT for Pandas/Spark convertion with Arrow support Enabled

2021-03-31 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34771:
---
Summary: Support UDT for Pandas/Spark convertion with Arrow support Enabled 
 (was: Support UDT for Pandas/Spark convertion with Arrow Enabled)

> Support UDT for Pandas/Spark convertion with Arrow support Enabled
> --
>
> Key: SPARK-34771
> URL: https://issues.apache.org/jira/browse/SPARK-34771
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Darcy Shen
>Priority: Major
>
> {code:python}
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> from pyspark.testing.sqlutils  import ExamplePoint
> import pandas as pd
> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
> 2)])})
> df = spark.createDataFrame(pdf)
> df.toPandas()
> {code}
> With `spark.sql.execution.arrow.enabled` = false, the above snippet works 
> fine without warnings.
> With `spark.sql.execution.arrow.enabled` = true, the above snippet still works, 
> but with warnings: because the UDT is an unsupported type in the conversion, the 
> Arrow optimization is actually turned off. 
> Detailed steps to reproduce:
> {code:python}
> $ bin/pyspark
> Python 3.8.8 (default, Feb 24 2021, 13:46:16)
> [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>       /_/
> Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
> Spark context Web UI available at http://172.30.0.226:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1615994008526).
> SparkSession available as 'spark'.
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> 21/03/17 23:13:31 WARN SQLConf: The SQL config 
> 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may 
> be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' 
> instead of it.
> >>> from pyspark.testing.sqlutils  import ExamplePoint
> >>> import pandas as pd
> >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), 
> >>> ExamplePoint(2, 2)])})
> >>> df = spark.createDataFrame(pdf)
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Could not convert (1,1) with type ExamplePoint: did not recognize Python 
> value type when inferring an Arrow data type
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> >>>
> >>> df.show()
> +----------+
> |     point|
> +----------+
> |(0.0, 0.0)|
> |(0.0, 0.0)|
> +----------+
> >>> df.schema
> StructType(List(StructField(point,ExamplePointUDT,true)))
> >>> df.toPandas()
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Unsupported type in conversion to Arrow: ExamplePointUDT
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
>        point
> 0  (0.0,0.0)
> 1  (0.0,0.0)
> {code}
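
Until UDTs are handled by the Arrow conversion path, here is a minimal sketch of how to make the silent fallback visible instead of relying on the warning text (an illustration that uses only the configs quoted in the log above; the exact exception message will vary):

{code:python}
from pyspark.sql import SparkSession
from pyspark.testing.sqlutils import ExamplePoint
import pandas as pd

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Use the non-deprecated config names mentioned in the WARN message above.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# With the fallback disabled, an unsupported type such as ExamplePointUDT should
# raise an error instead of silently taking the slower non-Arrow path.
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")

pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])})
try:
    df = spark.createDataFrame(pdf)   # expected to fail fast while UDTs are unsupported
except Exception as err:
    print("Arrow optimization was not applied:", err)
    # Re-enable the fallback (the default) to proceed with the non-Arrow
    # conversion, as in the reproduction above.
    spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
    df = spark.createDataFrame(pdf)
{code}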



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34771) Support UDT for Pandas/Spark conversion with Arrow support Enabled

2021-03-31 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34771:
---
Summary: Support UDT for Pandas/Spark conversion with Arrow support Enabled 
 (was: Support UDT for Pandas/Spark convertion with Arrow support Enabled)

> Support UDT for Pandas/Spark conversion with Arrow support Enabled
> --
>
> Key: SPARK-34771
> URL: https://issues.apache.org/jira/browse/SPARK-34771
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Darcy Shen
>Priority: Major
>
> {code:python}
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> from pyspark.testing.sqlutils  import ExamplePoint
> import pandas as pd
> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
> 2)])})
> df = spark.createDataFrame(pdf)
> df.toPandas()
> {code}
> With `spark.sql.execution.arrow.enabled` = false, the above snippet works 
> fine without warnings.
> With `spark.sql.execution.arrow.enabled` = true, the above snippet still works, 
> but with warnings: because the UDT is an unsupported type in the conversion, the 
> Arrow optimization is actually turned off. 
> Detailed steps to reproduce:
> {code:python}
> $ bin/pyspark
> Python 3.8.8 (default, Feb 24 2021, 13:46:16)
> [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>       /_/
> Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
> Spark context Web UI available at http://172.30.0.226:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1615994008526).
> SparkSession available as 'spark'.
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> 21/03/17 23:13:31 WARN SQLConf: The SQL config 
> 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may 
> be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' 
> instead of it.
> >>> from pyspark.testing.sqlutils  import ExamplePoint
> >>> import pandas as pd
> >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), 
> >>> ExamplePoint(2, 2)])})
> >>> df = spark.createDataFrame(pdf)
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Could not convert (1,1) with type ExamplePoint: did not recognize Python 
> value type when inferring an Arrow data type
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> >>>
> >>> df.show()
> +----------+
> |     point|
> +----------+
> |(0.0, 0.0)|
> |(0.0, 0.0)|
> +----------+
> >>> df.schema
> StructType(List(StructField(point,ExamplePointUDT,true)))
> >>> df.toPandas()
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Unsupported type in conversion to Arrow: ExamplePointUDT
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
>        point
> 0  (0.0,0.0)
> 1  (0.0,0.0)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34771) Support UDT for Pandas/Spark convertion with Arrow Enabled

2021-03-31 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34771:
---
Summary: Support UDT for Pandas/Spark convertion with Arrow Enabled  (was: 
Support UDT for Pandas/Spark convertion with Arrow Optimization)

> Support UDT for Pandas/Spark convertion with Arrow Enabled
> --
>
> Key: SPARK-34771
> URL: https://issues.apache.org/jira/browse/SPARK-34771
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Darcy Shen
>Priority: Major
>
> {code:python}
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> from pyspark.testing.sqlutils  import ExamplePoint
> import pandas as pd
> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
> 2)])})
> df = spark.createDataFrame(pdf)
> df.toPandas()
> {code}
> With `spark.sql.execution.arrow.enabled` = false, the above snippet works 
> fine without warnings.
> With `spark.sql.execution.arrow.enabled` = true, the above snippet still works, 
> but with warnings: because the UDT is an unsupported type in the conversion, the 
> Arrow optimization is actually turned off. 
> Detailed steps to reproduce:
> {code:python}
> $ bin/pyspark
> Python 3.8.8 (default, Feb 24 2021, 13:46:16)
> [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>       /_/
> Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
> Spark context Web UI available at http://172.30.0.226:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1615994008526).
> SparkSession available as 'spark'.
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> 21/03/17 23:13:31 WARN SQLConf: The SQL config 
> 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may 
> be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' 
> instead of it.
> >>> from pyspark.testing.sqlutils  import ExamplePoint
> >>> import pandas as pd
> >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), 
> >>> ExamplePoint(2, 2)])})
> >>> df = spark.createDataFrame(pdf)
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Could not convert (1,1) with type ExamplePoint: did not recognize Python 
> value type when inferring an Arrow data type
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> >>>
> >>> df.show()
> +----------+
> |     point|
> +----------+
> |(0.0, 0.0)|
> |(0.0, 0.0)|
> +----------+
> >>> df.schema
> StructType(List(StructField(point,ExamplePointUDT,true)))
> >>> df.toPandas()
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Unsupported type in conversion to Arrow: ExamplePointUDT
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
>        point
> 0  (0.0,0.0)
> 1  (0.0,0.0)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34771) Support UDT for Pandas/Spark convertion with Arrow Optimization

2021-03-31 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34771:
---
Summary: Support UDT for Pandas/Spark convertion with Arrow Optimization  
(was: Support UDT for Pandas with Arrow Optimization)

> Support UDT for Pandas/Spark convertion with Arrow Optimization
> ---
>
> Key: SPARK-34771
> URL: https://issues.apache.org/jira/browse/SPARK-34771
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Darcy Shen
>Priority: Major
>
> {code:python}
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> from pyspark.testing.sqlutils  import ExamplePoint
> import pandas as pd
> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
> 2)])})
> df = spark.createDataFrame(pdf)
> df.toPandas()
> {code}
> With `spark.sql.execution.arrow.enabled` = false, the above snippet works 
> fine without warnings.
> With `spark.sql.execution.arrow.enabled` = true, the above snippet still works, 
> but with warnings: because the UDT is an unsupported type in the conversion, the 
> Arrow optimization is actually turned off. 
> Detailed steps to reproduce:
> {code:python}
> $ bin/pyspark
> Python 3.8.8 (default, Feb 24 2021, 13:46:16)
> [Clang 10.0.0 ] :: Anaconda, Inc. on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
>       /_/
> Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
> Spark context Web UI available at http://172.30.0.226:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1615994008526).
> SparkSession available as 'spark'.
> >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> 21/03/17 23:13:31 WARN SQLConf: The SQL config 
> 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may 
> be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' 
> instead of it.
> >>> from pyspark.testing.sqlutils  import ExamplePoint
> >>> import pandas as pd
> >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), 
> >>> ExamplePoint(2, 2)])})
> >>> df = spark.createDataFrame(pdf)
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Could not convert (1,1) with type ExamplePoint: did not recognize Python 
> value type when inferring an Arrow data type
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> >>>
> >>> df.show()
> +----------+
> |     point|
> +----------+
> |(0.0, 0.0)|
> |(0.0, 0.0)|
> +----------+
> >>> df.schema
> StructType(List(StructField(point,ExamplePointUDT,true)))
> >>> df.toPandas()
> /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: 
> UserWarning: toPandas attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Unsupported type in conversion to Arrow: ExamplePointUDT
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
>        point
> 0  (0.0,0.0)
> 1  (0.0,0.0)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34911:


Assignee: angerszhu

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Trivial
>
> Fix the unclosed code block in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34911) Fix code close issue in monitoring.md

2021-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34911.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32008
[https://github.com/apache/spark/pull/32008]

> Fix code close issue in monitoring.md
> -
>
> Key: SPARK-34911
> URL: https://issues.apache.org/jira/browse/SPARK-34911
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Fix the unclosed code block in monitoring.md



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


