[jira] [Created] (SPARK-36751) octet_length/bit_length API is not implemented on Scala/Python/R

2021-09-13 Thread Leona Yoda (Jira)
Leona Yoda created SPARK-36751:
--

 Summary: octet_length/bit_length API is not implemented  on 
Scala/Python/R
 Key: SPARK-36751
 URL: https://issues.apache.org/jira/browse/SPARK-36751
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SparkR, SQL
Affects Versions: 3.3.0
Reporter: Leona Yoda


 * octet_length: calculate the byte length of strings
 * bit_length: calculate the bit length of strings

These two string-related functions are implemented only in Spark SQL, not in the 
Scala, Python, or R APIs.

They would be useful for users who work with multi-byte characters and who mainly 
use those language APIs.
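
For reference, both functions are already reachable from the other language APIs 
through SQL expressions until dedicated functions are added; a minimal spark-shell 
sketch (the column name `s` and the sample values are made up for illustration):

{code:scala}
// spark-shell sketch; assumes the implicits of the active SparkSession.
import spark.implicits._
import org.apache.spark.sql.functions.expr

val df = Seq("abc", "日本語").toDF("s")
// length counts characters, octet_length counts bytes, bit_length counts bits.
df.select(expr("length(s)"), expr("octet_length(s)"), expr("bit_length(s)")).show()
// "日本語" -> length 3, octet_length 9, bit_length 72 (UTF-8)
{code}

Dedicated octet_length/bit_length functions in the Scala, Python, and R APIs would 
make this available without going through expr/selectExpr.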



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414754#comment-17414754
 ] 

Apache Spark commented on SPARK-36747:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/33990

> Do not collapse Project with Aggregate when correlated subqueries are present 
> in the project list
> -
>
> Key: SPARK-36747
> URL: https://issues.apache.org/jira/browse/SPARK-36747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, CollapseProject combines Project with Aggregate when the shared 
> attributes are deterministic. But if there are correlated scalar subqueries 
> in the project list that use the output of the aggregate, they cannot be 
> combined; otherwise, the rewritten plan is invalid:
> {code}
> select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) 
> s from t)
> == Optimized Logical Plan ==
> Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L]
> +- Project [sum(c2)#10L]
>+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))
>   :- LocalRelation [c2#3]
>   +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
>  +- LocalRelation [c1#2, c2#3]
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> sum(input[0, int, false])
> {code}
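
A self-contained spark-shell reproduction of the quoted query (the table `t` and 
its rows are hypothetical; any two-column numeric table shows the same behavior):

{code:scala}
// spark-shell sketch; `t` is a made-up two-column table.
import spark.implicits._
Seq((1, 10), (2, 20)).toDF("c1", "c2").createOrReplaceTempView("t")

spark.sql(
  """select (select sum(c2) from t where c1 = cast(s as int))
    |from (select sum(c2) s from t)""".stripMargin).show()
// Without the proposed fix, this is expected to fail at codegen with
// java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(...)
{code}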



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36750) Use java.util.Objects API instead of Guava API

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36750:


Assignee: (was: Apache Spark)

> Use java.util.Objects API instead of Guava API
> --
>
> Key: SPARK-36750
> URL: https://issues.apache.org/jira/browse/SPARK-36750
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Java 8 provides java.util.Objects; we can use it to replace some Guava 
> API usages that have the same semantics.
>  
>  * Preconditions.checkNotNull -> j.u.Objects.requireNonNull
>  * Objects.hashCode -> j.u.Objects.hash
>  * Objects.equal -> j.u.Objects.equals
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36747:


Assignee: (was: Apache Spark)

> Do not collapse Project with Aggregate when correlated subqueries are present 
> in the project list
> -
>
> Key: SPARK-36747
> URL: https://issues.apache.org/jira/browse/SPARK-36747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, CollapseProject combines Project with Aggregate when the shared 
> attributes are deterministic. But if there are correlated scalar subqueries 
> in the project list that use the output of the aggregate, they cannot be 
> combined; otherwise, the rewritten plan is invalid:
> {code}
> select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) 
> s from t)
> == Optimized Logical Plan ==
> Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L]
> +- Project [sum(c2)#10L]
>+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))
>   :- LocalRelation [c2#3]
>   +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
>  +- LocalRelation [c1#2, c2#3]
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> sum(input[0, int, false])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36747:


Assignee: Apache Spark

> Do not collapse Project with Aggregate when correlated subqueries are present 
> in the project list
> -
>
> Key: SPARK-36747
> URL: https://issues.apache.org/jira/browse/SPARK-36747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> Currently, CollapseProject combines Project with Aggregate when the shared 
> attributes are deterministic. But if there are correlated scalar subqueries 
> in the project list that use the output of the aggregate, they cannot be 
> combined; otherwise, the rewritten plan is invalid:
> {code}
> select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) 
> s from t)
> == Optimized Logical Plan ==
> Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L]
> +- Project [sum(c2)#10L]
>+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))
>   :- LocalRelation [c2#3]
>   +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
>  +- LocalRelation [c1#2, c2#3]
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> sum(input[0, int, false])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36750) Use java.util.Objects API instead of Guava API

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36750:


Assignee: Apache Spark

> Use java.util.Objects API instead of Guava API
> --
>
> Key: SPARK-36750
> URL: https://issues.apache.org/jira/browse/SPARK-36750
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Java 8 provides java.util.Objects; we can use it to replace some Guava 
> API usages that have the same semantics.
>  
>  * Preconditions.checkNotNull -> j.u.Objects.requireNonNull
>  * Objects.hashCode -> j.u.Objects.hash
>  * Objects.equal -> j.u.Objects.equals
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36750) Use java.util.Objects API instead of Guava API

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414753#comment-17414753
 ] 

Apache Spark commented on SPARK-36750:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/33991

> Use java.util.Objects API instead of Guava API
> --
>
> Key: SPARK-36750
> URL: https://issues.apache.org/jira/browse/SPARK-36750
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Java 8 provides java.util.Objects; we can use it to replace some Guava 
> API usages that have the same semantics.
>  
>  * Preconditions.checkNotNull -> j.u.Objects.requireNonNull
>  * Objects.hashCode -> j.u.Objects.hash
>  * Objects.equal -> j.u.Objects.equals
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version

2021-09-13 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414746#comment-17414746
 ] 

Senthil Kumar commented on SPARK-36743:
---

[~hyukjin.kwon], [~dongjoon]. Thanks for the kind and immediate response on 
this.

> Backporting SPARK-36327 changes into Spark 2.4 version
> --
>
> Key: SPARK-36743
> URL: https://issues.apache.org/jira/browse/SPARK-36743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Senthil Kumar
>Priority: Minor
>
> Could we backport the changes merged by PR 
> [https://github.com/apache/spark/pull/33577] into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33772) Build and Run Spark on Java 17

2021-09-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414745#comment-17414745
 ] 

Dongjoon Hyun commented on SPARK-33772:
---

To [~lrytz]. Thank you for the tip.

To [~h-vetinari], what makes you think that? It's actually surprising to me.
> I think this should be targeted for 3.2.x instead of 3.3.0...

> Build and Run Spark on Java 17
> --
>
> Key: SPARK-33772
> URL: https://issues.apache.org/jira/browse/SPARK-33772
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache Spark supports Java 8 and Java 11 (LTS). The next Java LTS version is 
> 17.
> ||Version||Release Date||
> |Java 17 (LTS)|September 2021|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36750) Use java.util.Objects API instead of Guava API

2021-09-13 Thread Yang Jie (Jira)
Yang Jie created SPARK-36750:


 Summary: Use java.util.Objects API instead of Guava API
 Key: SPARK-36750
 URL: https://issues.apache.org/jira/browse/SPARK-36750
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Spark Core, SQL
Affects Versions: 3.3.0
Reporter: Yang Jie


Java 8 provides java.util.Objects; we can use it to replace some Guava API 
usages that have the same semantics.

 * Preconditions.checkNotNull -> j.u.Objects.requireNonNull
 * Objects.hashCode -> j.u.Objects.hash
 * Objects.equal -> j.u.Objects.equals
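
A small Scala sketch of the proposed substitutions (the variable names are 
illustrative; the Guava calls being replaced are shown in comments):

{code:scala}
import java.util.{Objects => JObjects}

val name: String = "spark"

// Preconditions.checkNotNull(name)   ->
JObjects.requireNonNull(name)
// Objects.hashCode(name, "extra")    ->
val h: Int = JObjects.hash(name, "extra")
// Objects.equal(name, "spark")       ->
val same: Boolean = JObjects.equals(name, "spark")
{code}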

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36676) Create shaded Hive module and upgrade to higher version of Guava

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414744#comment-17414744
 ] 

Apache Spark commented on SPARK-36676:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/33989

> Create shaded Hive module and upgrade to higher version of Guava
> 
>
> Key: SPARK-36676
> URL: https://issues.apache.org/jira/browse/SPARK-36676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark is tied to Guava 14 because of Hive. This proposes to create a 
> separate module {{hive-shaded}} which shades the dependencies coming from Hive 
> and subsequently allows us to upgrade Guava independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version

2021-09-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414743#comment-17414743
 ] 

Dongjoon Hyun commented on SPARK-36743:
---

As [~hyukjin.kwon] mentioned, we cannot. Sorry for that, [~senthh].

> Backporting SPARK-36327 changes into Spark 2.4 version
> --
>
> Key: SPARK-36743
> URL: https://issues.apache.org/jira/browse/SPARK-36743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Senthil Kumar
>Priority: Minor
>
> Could we backport the changes merged by PR 
> [https://github.com/apache/spark/pull/33577] into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36676) Create shaded Hive module and upgrade to higher version of Guava

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36676:


Assignee: (was: Apache Spark)

> Create shaded Hive module and upgrade to higher version of Guava
> 
>
> Key: SPARK-36676
> URL: https://issues.apache.org/jira/browse/SPARK-36676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark is tied to Guava 14 because of Hive. This proposes to create a 
> separate module {{hive-shaded}} which shades the dependencies coming from Hive 
> and subsequently allows us to upgrade Guava independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36676) Create shaded Hive module and upgrade to higher version of Guava

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36676:


Assignee: Apache Spark

> Create shaded Hive module and upgrade to higher version of Guava
> 
>
> Key: SPARK-36676
> URL: https://issues.apache.org/jira/browse/SPARK-36676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Currently Spark is tied to Guava 14 because of Hive. This proposes to create a 
> separate module {{hive-shaded}} which shades the dependencies coming from Hive 
> and subsequently allows us to upgrade Guava independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36676) Create shaded Hive module and upgrade to higher version of Guava

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414742#comment-17414742
 ] 

Apache Spark commented on SPARK-36676:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/33989

> Create shaded Hive module and upgrade to higher version of Guava
> 
>
> Key: SPARK-36676
> URL: https://issues.apache.org/jira/browse/SPARK-36676
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently Spark is tied to Guava 14 because of Hive. This proposes to create a 
> separate module {{hive-shaded}} which shades the dependencies coming from Hive 
> and subsequently allows us to upgrade Guava independently.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34112) Upgrade ORC

2021-09-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414731#comment-17414731
 ] 

Dongjoon Hyun commented on SPARK-34112:
---

Yep, ORC 1.7 is being developed to align with this, [~h-vetinari]. :)

> Upgrade ORC
> ---
>
> Key: SPARK-34112
> URL: https://issues.apache.org/jira/browse/SPARK-34112
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache ORC doesn't support Java 14 yet. We need to upgrade it when it's ready.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36696) spark.read.parquet loads empty dataset

2021-09-13 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414721#comment-17414721
 ] 

Micah Kornfield commented on SPARK-36696:
-

What [~gershinsky] wrote seems to make sense from my reading of the code. I 
think the issue here is PARQUET-2089.

> spark.read.parquet loads empty dataset
> --
>
> Key: SPARK-36696
> URL: https://issues.apache.org/jira/browse/SPARK-36696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Blocker
> Attachments: example.parquet
>
>
> Here's a parquet file Spark 3.2/master can't read properly.
> The file was stored by pandas and must contain 3650 rows, but Spark 
> 3.2/master returns an empty dataset.
> {code:python}
> >>> import pandas as pd
> >>> len(pd.read_parquet('/path/to/example.parquet'))
> 3650
> >>> spark.read.parquet('/path/to/example.parquet').count()
> 0
> {code}
> I guess it's caused by the Parquet 1.12.0 upgrade.
> When I reverted two commits related to Parquet 1.12.0 from branch-3.2:
>  - 
> [https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa]
>  - 
> [https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da]
> it reads the data successfully.
> We need to add some workaround, or revert the commits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36706) OverwriteByExpression conversion in DataSourceV2Strategy use wrong deleteExpr translation

2021-09-13 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414717#comment-17414717
 ] 

Huaxin Gao commented on SPARK-36706:


I will fix this. Thanks for pinging me [~hyukjin.kwon]

> OverwriteByExpression conversion in DataSourceV2Strategy use wrong deleteExpr 
> translation
> -
>
> Key: SPARK-36706
> URL: https://issues.apache.org/jira/browse/SPARK-36706
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: suheng.cloud
>Priority: Major
>
> Spark version: release-3.1.2.
> We develop a Hive DataSource V2 plugin to support joins among multiple Hive 
> clusters.
> We found what may be a bug in the OverwriteByExpression conversion.
> Code location: 
> https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala#L216
> where the wrong param `deletExpr` is used, which results in duplicate filters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36683) Support secant and cosecant

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36683:


Assignee: Apache Spark

> Support secant and cosecant
> ---
>
> Key: SPARK-36683
> URL: https://issues.apache.org/jira/browse/SPARK-36683
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Yuto Akutsu
>Assignee: Apache Spark
>Priority: Major
>
> Cotangent is supported in Spark SQL but Secant and Cosecant are missing as 
> discussed [here|https://github.com/apache/spark/pull/33906].
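
Until dedicated expressions exist, both can be written with the functions that 
are already available; a minimal spark-shell sketch (the sample column `x` and 
its values are made up for illustration):

{code:scala}
// spark-shell sketch: sec(x) = 1 / cos(x), csc(x) = 1 / sin(x)
import spark.implicits._
import org.apache.spark.sql.functions.{col, cos, lit, sin}

val df = Seq(0.5, 1.0).toDF("x")
df.select((lit(1.0) / cos(col("x"))).as("sec_x"),
          (lit(1.0) / sin(col("x"))).as("csc_x")).show()
{code}

Dedicated sec/csc expressions would mainly add discoverability and consistent 
naming on top of this.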



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36683) Support secant and cosecant

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414712#comment-17414712
 ] 

Apache Spark commented on SPARK-36683:
--

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/33988

> Support secant and cosecant
> ---
>
> Key: SPARK-36683
> URL: https://issues.apache.org/jira/browse/SPARK-36683
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Yuto Akutsu
>Priority: Major
>
> Cotangent is supported in Spark SQL but Secant and Cosecant are missing as 
> discussed [here|https://github.com/apache/spark/pull/33906].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36683) Support secant and cosecant

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36683:


Assignee: (was: Apache Spark)

> Support secant and cosecant
> ---
>
> Key: SPARK-36683
> URL: https://issues.apache.org/jira/browse/SPARK-36683
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Yuto Akutsu
>Priority: Major
>
> Cotangent is supported in Spark SQL but Secant and Cosecant are missing as 
> discussed [here|https://github.com/apache/spark/pull/33906].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34208) Upgrade ORC to 1.6.7

2021-09-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414708#comment-17414708
 ] 

Dongjoon Hyun commented on SPARK-34208:
---

It's already reported, [~holden]. The fix landed as the first commit after the RC cut.

 !Screen Shot 2021-09-13 at 9.15.01 PM.png! 

That's the reason why Gengliang mentioned this already as a known issue.
{code}
SPARK-36629: Upgrade aircompressor to 1.21
{code}

> Upgrade ORC to 1.6.7
> 
>
> Key: SPARK-34208
> URL: https://issues.apache.org/jira/browse/SPARK-34208
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screen Shot 2021-09-13 at 9.15.01 PM.png
>
>
> Apache ORC 1.6.7 has the following fixes including ORC-711 Support 
> CryptoExtension in create/decryptLocalKey.
>  * 
> [https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12349470]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34208) Upgrade ORC to 1.6.7

2021-09-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414708#comment-17414708
 ] 

Dongjoon Hyun edited comment on SPARK-34208 at 9/14/21, 4:17 AM:
-

It's already reported, [~holden]. The fix landed as the first commit after RC 
cut.

 !Screen Shot 2021-09-13 at 9.15.01 PM.png! 

That's the reason why Gengliang mentioned this already as a known issue.
{code}
SPARK-36629: Upgrade aircompressor to 1.21
{code}


was (Author: dongjoon):
It's already reported, [~holden]. The fix landed the first commit after RC cut.

 !Screen Shot 2021-09-13 at 9.15.01 PM.png! 

That's the reason why Genliang mentioned this already as a known issue.
{code}
SPARK-36629: Upgrade aircompressor to 1.21
{code}

> Upgrade ORC to 1.6.7
> 
>
> Key: SPARK-34208
> URL: https://issues.apache.org/jira/browse/SPARK-34208
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screen Shot 2021-09-13 at 9.15.01 PM.png
>
>
> Apache ORC 1.6.7 has the following fixes including ORC-711 Support 
> CryptoExtension in create/decryptLocalKey.
>  * 
> [https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12349470]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34208) Upgrade ORC to 1.6.7

2021-09-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34208:
--
Attachment: Screen Shot 2021-09-13 at 9.15.01 PM.png

> Upgrade ORC to 1.6.7
> 
>
> Key: SPARK-34208
> URL: https://issues.apache.org/jira/browse/SPARK-34208
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Attachments: Screen Shot 2021-09-13 at 9.15.01 PM.png
>
>
> Apache ORC 1.6.7 has the following fixes including ORC-711 Support 
> CryptoExtension in create/decryptLocalKey.
>  * 
> [https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12349470]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36706) OverwriteByExpression conversion in DataSourceV2Strategy use wrong deleteExpr translation

2021-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414704#comment-17414704
 ] 

Hyukjin Kwon commented on SPARK-36706:
--

cc [~huaxingao] FYI

> OverwriteByExpression conversion in DataSourceV2Strategy use wrong deleteExpr 
> translation
> -
>
> Key: SPARK-36706
> URL: https://issues.apache.org/jira/browse/SPARK-36706
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: suheng.cloud
>Priority: Major
>
> Spark version: release-3.1.2.
> We develop a Hive DataSource V2 plugin to support joins among multiple Hive 
> clusters.
> We found what may be a bug in the OverwriteByExpression conversion.
> Code location: 
> https://github.com/apache/spark/blob/de351e30a90dd988b133b3d00fa6218bfcaba8b8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala#L216
> where the wrong param `deletExpr` is used, which results in duplicate filters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36749) The count result of the dimension table field changes as `executor.memory` changes.

2021-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414700#comment-17414700
 ] 

Hyukjin Kwon commented on SPARK-36749:
--

Is this bug still reproducible in Spark 3.x?

> The count result of the dimension table field changes as `executor.memory` 
> changes.
> --
>
> Key: SPARK-36749
> URL: https://issues.apache.org/jira/browse/SPARK-36749
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3
> Environment: hadoop version is:
> 2.7.5
> spark version is:
> 2.1.3
> *job default parameters:*
> spark.driver.cores=1
> spark.driver.memory=512m
> spark.executor.instances=1
> spark.executor.cores=1
> spark.executor.memory=512m
>Reporter: LanYang
>Priority: Major
> Attachments: corrent_result.log, wrong_result.log
>
>
> Hi, everyone!
> Here's a very strange question!
> The purpose of this SQL is to count the specified columns from each table 
> after joining the tables, as follows:
>  
> {quote}SELECT cast(COUNT(DISTINCT tps.prod_siginst_id) AS STRING) AS 
> siginst_cnt,
>  cast(COUNT(DISTINCT qpl.list_id) AS STRING) AS list_cnt,
>  cast(count(DISTINCT if(tb.brand_source=1,tps.prod_siginst_id,NULL)) AS 
> STRING) AS domestic_siginst_cnt,
>  cast(count(DISTINCT if(tb.brand_source=2,tps.prod_siginst_id,NULL)) AS 
> STRING) AS import_siginst_cnt,
>  cast(count(DISTINCT if(qpl.list_name NOT 
> LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS standard_cnt,
>  cast(count(DISTINCT if(qpl.list_name 
> LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS nostandard_cnt
> FROM tableA tbi
> LEFT JOIN tableB tps ON tbi.prod_inst_id=tps.prod_inst_id
> LEFT JOIN tableC qpl ON tbi.prod_type_id=qpl.list_id
> LEFT JOIN tableD ON tps.brand_id=tb.brand_id
> WHERE tbi.prod_status=1
>  AND tbi.prod_sell_status=1
>  AND tb.recommend_flag=1;
> {quote}
>  
> The phenomenon is that if I add memory to the executor, the count results for 
> the tableC fields (list_id, list_name) change as well; only once the 
> executor's memory is big enough does the result stop changing.
>  
> TableC is a dimension table and its amount of data is fixed.
>  
> In my opinion, this job should fail rather than output an incorrect count 
> result if the executor has insufficient memory.
> Could you please help me check whether this is a bug in Spark itself or 
> something wrong with my SQL?
>  
> Here is the log of this job.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36701) Structured streaming maxOffsetsPerTrigger Invalidation

2021-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414701#comment-17414701
 ] 

Hyukjin Kwon commented on SPARK-36701:
--

Can you see if it works with Spark 3.x?

> Structured streaming  maxOffsetsPerTrigger Invalidation
> ---
>
> Key: SPARK-36701
> URL: https://issues.apache.org/jira/browse/SPARK-36701
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.5
>Reporter: liujinhui
>Priority: Major
> Attachments: image-2021-09-09-13-57-21-175.png
>
>
> Why does maxOffsetsPerTrigger not work when consuming from Kafka? The task 
> fails, and YARN retries it.
> Dependency used: org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5
> There is a similar question here: 
> [https://stackoverflow.com/questions/55476504/restarting-spark-structured-streaming-job-consumes-millions-of-kafka-messages-an]
> !image-2021-09-09-13-57-21-175.png!
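
For reference, the option in question is set on the Kafka source roughly like 
this (broker address, topic name, and limit below are placeholders):

{code:scala}
// Sketch of a Kafka source with a per-micro-batch offset cap.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "10000") // upper bound on offsets consumed per trigger
  .load()
{code}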



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36749) The count result of the dimension table field changes as `executor.memory` changes.

2021-09-13 Thread LanYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LanYang updated SPARK-36749:

Description: 
Hi, everyone!

Here's a very strange question!

The purpose of this SQL is to count the specified columns from each table after 
joining the tables, as follows:

 
{quote}SELECT cast(COUNT(DISTINCT tps.prod_siginst_id) AS STRING) AS 
siginst_cnt,
 cast(COUNT(DISTINCT qpl.list_id) AS STRING) AS list_cnt,
 cast(count(DISTINCT if(tb.brand_source=1,tps.prod_siginst_id,NULL)) AS STRING) 
AS domestic_siginst_cnt,
 cast(count(DISTINCT if(tb.brand_source=2,tps.prod_siginst_id,NULL)) AS STRING) 
AS import_siginst_cnt,
 cast(count(DISTINCT if(qpl.list_name NOT 
LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS standard_cnt,
 cast(count(DISTINCT if(qpl.list_name 
LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS nostandard_cnt
FROM tableA tbi
LEFT JOIN tableB tps ON tbi.prod_inst_id=tps.prod_inst_id
LEFT JOIN tableC qpl ON tbi.prod_type_id=qpl.list_id
LEFT JOIN tableD ON tps.brand_id=tb.brand_id
WHERE tbi.prod_status=1
 AND tbi.prod_sell_status=1
 AND tb.recommend_flag=1;
{quote}
 

The phenomenon is that if I add memory to the executor, the count results for 
the tableC fields (list_id, list_name) change as well; only once the 
executor's memory is big enough does the result stop changing.

TableC is a dimension table and its amount of data is fixed.

In my opinion, this job should fail rather than output an incorrect count 
result if the executor has insufficient memory.

Could you please help me check whether this is a bug in Spark itself or 
something wrong with my SQL?

Here is the log of this job.

 

  was:
hi~, every one!

Here‘s a very strange questions!!! 

The meaning of this sql is count the number of the specified columns in each 
table after joining the table。as follows:

 
{quote}{{SELECT cast(COUNT(DISTINCT tps.prod_siginst_id) AS STRING) AS 
siginst_cnt,}}
{{ cast(COUNT(DISTINCT qpl.list_id) AS STRING) AS list_cnt,}}
{{ cast(count(DISTINCT if(tb.brand_source=1,tps.prod_siginst_id,NULL)) AS 
STRING) AS domestic_siginst_cnt,}}
{{ cast(count(DISTINCT if(tb.brand_source=2,tps.prod_siginst_id,NULL)) AS 
STRING) AS import_siginst_cnt,}}
{{ cast(count(DISTINCT if(qpl.list_name NOT 
LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS standard_cnt,}}
{{ cast(count(DISTINCT if(qpl.list_name 
LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS nostandard_cnt}}
{{FROM tableA tbi}}
{{LEFT JOIN tableB tps ON tbi.prod_inst_id=tps.prod_inst_id}}
{{LEFT JOIN tableC qpl ON tbi.prod_type_id=qpl.list_id}}
{{LEFT JOIN tableD ON tps.brand_id=tb.brand_id}}
{{WHERE tbi.prod_status=1}}
{{ AND tbi.prod_sell_status=1}}
{{ AND tb.recommend_flag=1;}}
{quote}
 

and the phenomenon of the question is if i add memory for executor, the count 
result of the tableC field(list_id,list_name) will changes as well. until the 
executor‘s memory is big enough, the result doesn't change.

 

TableC is a dimensional table and the amount of data is fixed.

 

In my opinions, this job should failed rather than output an incorrect count 
result if executor is insufficient memory.

Could you please help me check whether this is a bug of spark itself or 
something wrong with my sql writing?

 

here is log of this job.

 


> The count result of the dimension table field changes as `executor.memory` 
> changes.
> --
>
> Key: SPARK-36749
> URL: https://issues.apache.org/jira/browse/SPARK-36749
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3
> Environment: hadoop version is:
> 2.7.5
> spark version is:
> 2.1.3
> *job default parameters:*
> spark.driver.cores=1
> spark.driver.memory=512m
> spark.executor.instances=1
> spark.executor.cores=1
> spark.executor.memory=512m
>Reporter: LanYang
>Priority: Major
> Attachments: corrent_result.log, wrong_result.log
>
>
> Hi, everyone!
> Here's a very strange question!
> The purpose of this SQL is to count the specified columns from each table 
> after joining the tables, as follows:
>  
> {quote}SELECT cast(COUNT(DISTINCT tps.prod_siginst_id) AS STRING) AS 
> siginst_cnt,
>  cast(COUNT(DISTINCT qpl.list_id) AS STRING) AS list_cnt,
>  cast(count(DISTINCT if(tb.brand_source=1,tps.prod_siginst_id,NULL)) AS 
> STRING) AS domestic_siginst_cnt,
>  cast(count(DISTINCT if(tb.brand_source=2,tps.prod_siginst_id,NULL)) AS 
> STRING) AS import_siginst_cnt,
>  cast(count(DISTINCT if(qpl.list_name NOT 
> LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS standard_cnt,
>  cast(count(DISTINCT if(qpl.list_name 
> LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS nostandard_cnt
> FROM tableA tbi
> LEFT JOIN tableB

[jira] [Created] (SPARK-36749) The count result of the dimension table field changes as `executor.memory` changes.

2021-09-13 Thread LanYang (Jira)
LanYang created SPARK-36749:
---

 Summary: The count result of the dimension table field changes as 
`executor.memory` changes.
 Key: SPARK-36749
 URL: https://issues.apache.org/jira/browse/SPARK-36749
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.1.3
 Environment: hadoop version is:

2.7.5

spark version is:

2.1.3

*job default parameters:*

spark.driver.cores=1

spark.driver.memory=512m

spark.executor.instances=1

spark.executor.cores=1

spark.executor.memory=512m
Reporter: LanYang
 Attachments: corrent_result.log, wrong_result.log

Hi, everyone!

Here's a very strange question!

The purpose of this SQL is to count the specified columns from each table after 
joining the tables, as follows:

 
{quote}{{SELECT cast(COUNT(DISTINCT tps.prod_siginst_id) AS STRING) AS 
siginst_cnt,}}
{{ cast(COUNT(DISTINCT qpl.list_id) AS STRING) AS list_cnt,}}
{{ cast(count(DISTINCT if(tb.brand_source=1,tps.prod_siginst_id,NULL)) AS 
STRING) AS domestic_siginst_cnt,}}
{{ cast(count(DISTINCT if(tb.brand_source=2,tps.prod_siginst_id,NULL)) AS 
STRING) AS import_siginst_cnt,}}
{{ cast(count(DISTINCT if(qpl.list_name NOT 
LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS standard_cnt,}}
{{ cast(count(DISTINCT if(qpl.list_name 
LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS nostandard_cnt}}
{{FROM tableA tbi}}
{{LEFT JOIN tableB tps ON tbi.prod_inst_id=tps.prod_inst_id}}
{{LEFT JOIN tableC qpl ON tbi.prod_type_id=qpl.list_id}}
{{LEFT JOIN tableD ON tps.brand_id=tb.brand_id}}
{{WHERE tbi.prod_status=1}}
{{ AND tbi.prod_sell_status=1}}
{{ AND tb.recommend_flag=1;}}
{quote}
 

The phenomenon is that if I add memory to the executor, the count results for 
the tableC fields (list_id, list_name) change as well; only once the 
executor's memory is big enough does the result stop changing.

TableC is a dimension table and its amount of data is fixed.

In my opinion, this job should fail rather than output an incorrect count 
result if the executor has insufficient memory.

Could you please help me check whether this is a bug in Spark itself or 
something wrong with my SQL?

Here is the log of this job.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36749) The count result of the dimension table field changes as `executor.memory` changes.

2021-09-13 Thread LanYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LanYang updated SPARK-36749:

Attachment: wrong_result.log
corrent_result.log

> The count result of the dimension table field changes as `executor.memory` 
> changes.
> --
>
> Key: SPARK-36749
> URL: https://issues.apache.org/jira/browse/SPARK-36749
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.1.3
> Environment: hadoop version is:
> 2.7.5
> spark version is:
> 2.1.3
> *job default parameters:*
> spark.driver.cores=1
> spark.driver.memory=512m
> spark.executor.instances=1
> spark.executor.cores=1
> spark.executor.memory=512m
>Reporter: LanYang
>Priority: Major
> Attachments: corrent_result.log, wrong_result.log
>
>
> Hi, everyone!
> Here's a very strange question!
> The purpose of this SQL is to count the specified columns from each table 
> after joining the tables, as follows:
>  
> {quote}{{SELECT cast(COUNT(DISTINCT tps.prod_siginst_id) AS STRING) AS 
> siginst_cnt,}}
> {{ cast(COUNT(DISTINCT qpl.list_id) AS STRING) AS list_cnt,}}
> {{ cast(count(DISTINCT if(tb.brand_source=1,tps.prod_siginst_id,NULL)) AS 
> STRING) AS domestic_siginst_cnt,}}
> {{ cast(count(DISTINCT if(tb.brand_source=2,tps.prod_siginst_id,NULL)) AS 
> STRING) AS import_siginst_cnt,}}
> {{ cast(count(DISTINCT if(qpl.list_name NOT 
> LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS standard_cnt,}}
> {{ cast(count(DISTINCT if(qpl.list_name 
> LIKE'%un_normal%',tps.prod_siginst_id,NULL)) AS STRING) AS nostandard_cnt}}
> {{FROM tableA tbi}}
> {{LEFT JOIN tableB tps ON tbi.prod_inst_id=tps.prod_inst_id}}
> {{LEFT JOIN tableC qpl ON tbi.prod_type_id=qpl.list_id}}
> {{LEFT JOIN tableD ON tps.brand_id=tb.brand_id}}
> {{WHERE tbi.prod_status=1}}
> {{ AND tbi.prod_sell_status=1}}
> {{ AND tb.recommend_flag=1;}}
> {quote}
>  
> The phenomenon is that if I add memory to the executor, the count results for 
> the tableC fields (list_id, list_name) change as well; only once the 
> executor's memory is big enough does the result stop changing.
>  
> TableC is a dimension table and its amount of data is fixed.
>  
> In my opinion, this job should fail rather than output an incorrect count 
> result if the executor has insufficient memory.
> Could you please help me check whether this is a bug in Spark itself or 
> something wrong with my SQL?
>  
> Here is the log of this job.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36596) Review and fix issues in 3.2.0 Documents

2021-09-13 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414681#comment-17414681
 ] 

Gengliang Wang commented on SPARK-36596:


[~holden] Yes, marking this one as fixed. Thanks!

> Review and fix issues in 3.2.0 Documents
> 
>
> Key: SPARK-36596
> URL: https://issues.apache.org/jira/browse/SPARK-36596
> Project: Spark
>  Issue Type: Task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Critical
>
> Compare the 3.2.0 doc with the latest release version 3.1.2. Fix the 
> following issues:
> * Add missing `Since` annotation for new APIs
> * Remove the leaking class/object in API doc
> * Revise SQL doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36596) Review and fix issues in 3.2.0 Documents

2021-09-13 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36596.

Resolution: Fixed

> Review and fix issues in 3.2.0 Documents
> 
>
> Key: SPARK-36596
> URL: https://issues.apache.org/jira/browse/SPARK-36596
> Project: Spark
>  Issue Type: Task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Critical
>
> Compare the 3.2.0 doc with the latest release version 3.1.2. Fix the 
> following issues:
> * Add missing `Since` annotation for new APIs
> * Remove the leaking class/object in API doc
> * Revise SQL doc



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414678#comment-17414678
 ] 

Apache Spark commented on SPARK-36705:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/33987

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Minchu Yang
>Priority: Blocker
> Fix For: 3.2.0
>
>
> Push-based shuffle is not compatible with IO encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790.
> We have to disable push-based shuffle if either of these two is true.
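
A rough sketch of the intended guard, using only public configuration keys (the 
serializer-relocatability check mentioned above is omitted here, and the actual 
fix may be structured differently):

{code:scala}
import org.apache.spark.SparkConf

// Sketch: push-based shuffle should stay disabled when IO encryption is enabled.
def pushBasedShuffleAllowed(conf: SparkConf): Boolean = {
  val requested    = conf.getBoolean("spark.shuffle.push.enabled", false)
  val ioEncryption = conf.getBoolean("spark.io.encryption.enabled", false)
  requested && !ioEncryption // relocatable-serializer check intentionally omitted
}
{code}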



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414677#comment-17414677
 ] 

Apache Spark commented on SPARK-36705:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/33987

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Minchu Yang
>Priority: Blocker
> Fix For: 3.2.0
>
>
> Push-based shuffle is not compatible with IO encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790.
> We have to disable push-based shuffle if either of these two is true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-13 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-36705:
---

Assignee: Minchu Yang

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Minchu Yang
>Priority: Blocker
> Fix For: 3.2.0
>
>
> Push-based shuffle is not compatible with IO encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790.
> We have to disable push-based shuffle if either of these two is true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36748) Introduce the 'compute.isin_limit' option

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36748.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33982
[https://github.com/apache/spark/pull/33982]

> Introduce the 'compute.isin_limit' option
> -
>
> Key: SPARK-36748
> URL: https://issues.apache.org/jira/browse/SPARK-36748
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36748) Introduce the 'compute.isin_limit' option

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36748:


Assignee: Xinrong Meng

> Introduce the 'compute.isin_limit' option
> -
>
> Key: SPARK-36748
> URL: https://issues.apache.org/jira/browse/SPARK-36748
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36727) Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic

2021-09-13 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-36727:

External issue URL: https://github.com/apache/spark/pull/33986

> Support sql overwrite a path that is also being read from when 
> partitionOverwriteMode is dynamic
> 
>
> Key: SPARK-36727
> URL: https://issues.apache.org/jira/browse/SPARK-36727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tongwei
>Priority: Minor
>
> {code:sql}
> -- non-partitioned table overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
> INSERT OVERWRITE TABLE tbl SELECT 0, 1;
> INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;
> -- partitioned table static overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 
> INT);
> INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1, 1 AS col2;
> INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE 
> pt1=2021;
> {code}
> When we run the queries above, an error is thrown: "Cannot overwrite a 
> path that is also being read from".
> We need to support this operation when 
> spark.sql.sources.partitionOverwriteMode is dynamic.
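
With dynamic mode enabled, the failing self-overwrite would look roughly like 
this (a sketch only; table and column names follow the example in the description):

{code:scala}
// Sketch: enable dynamic partition overwrite, then overwrite a partition from itself.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sql(
  "INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) " +
  "SELECT col1, col2 FROM tbl WHERE pt1=2021")
{code}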



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36727) Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic

2021-09-13 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-36727:

External issue URL:   (was: https://github.com/apache/spark/pull/33986)

> Support sql overwrite a path that is also being read from when 
> partitionOverwriteMode is dynamic
> 
>
> Key: SPARK-36727
> URL: https://issues.apache.org/jira/browse/SPARK-36727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tongwei
>Priority: Minor
>
> {code:sql}
> -- non-partitioned table overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
> INSERT OVERWRITE TABLE tbl SELECT 0, 1;
> INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;
> -- partitioned table static overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 
> INT);
> INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1, 1 AS col2;
> INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE 
> pt1=2021;
> {code}
> When we run the queries above, an error is thrown: "Cannot overwrite a 
> path that is also being read from".
> We need to support this operation when 
> spark.sql.sources.partitionOverwriteMode is dynamic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36745) Cleanup pattern ExtractEquiJoinKeys

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414660#comment-17414660
 ] 

Apache Spark commented on SPARK-36745:
--

User 'YannisSismanis' has created a pull request for this issue:
https://github.com/apache/spark/pull/33985

> Cleanup pattern ExtractEquiJoinKeys
> ---
>
> Key: SPARK-36745
> URL: https://issues.apache.org/jira/browse/SPARK-36745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Yannis Sismanis
>Priority: Minor
>
> The join condition returned from ExtractEquiJoinKeys does not correspond to 
> the equi-join on the extracted left and right keys, and a call site that is 
> not aware of this can be error-prone.
> The pattern extractor should extract the rest of the original join condition 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36745) Cleanup pattern ExtractEquiJoinKeys

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36745:


Assignee: (was: Apache Spark)

> Cleanup pattern ExtractEquiJoinKeys
> ---
>
> Key: SPARK-36745
> URL: https://issues.apache.org/jira/browse/SPARK-36745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Yannis Sismanis
>Priority: Minor
>
> The join condition returned from ExtractEquiJoinKeys does not correspond to 
> the equi-join on the extracted left and right keys, and a call site that is 
> not aware of this can be error-prone.
> The pattern extractor should extract the rest of the original join condition 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36745) Cleanup pattern ExtractEquiJoinKeys

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36745:


Assignee: Apache Spark

> Cleanup pattern ExtractEquiJoinKeys
> ---
>
> Key: SPARK-36745
> URL: https://issues.apache.org/jira/browse/SPARK-36745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Yannis Sismanis
>Assignee: Apache Spark
>Priority: Minor
>
> The join condition returned from ExtractEquiJoinKeys does not correspond to 
> the equi-join on the extracted left and right keys. A call site can be risky 
> if it is not aware of that.
> The pattern extractor should extract the rest of the original join condition 
> as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414658#comment-17414658
 ] 

Apache Spark commented on SPARK-36705:
--

User 'rmcyang' has created a pull request for this issue:
https://github.com/apache/spark/pull/33984

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Priority: Blocker
> Fix For: 3.2.0
>
>
> Push based shuffle is not compatible with IO encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790.
> We have to disable push based shuffle if either of these two is true.
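
As a rough illustration of the guard this implies (not the actual patch), a hedged sketch follows; the helper name is made up, the configuration keys are the push-based shuffle and IO encryption flags, and the serializer-relocation check mentioned in the comment lives inside Spark itself:

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical helper sketching the compatibility check described above.
def pushBasedShuffleUsable(conf: SparkConf): Boolean = {
  val pushEnabled  = conf.getBoolean("spark.shuffle.push.enabled", defaultValue = false)
  val ioEncryption = conf.getBoolean("spark.io.encryption.enabled", defaultValue = false)
  // In addition to these two flags, Spark internally also requires the shuffle
  // serializer to support relocation of serialized objects; that check is
  // Spark-internal and omitted here.
  pushEnabled && !ioEncryption
}
{code}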



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35930) Upgrade kinesis-client to 1.14.4

2021-09-13 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35930:
---
Priority: Major  (was: Minor)

> Upgrade kinesis-client to 1.14.4
> 
>
> Key: SPARK-35930
> URL: https://issues.apache.org/jira/browse/SPARK-35930
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Upgrading to 1.14.1 or newer is recommended by the community for users who 
> use kinesis-client 1.14.0 due to a bug.
> https://github.com/awslabs/amazon-kinesis-client/tree/master#recommended-upgrade-for-all-users-of-the-1x-amazon-kinesis-client
> https://github.com/awslabs/amazon-kinesis-client/issues/778



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35930) Upgrade kinesis-client to 1.14.4

2021-09-13 Thread Kousuke Saruta (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414656#comment-17414656
 ] 

Kousuke Saruta commented on SPARK-35930:


[~holden]
Yes, I didn't think it's a common case so I set the minor priority but as you 
imply, it seems a correctness issue so I'll change the priority.


> Upgrade kinesis-client to 1.14.4
> 
>
> Key: SPARK-35930
> URL: https://issues.apache.org/jira/browse/SPARK-35930
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Upgrading to 1.14.1 or newer is recommended by the community for users who 
> use kinesis-client 1.14.0 due to a bug.
> https://github.com/awslabs/amazon-kinesis-client/tree/master#recommended-upgrade-for-all-users-of-the-1x-amazon-kinesis-client
> https://github.com/awslabs/amazon-kinesis-client/issues/778



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36715) explode(UDF) throw an exception

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36715.
--
Fix Version/s: 3.1.3
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 33956
[https://github.com/apache/spark/pull/33956]

> explode(UDF) throw an exception
> ---
>
> Key: SPARK-36715
> URL: https://issues.apache.org/jira/browse/SPARK-36715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Fu Chen
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> Code to reproduce:
>  
> {code:java}
> spark.udf.register("vec", (i: Int) => (0 until i).toArray)
> sql("select explode(vec(8)) as c1").show{code}
> {code:java}
> java.lang.RuntimeException: Once strategy's idempotence is broken for batch Infer Filters
>  GlobalLimit 21                                                          GlobalLimit 21
>  +- LocalLimit 21                                                        +- LocalLimit 21
>     +- Project [cast(c1#3 as string) AS c1#12]                              +- Project [cast(c1#3 as string) AS c1#12]
>        +- Generate explode(vec(8)), false, [c1#3]                              +- Generate explode(vec(8)), false, [c1#3]
>           +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))              +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
> !            +- OneRowRelation                                                       +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
> !                                                                                       +- OneRowRelation
>   at org.apache.spark.sql.errors.QueryExecutionErrors$.onceStrategyIdempotenceIsBrokenForBatchError(QueryExecutionErrors.scala:1200)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.checkBatchIdempotence(RuleExecutor.scala:168)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:254)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
>   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
>   at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:138)
>   at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:134)
>   at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:130)
>   at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:148)
>   at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166)
>   at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163)
>   at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:214)

[jira] [Resolved] (SPARK-36739) Add Apache license header to makefiles of python documents

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36739.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33979
[https://github.com/apache/spark/pull/33979]

> Add Apache license header to makefiles of python documents
> --
>
> Key: SPARK-36739
> URL: https://issues.apache.org/jira/browse/SPARK-36739
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 3.2.0
>Reporter: Leona Yoda
>Assignee: Leona Yoda
>Priority: Minor
> Fix For: 3.2.0
>
>
> * {{python/docs/make.bat}}
>  * {{python/docs/make2.bat}}
>  * {{python/docs/Makefile}}
> lack Apache license headers in their source code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36739) Add Apache license header to makefiles of python documents

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36739:


Assignee: Leona Yoda

> Add Apache license header to makefiles of python documents
> --
>
> Key: SPARK-36739
> URL: https://issues.apache.org/jira/browse/SPARK-36739
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, PySpark
>Affects Versions: 3.2.0
>Reporter: Leona Yoda
>Assignee: Leona Yoda
>Priority: Minor
>
> * {{python/docs/make.bat}}
>  * {{python/docs/make2.bat}}
>  * {{python/docs/Makefile}}
> lack Apache license headers in their source code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

2021-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414645#comment-17414645
 ] 

Hyukjin Kwon commented on SPARK-33782:
--

Thanks [~holden]!

> Place spark.files, spark.jars and spark.files under the current working 
> directory on the driver in K8S cluster mode
> ---
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In YARN cluster mode, the passed files can be accessed in the current 
> working directory. This does not appear to be the case in Kubernetes 
> cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python 
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35834.
--
Fix Version/s: 3.2.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/32989

> Use the same cleanup logic as Py4J in inheritable thread API
> 
>
> Key: SPARK-35834
> URL: https://issues.apache.org/jira/browse/SPARK-35834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> After 
> https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9,
>  the test became flaky:
> {code}
> ==
> ERROR [71.813s]: test_save_load_pipeline_estimator 
> (pyspark.ml.tests.test_tuning.CrossValidatorTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, 
> in test_save_load_pipeline_estimator
> self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, 
> in _run_test_save_load_pipeline_estimator
> cvModel2 = crossval2.fit(training)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
> bestModel = est.fit(dataset, epm[bestIndex])
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
> return self.copy(params)._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in 
> _fit
> models = pool.map(inheritable_thread_target(trainSingleClass), 
> range(numClasses))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 266, in map
> return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 644, in get
> raise self._value
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 119, in worker
> result = (True, func(*args, **kwds))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 44, in mapstar
> return list(map(*args))
>   File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
> InheritableThread._clean_py4j_conn_for_current_thread()
>   File "/__w/spark/spark/python/pyspark/util.py", line 389, in 
> _clean_py4j_conn_for_current_thread
> del connections[i]
> IndexError: deque index out of range
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33152) Constraint Propagation code causes OOM issues or increasing compilation time to hours

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414642#comment-17414642
 ] 

Apache Spark commented on SPARK-33152:
--

User 'ahshahid' has created a pull request for this issue:
https://github.com/apache/spark/pull/33983

> Constraint Propagation code causes OOM issues or increasing compilation time 
> to hours
> -
>
> Key: SPARK-33152
> URL: https://issues.apache.org/jira/browse/SPARK-33152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Asif
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We encountered this issue at Workday. 
> The issue is that the current constraint propagation code pessimistically 
> generates all possible permutations of a base constraint for the aliases in 
> the Project node.
> This causes a blow-up in the number of constraints generated, causing OOM 
> issues at SQL compile time, or queries taking 18 min to 2 hrs to compile.
> The problematic piece of code is in LogicalPlan.getAliasedConstraints
> projectList.foreach {
>   case a @ Alias(l: Literal, _) =>
>     allConstraints += EqualNullSafe(a.toAttribute, l)
>   case a @ Alias(e, _) =>
>     // For every alias in `projectList`, replace the reference in
>     // constraints by its attribute.
>     allConstraints ++= allConstraints.map(_ transform {
>       case expr: Expression if expr.semanticEquals(e) =>
>         a.toAttribute
>     })
>     allConstraints += EqualNullSafe(e, a.toAttribute)
>   case _ => // Don't change.
> }
> so consider a hypothetical plan
>  
> Project (a, a as a1, a as a2, a as a3, b, b as b1, b as b2, c, c as c1, 
> c as c2, c as c3)
>    |
> Filter f(a, b, c)
> |
> Base Relation (a, b, c)
> and so we have projection as
> a, a1, a2, a3
> b, b1, b2
> c, c1, c2, c3
> Let's say hypothetically f(a, b, c) has a occurring 1 time, b occurring 2 
> times, and c occurring 3 times.
> So at project node the number of constraints for a single base constraint 
> f(a, b, c) will be
> 4C1 * 3C2 * 4C3 = 48
> In our case, we have seen number of constraints going up to > 3 or more, 
> as there are complex case statements in the projection.
> Spark generates all these constraints pessimistically for pruning filters or 
> push down predicates for join , it may encounter when the optimizer traverses 
> up the tree.
>  
> This issue is solved at our end by modifying the Spark code to use a 
> different logic.
> The idea is simple. 
> Instead of pessimistically generating all possible combinations of the base 
> constraint, just store the original base constraints & track the aliases at 
> each level.
> The principle followed is this:
> 1) Store the base constraint and keep the track of the aliases for the 
> underlying attribute.
> 2) If the base attribute composing the constraint is not in the output set, 
> see if the constraint survives by substituting the attribute getting removed 
> with the next available alias's attribute.
>  
> To check if a filter can be pruned, just canonicalize the filter with 
> the attribute at the 0th position of the tracking list & compare it with the 
> underlying base constraint.
> To elaborate using  the plan above.
> At project node
> We have constraint f(a,b,c)
> we keep track of alias
> List 1  : a, a1.attribute, a2.attribute, a3.attribute
> List2 :  b, b1.attribute, b2.attribute 
> List3: c, c1.attribute, c2.attribute, c3.attribute
> Lets say above the project node, we encounter a filter
> f(a1, b2, c3)
> So canonicalize the filter by using the above list data, to convert it to 
> f(a,b c) & compare it with the stored base constraints.
>  
> For predicate push down, instead of generating all the redundant 
> combinations of constraints, just generate one constraint per element of the 
> alias list.
> In the current spark code , in any case, filter push down happens only for 1 
> variable at a time.
> So just expanding the filter (a,b,c) to
> f(a, b, c), f(a1, b, c), f(a2, b, c), f(a3, , b ,c), f (a, b1, c), f(a, b2, 
> c) , f(a, b, c1), f(a, b, c2), f(a, b, c3) 
> would suffice, rather than generating all the redundant combinations.
> In fact the code can be easily modified to generate only those constraints 
> which involve variables forming the join condition, so the number of 
> constraints generated on expansion is further reduced.
> We already have code to generate compound filters for push down ( join on 
> multiple conditions), which can be used for single variable condition, push 
> down too.
> Just to elaborate the logic further, if we consider the above hypothetical 
> plan (assume collapse project rule is not there)
>  
> Project (a1, a1. as a4, b,  c1, c1 as c4)
>   |
> Project (a, a as a1, a. as a2, a as a3, b, b as b1, b 

[jira] [Commented] (SPARK-34943) Upgrade flake8 to 3.8.0 or above in Jenkins

2021-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414644#comment-17414644
 ] 

Hyukjin Kwon commented on SPARK-34943:
--

Thx!

> Upgrade flake8 to 3.8.0 or above in Jenkins
> ---
>
> Key: SPARK-34943
> URL: https://issues.apache.org/jira/browse/SPARK-34943
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Shane Knapp
>Priority: Major
>
> In flake8 < 3.8.0, F401 error occurs for imports in *if* statements when 
> TYPE_CHECKING is True. However, TYPE_CHECKING is always False at runtime, so 
> there is no need to treat it as an error in static analysis.
> Since this behavior is fixed in flake8 >= 3.8.0, we should upgrade the flake8 
> installed in Jenkins to 3.8.0 or above. Otherwise, F401 errors occur for 
> several lines in pandas-on-PySpark code that uses TYPE_CHECKING.
> And also we might update the {{MINIMUM_FLAKE8}} in the {{lint-python}} from 
> 3.5.0 to 3.8.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36251) Cover GitHub Actions runs without SHA in testing script

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36251.
--
Fix Version/s: 3.2.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

This is actually fixed in https://github.com/apache/spark/pull/33472.

> Cover GitHub Actions runs without SHA in testing script
> ---
>
> Key: SPARK-36251
> URL: https://issues.apache.org/jira/browse/SPARK-36251
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> SPARK-36204 added the periodical jobs for branch-3.2 too but the job runs 
> without SHA being set.
> The test script should be able to handle this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API

2021-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414641#comment-17414641
 ] 

Hyukjin Kwon commented on SPARK-35834:
--

This is actually fixed too. I wonder why it wasn't resolved. It's not a 
regression, just a bug fix.

> Use the same cleanup logic as Py4J in inheritable thread API
> 
>
> Key: SPARK-35834
> URL: https://issues.apache.org/jira/browse/SPARK-35834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After 
> https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9,
>  the test became flaky:
> {code}
> ==
> ERROR [71.813s]: test_save_load_pipeline_estimator 
> (pyspark.ml.tests.test_tuning.CrossValidatorTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, 
> in test_save_load_pipeline_estimator
> self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, 
> in _run_test_save_load_pipeline_estimator
> cvModel2 = crossval2.fit(training)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
> bestModel = est.fit(dataset, epm[bestIndex])
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
> return self.copy(params)._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in 
> _fit
> models = pool.map(inheritable_thread_target(trainSingleClass), 
> range(numClasses))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 266, in map
> return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 644, in get
> raise self._value
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 119, in worker
> result = (True, func(*args, **kwds))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 44, in mapstar
> return list(map(*args))
>   File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
> InheritableThread._clean_py4j_conn_for_current_thread()
>   File "/__w/spark/spark/python/pyspark/util.py", line 389, in 
> _clean_py4j_conn_for_current_thread
> del connections[i]
> IndexError: deque index out of range
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24943) Convert a SQL Struct to StructType

2021-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414639#comment-17414639
 ] 

Hyukjin Kwon commented on SPARK-24943:
--

uniontype is not supported in Spark at all. For varchar and char, they are not 
supported from Spark 3.0:

{code}
scala> spark.createDataFrame(Seq("a").toDS.rdd.map(r => 
org.apache.spark.sql.Row(r)), 
org.apache.spark.sql.types.StructType.fromDDL("fullName varchar(10)"))
org.apache.spark.sql.AnalysisException: char/varchar type can only be used in 
the table schema. You can set spark.sql.legacy.charVarcharAsString to true, so 
that Spark treat them as string type as same as Spark 3.0 and earlier
  at 
org.apache.spark.sql.errors.QueryCompilationErrors$.charOrVarcharTypeAsStringUnsupportedError(QueryCompilationErrors.scala:1614)
  at 
org.apache.spark.sql.catalyst.util.CharVarcharUtils$.failIfHasCharVarchar(CharVarcharUtils.scala:64)
  at 
org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$3(SparkSession.scala:354)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:353)
  ... 47 elided
{code}

> Convert a SQL Struct to StructType
> --
>
> Key: SPARK-24943
> URL: https://issues.apache.org/jira/browse/SPARK-24943
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: mahmoud mehdi
>Priority: Minor
> Fix For: 2.4.0
>
>
> The main goal of this User Story is to add a method to StructType which does 
> the opposite of what the sql method does.
> For example, for the following SQL Struct : 
> {code:java}
> df.schema.sql
> //STRUCT<`price`: STRUCT<`amount`: BIGINT, `currency`: STRING>>{code}
>  We'll have the following output : 
> {code:java}
> StructType.fromSql(df.schema.sql)
> //StructType(StructField(price,StructType(StructField(amount,LongType,true), 
> //StructField(currency,StringType,true)),true))
> {code}
>  
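
As a side note to the discussion above, the existing DDL parsing helpers already cover part of this round-trip; a hedged sketch follows. Whether {{DataType.fromDDL}} accepts the exact backtick-quoted string produced by {{df.schema.sql}} depends on the Spark version, so treat that second call as an assumption to verify:

{code:scala}
import org.apache.spark.sql.types.{DataType, StructType}

// Parse a table-schema style DDL string back into a StructType.
val schema: StructType =
  StructType.fromDDL("price STRUCT<amount: BIGINT, currency: STRING>")

// DataType.fromDDL parses a single type string; whether it accepts the
// verbatim output of df.schema.sql is an assumption to verify per version.
val dt: DataType =
  DataType.fromDDL("STRUCT<`price`: STRUCT<`amount`: BIGINT, `currency`: STRING>>")
{code}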



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36743:
-
Fix Version/s: (was: 3.3.0)

> Backporting SPARK-36327 changes into Spark 2.4 version
> --
>
> Key: SPARK-36743
> URL: https://issues.apache.org/jira/browse/SPARK-36743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Senthil Kumar
>Priority: Minor
>
> Could we back port changes merged by PR 
> [https://github.com/apache/spark/pull/33577]  into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version

2021-09-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414638#comment-17414638
 ] 

Hyukjin Kwon commented on SPARK-36743:
--

Spark 2.x is EOL, so the backport likely won't happen.

> Backporting SPARK-36327 changes into Spark 2.4 version
> --
>
> Key: SPARK-36743
> URL: https://issues.apache.org/jira/browse/SPARK-36743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Senthil Kumar
>Priority: Minor
> Fix For: 3.3.0
>
>
> Could we back port changes merged by PR 
> [https://github.com/apache/spark/pull/33577]  into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36743) Backporting SPARK-36327 changes into Spark 2.4 version

2021-09-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36743.
--
Resolution: Incomplete

> Backporting SPARK-36327 changes into Spark 2.4 version
> --
>
> Key: SPARK-36743
> URL: https://issues.apache.org/jira/browse/SPARK-36743
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Senthil Kumar
>Priority: Minor
>
> Could we back port changes merged by PR 
> [https://github.com/apache/spark/pull/33577]  into Spark 2.4 too?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36748) Introduce the 'compute.isin_limit' option

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36748:


Assignee: (was: Apache Spark)

> Introduce the 'compute.isin_limit' option
> -
>
> Key: SPARK-36748
> URL: https://issues.apache.org/jira/browse/SPARK-36748
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36748) Introduce the 'compute.isin_limit' option

2021-09-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36748:


Assignee: Apache Spark

> Introduce the 'compute.isin_limit' option
> -
>
> Key: SPARK-36748
> URL: https://issues.apache.org/jira/browse/SPARK-36748
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36748) Introduce the 'compute.isin_limit' option

2021-09-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414632#comment-17414632
 ] 

Apache Spark commented on SPARK-36748:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/33982

> Introduce the 'compute.isin_limit' option
> -
>
> Key: SPARK-36748
> URL: https://issues.apache.org/jira/browse/SPARK-36748
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36748) Introduce the 'compute.isin_limit' option

2021-09-13 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36748:


 Summary: Introduce the 'compute.isin_limit' option
 Key: SPARK-36748
 URL: https://issues.apache.org/jira/browse/SPARK-36748
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36748) Introduce the 'compute.isin_limit' option

2021-09-13 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414629#comment-17414629
 ] 

Xinrong Meng commented on SPARK-36748:
--

I am working on that.

> Introduce the 'compute.isin_limit' option
> -
>
> Key: SPARK-36748
> URL: https://issues.apache.org/jira/browse/SPARK-36748
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-33782:
-
Target Version/s: 3.3.0

> Place spark.files, spark.jars and spark.files under the current working 
> directory on the driver in K8S cluster mode
> ---
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In YARN cluster mode, the passed files can be accessed in the current 
> working directory. This does not appear to be the case in Kubernetes 
> cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python 
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33782) Place spark.files, spark.jars and spark.files under the current working directory on the driver in K8S cluster mode

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414624#comment-17414624
 ] 

Holden Karau commented on SPARK-33782:
--

I think this missed the window for Spark 3.2, but I'm happy to pick this up for 
3.3

> Place spark.files, spark.jars and spark.files under the current working 
> directory on the driver in K8S cluster mode
> ---
>
> Key: SPARK-33782
> URL: https://issues.apache.org/jira/browse/SPARK-33782
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> In YARN cluster mode, the passed files can be accessed in the current 
> working directory. This does not appear to be the case in Kubernetes 
> cluster mode.
> By doing this, users can, for example, leverage PEX to manage Python 
> dependencies in Apache Spark:
> {code}
> pex pyspark==3.0.1 pyarrow==0.15.1 pandas==0.25.3 -o myarchive.pex
> PYSPARK_PYTHON=./myarchive.pex spark-submit --files myarchive.pex
> {code}
> See also https://github.com/apache/spark/pull/30735/files#r540935585.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33885) The position of unresolved identifier for DDL commands should be respected..

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-33885.
--
Fix Version/s: 3.2.0
 Assignee: Terry Kim
   Resolution: Fixed

> The position of unresolved identifier for DDL commands should be respected..
> 
>
> Key: SPARK-33885
> URL: https://issues.apache.org/jira/browse/SPARK-33885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.2.0
>
>
> Currently, there are many DDL commands where the positions of the unresolved 
> identifiers are incorrect:
> {code:java}
> scala> sql("DESCRIBE TABLE abc")
> org.apache.spark.sql.AnalysisException: Table or view not found: abc; line 1 
> pos 0;
> {code}
> Note that the pos should be 15 in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34019) Keep same quantiles of UI and restful API

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414622#comment-17414622
 ] 

Holden Karau commented on SPARK-34019:
--

This is targeting 4 since it's a backwards-incompatible change.

> Keep same quantiles of UI and restful API
> -
>
> Key: SPARK-34019
> URL: https://issues.apache.org/jira/browse/SPARK-34019
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Keep same quantiles of UI and restful API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34064) Broadcast job is not aborted even the SQL statement canceled

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414621#comment-17414621
 ] 

Holden Karau commented on SPARK-34064:
--

[~inetfuture] it's hard to say since the initial fix was reverted; if you want 
to pick it up yourself, that's an option.

> Broadcast job is not aborted even the SQL statement canceled
> 
>
> Key: SPARK-34064
> URL: https://issues.apache.org/jira/browse/SPARK-34064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.1, 3.2.0
>Reporter: Lantao Jin
>Priority: Minor
> Attachments: Screen Shot 2021-01-11 at 12.03.13 PM.png
>
>
> SPARK-27036 introduced a runId for BroadcastExchangeExec to resolve the 
> problem that a broadcast job is not aborted when a broadcast timeout happens. 
> Since the runId is a random UUID, when a SQL statement is cancelled, these 
> broadcast sub-jobs are still not canceled as a whole.
>  !Screen Shot 2021-01-11 at 12.03.13 PM.png|width=100%! 
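
For background, statement-level cancellation typically relies on job groups; a minimal sketch follows, assuming a Scala session. The group id and table names are illustrative, and the fact that broadcast sub-jobs are tracked under their own random runId (and so are not aborted together with the statement) is exactly the gap described above:

{code:scala}
// Illustrative only: jobs issued under a job group can be cancelled together.
spark.sparkContext.setJobGroup("stmt-42", "my statement", interruptOnCancel = true)
val result = spark.sql(
  "SELECT /*+ BROADCAST(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id")

// From another thread: cancels jobs tagged with the group, but broadcast
// sub-jobs tracked by a separate random runId are not cancelled as a whole.
spark.sparkContext.cancelJobGroup("stmt-42")
{code}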



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34156) Unify the output of DDL and pass output attributes properly

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-34156.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

> Unify the output of DDL and pass output attributes properly
> ---
>
> Key: SPARK-34156
> URL: https://issues.apache.org/jira/browse/SPARK-34156
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> The current implementation of some DDL commands does not unify the output and 
> does not pass the output properly to the physical command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34208) Upgrade ORC to 1.6.7

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414618#comment-17414618
 ] 

Holden Karau commented on SPARK-34208:
--

Is ORC-965 a regression and if so should we switch this to blocker?

> Upgrade ORC to 1.6.7
> 
>
> Key: SPARK-34208
> URL: https://issues.apache.org/jira/browse/SPARK-34208
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache ORC 1.6.7 has the following fixes including ORC-711 Support 
> CryptoExtension in create/decryptLocalKey.
>  * 
> [https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12349470]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34156) Unify the output of DDL and pass output attributes properly

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414619#comment-17414619
 ] 

Holden Karau commented on SPARK-34156:
--

All of the sub issues are resolved so I'm going to go ahead and resolve this.

> Unify the output of DDL and pass output attributes properly
> ---
>
> Key: SPARK-34156
> URL: https://issues.apache.org/jira/browse/SPARK-34156
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of some DDL commands does not unify the output and 
> does not pass the output properly to the physical command.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34329) When hit ApplicationAttemptNotFoundException, we can't just stop app for all case

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414617#comment-17414617
 ] 

Holden Karau commented on SPARK-34329:
--

Is this a regression or has this behaviour been around in previous versions?

> When hit  ApplicationAttemptNotFoundException, we can't just stop app for all 
> case
> --
>
> Key: SPARK-34329
> URL: https://issues.apache.org/jira/browse/SPARK-34329
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> We often hit a case where, because of the YARN queue's settings, some app's 
> container is preempted by a higher-level request due to the scheduling 
> framework.
>  
> In this case Spark just stops immediately. In this case we could have a retry. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34208) Upgrade ORC to 1.6.7

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34208:
-
Description: 
Apache ORC 1.6.7 has the following fixes including ORC-711 Support 
CryptoExtension in create/decryptLocalKey.
 * 
[https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12349470]

> Upgrade ORC to 1.6.7
> 
>
> Key: SPARK-34208
> URL: https://issues.apache.org/jira/browse/SPARK-34208
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Apache ORC 1.6.7 has the following fixes including ORC-711 Support 
> CryptoExtension in create/decryptLocalKey.
>  * 
> [https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12318320&version=12349470]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34478) Ignore or reject wrong config when start sparksession

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34478:
-
Priority: Minor  (was: Trivial)

> Ignore or reject wrong config when start sparksession
> -
>
> Key: SPARK-34478
> URL: https://issues.apache.org/jira/browse/SPARK-34478
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>
> When use 
> {code:java}
> SparkSession.builder().config()
> {code}
> In this method a user may set `spark.driver.memory`. But by the time this 
> code runs, the JVM has already started, so the configuration won't take 
> effect, yet the Spark UI will still show the configured value. 
> So we should ignore or reject such misconfigurations.
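
A short sketch of the distinction being made, assuming a typical spark-submit launch; the option values are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("config-demo")
  // Takes effect: a runtime SQL conf can be changed after the JVM is up.
  .config("spark.sql.shuffle.partitions", "64")
  // Silently ineffective here: the driver JVM heap was already sized at
  // launch, so this value only shows up in the UI without changing memory.
  .config("spark.driver.memory", "8g")
  .getOrCreate()

// JVM-level settings belong on the launch command instead, e.g.
//   spark-submit --driver-memory 8g ...
{code}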



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34478) Ignore or reject wrong config when start sparksession

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34478:
-
Priority: Trivial  (was: Major)

> Ignore or reject wrong config when start sparksession
> -
>
> Key: SPARK-34478
> URL: https://issues.apache.org/jira/browse/SPARK-34478
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Trivial
>
> When use 
> {code:java}
> SparkSession.builder().config()
> {code}
> In this method a user may set `spark.driver.memory`. But by the time this 
> code runs, the JVM has already started, so the configuration won't take 
> effect, yet the Spark UI will still show the configured value. 
> So we should ignore or reject such misconfigurations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34478) Ignore or reject wrong config when start sparksession

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34478:
-
Issue Type: Improvement  (was: Bug)

> Ignore or reject wrong config when start sparksession
> -
>
> Key: SPARK-34478
> URL: https://issues.apache.org/jira/browse/SPARK-34478
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Trivial
>
> When use 
> {code:java}
> SparkSession.builder().config()
> {code}
> In this method a user may set `spark.driver.memory`. But by the time this 
> code runs, the JVM has already started, so the configuration won't take 
> effect, yet the Spark UI will still show the configured value. 
> So we should ignore or reject such misconfigurations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34943) Upgrade flake8 to 3.8.0 or above in Jenkins

2021-09-13 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414616#comment-17414616
 ] 

Shane Knapp edited comment on SPARK-34943 at 9/13/21, 10:15 PM:


flake8 tests passing w/3.8.0!

from [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143223]
{noformat}

Running Python style checks

starting python compilation test...
python compilation succeeded.

downloading pycodestyle from 
https://raw.githubusercontent.com/PyCQA/pycodestyle/2.7.0/pycodestyle.py...
starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

The mypy command was not found. Skipping for now.

all lint-python tests passed!{noformat}
checking on the jenkins worker directly:
{noformat}
(py36) jenkins@research-jenkins-worker-08:~/workspace$ grep MINIMUM_FLAKE8
SparkPullRequestBuilder/dev/lint-python MINIMUM_FLAKE8="3.8.0"
{noformat}


was (Author: shaneknapp):
flake8 tests passing w/3.8.0!

from [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143223]
{noformat}

Running Python style checks

starting python compilation test...
python compilation succeeded.

downloading pycodestyle from 
https://raw.githubusercontent.com/PyCQA/pycodestyle/2.7.0/pycodestyle.py...
starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

The mypy command was not found. Skipping for now.

all lint-python tests passed!{noformat}
checking on the jenkins worker directly:
{noformat}
(py36) jenkins@research-jenkins-worker-08:~/workspace$ grep MINIMUM_FLAKE8 
SparkPullRequestBuilder/dev/lint-python MINIMUM_FLAKE8="3.8.0"
{noformat}

> Upgrade flake8 to 3.8.0 or above in Jenkins
> ---
>
> Key: SPARK-34943
> URL: https://issues.apache.org/jira/browse/SPARK-34943
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Shane Knapp
>Priority: Major
>
> In flake8 < 3.8.0, F401 error occurs for imports in *if* statements when 
> TYPE_CHECKING is True. However, TYPE_CHECKING is always False at runtime, so 
> there is no need to treat it as an error in static analysis.
> Since this behavior is fixed in flake8 >= 3.8.0, we should upgrade the flake8 
> installed in Jenkins to 3.8.0 or above. Otherwise, F401 errors occur for 
> several lines in pandas-on-PySpark code that uses TYPE_CHECKING.
> And also we might update the {{MINIMUM_FLAKE8}} in the {{lint-python}} from 
> 3.5.0 to 3.8.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34943) Upgrade flake8 to 3.8.0 or above in Jenkins

2021-09-13 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414616#comment-17414616
 ] 

Shane Knapp commented on SPARK-34943:
-

flake8 tests passing w/3.8.0!

from [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/143223]
{noformat}

Running Python style checks

starting python compilation test...
python compilation succeeded.

downloading pycodestyle from 
https://raw.githubusercontent.com/PyCQA/pycodestyle/2.7.0/pycodestyle.py...
starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks passed.

The mypy command was not found. Skipping for now.

all lint-python tests passed!{noformat}
checking on the jenkins worker directly:
{noformat}
(py36) jenkins@research-jenkins-worker-08:~/workspace$ grep MINIMUM_FLAKE8 
SparkPullRequestBuilder/dev/lint-python MINIMUM_FLAKE8="3.8.0"
{noformat}

> Upgrade flake8 to 3.8.0 or above in Jenkins
> ---
>
> Key: SPARK-34943
> URL: https://issues.apache.org/jira/browse/SPARK-34943
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Shane Knapp
>Priority: Major
>
> In flake8 < 3.8.0, F401 error occurs for imports in *if* statements when 
> TYPE_CHECKING is True. However, TYPE_CHECKING is always False at runtime, so 
> there is no need to treat it as an error in static analysis.
> Since this behavior is fixed in flake8 >= 3.8.0, we should upgrade the flake8 
> installed in Jenkins to 3.8.0 or above. Otherwise, F401 errors occur for 
> several lines in pandas-on-PySpark code that uses TYPE_CHECKING.
> And also we might update the {{MINIMUM_FLAKE8}} in the {{lint-python}} from 
> 3.5.0 to 3.8.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36653) Implement Series.__xor__

2021-09-13 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-36653.
---
Fix Version/s: 3.3.0
 Assignee: dgd_contributor
   Resolution: Fixed

Issue resolved by pull request 33911
https://github.com/apache/spark/pull/33911

> Implement Series.__xor__
> 
>
> Key: SPARK-36653
> URL: https://issues.apache.org/jira/browse/SPARK-36653
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: dgd_contributor
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34530) logError for interrupting block migrations is too high

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34530:
-
Affects Version/s: 3.3.0

> logError for interrupting block migrations is too high
> --
>
> Key: SPARK-34530
> URL: https://issues.apache.org/jira/browse/SPARK-34530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.2.0, 3.3.0
>Reporter: Holden Karau
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-36462:
-
Affects Version/s: 3.3.0

> Allow Spark on Kube to operate without polling or watchers
> --
>
> Key: SPARK-36462
> URL: https://issues.apache.org/jira/browse/SPARK-36462
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Holden Karau
>Priority: Minor
>
> Add an option to Spark on Kube to not track the individual executor pods and 
> just assume K8s is doing what it's asked. This would be a developer feature 
> intended to minimize load on etcd & the driver.
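
For illustration only, a sketch of how such a developer toggle might be 
surfaced at submit time; the config key below is hypothetical and does not 
exist in Spark:
{noformat}
# Hypothetical developer flag: skip executor pod polling/watchers and trust
# the Kubernetes API server to run the pods that were requested.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.executor.podTrackingEnabled=false \
  ...
{noformat}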



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36581) Add back transformAllExpressions to AnalysisHelper

2021-09-13 Thread Yingyi Bu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yingyi Bu resolved SPARK-36581.
---
Resolution: Not A Problem

> Add back transformAllExpressions to AnalysisHelper
> --
>
> Key: SPARK-36581
> URL: https://issues.apache.org/jira/browse/SPARK-36581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yingyi Bu
>Priority: Minor
>
> We might still want to keep the function in Spark 3.2 for API compatibility.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36581) Add back transformAllExpressions to AnalysisHelper

2021-09-13 Thread Yingyi Bu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414593#comment-17414593
 ] 

Yingyi Bu commented on SPARK-36581:
---

No, we don't need to keep this interface anymore. We were worried about binary 
compatibility issues with libraries compiled against Spark 3.1, but in that case 
users should either recompile their jars against 3.2 or stick to 3.1.

> Add back transformAllExpressions to AnalysisHelper
> --
>
> Key: SPARK-36581
> URL: https://issues.apache.org/jira/browse/SPARK-36581
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yingyi Bu
>Priority: Minor
>
> We might still want to keep the function in Spark 3.2 for API compatibility.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-13 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-36705.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33976
[https://github.com/apache/spark/pull/33976]

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Priority: Blocker
> Fix For: 3.2.0
>
>
> Push-based shuffle is not compatible with IO encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790.
> We have to disable push-based shuffle if either of these two is true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34943) Upgrade flake8 to 3.8.0 or above in Jenkins

2021-09-13 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414585#comment-17414585
 ] 

Shane Knapp commented on SPARK-34943:
-

done:

 
{noformat}
parallel-ssh -h ubuntu_workers.txt -i 
'/home/jenkins/anaconda2/envs/py36/bin/python -c "import flake8; 
print(flake8.__version__)"'
[1] 13:58:53 [SUCCESS] research-jenkins-worker-03
3.8.0
[2] 13:58:53 [SUCCESS] research-jenkins-worker-02
3.8.0
[3] 13:58:53 [SUCCESS] research-jenkins-worker-06
3.8.0
[4] 13:58:53 [SUCCESS] research-jenkins-worker-07
3.8.0
[5] 13:58:53 [SUCCESS] research-jenkins-worker-05
3.8.0
[6] 13:58:53 [SUCCESS] research-jenkins-worker-04
3.8.0
[7] 13:58:53 [SUCCESS] research-jenkins-worker-01
3.8.0
[8] 13:58:54 [SUCCESS] research-jenkins-worker-08
3.8.0{noformat}

> Upgrade flake8 to 3.8.0 or above in Jenkins
> ---
>
> Key: SPARK-34943
> URL: https://issues.apache.org/jira/browse/SPARK-34943
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Shane Knapp
>Priority: Major
>
> In flake8 < 3.8.0, an F401 error is raised for imports placed under *if* 
> TYPE_CHECKING blocks. However, TYPE_CHECKING is always False at runtime, so 
> there is no need to treat such imports as errors in static analysis.
> Since this behavior is fixed in flake8 >= 3.8.0, we should upgrade the flake8 
> installed in Jenkins to 3.8.0 or above. Otherwise, F401 errors are raised for 
> several lines in pandas-on-PySpark that use TYPE_CHECKING.
> We should also update {{MINIMUM_FLAKE8}} in the {{lint-python}} script from 
> 3.5.0 to 3.8.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36681) Fail to load Snappy codec

2021-09-13 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414577#comment-17414577
 ] 

L. C. Hsieh commented on SPARK-36681:
-

A possible workaround is to use the pure-Java implementation in snappy-java, so 
it doesn't try to load the native library and we avoid the link error.

Let me try it locally to verify and then note it here for the release notes.
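
For illustration, a rough sketch of that workaround, under the assumption that 
the bundled snappy-java honors the {{org.xerial.snappy.purejava}} system 
property; both the property name and its behavior under the Hadoop relocation 
still need the local verification mentioned above:
{noformat}
# Unverified sketch: ask snappy-java for its pure-Java codec so no native
# library is loaded, avoiding the UnsatisfiedLinkError shown in the issue
# description below.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dorg.xerial.snappy.purejava=true" \
  --conf "spark.executor.extraJavaOptions=-Dorg.xerial.snappy.purejava=true" \
  ...
{noformat}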

> Fail to load Snappy codec
> -
>
> Key: SPARK-36681
> URL: https://issues.apache.org/jira/browse/SPARK-36681
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> snappy-java, being a native library, should not be relocated in the Hadoop 
> shaded client libraries that Spark currently uses. When trying to use 
> SnappyCodec to write a sequence file, we encounter the following error:
> {code}
> [info]   Cause: java.lang.UnsatisfiedLinkError: 
> org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawCompress(Ljava/nio/ByteBuffer;IILjava/nio/ByteBuffer;I)I
> [info]   at 
> org.apache.hadoop.shaded.org.xerial.snappy.SnappyNative.rawCompress(Native 
> Method)   
>   
> [info]   at 
> org.apache.hadoop.shaded.org.xerial.snappy.Snappy.compress(Snappy.java:151)   
>   
>
> [info]   at 
> org.apache.hadoop.io.compress.snappy.SnappyCompressor.compressDirectBuf(SnappyCompressor.java:282)
> [info]   at 
> org.apache.hadoop.io.compress.snappy.SnappyCompressor.compress(SnappyCompressor.java:210)
> [info]   at 
> org.apache.hadoop.io.compress.BlockCompressorStream.compress(BlockCompressorStream.java:149)
> [info]   at 
> org.apache.hadoop.io.compress.BlockCompressorStream.finish(BlockCompressorStream.java:142)
> [info]   at 
> org.apache.hadoop.io.SequenceFile$BlockCompressWriter.writeBuffer(SequenceFile.java:1589)
>  
> [info]   at 
> org.apache.hadoop.io.SequenceFile$BlockCompressWriter.sync(SequenceFile.java:1605)
> [info]   at 
> org.apache.hadoop.io.SequenceFile$BlockCompressWriter.close(SequenceFile.java:1629)
>  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-13 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-36747:
-
Description: 
Currently CollapseProject combines Project with Aggregate when the shared 
attributes are deterministic. But if there are correlated scalar subqueries in 
the project list that use the output of the aggregate, the two cannot be 
combined; otherwise the plan after the rewrite is not valid:

{code}
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s 
from t)

== Optimized Logical Plan ==
Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L]
+- Project [sum(c2)#10L]
   +- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))
  :- LocalRelation [c2#3]
  +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
 +- LocalRelation [c1#2, c2#3]

java.lang.UnsupportedOperationException: Cannot generate code for expression: 
sum(input[0, int, false])
{code}

  was:
Currently CollapseProject combines Project with Aggregate when the shared 
attributes are deterministic. But if there are correlated scalar subqueries in 
the project list that use the output of the aggregate, the two cannot be 
combined; otherwise the plan after the rewrite is not valid:

{code}
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s 
from t)

== Optimized Logical Plan ==
Aggregate [sum(b)#28L AS scalarsubquery(s)#29L]
+- Project [sum(b)#28L]
   +- Join LeftOuter, (a#20 = cast(sum(b#21) as int))
  :- LocalRelation [b#21]
  +- Aggregate [a#20], [sum(b#21) AS sum(b)#28L, a#20]
 +- LocalRelation [a#20, b#21]

java.lang.UnsupportedOperationException: Cannot generate code for expression: 
sum(input[0, int, false])
{code}


> Do not collapse Project with Aggregate when correlated subqueries are present 
> in the project list
> -
>
> Key: SPARK-36747
> URL: https://issues.apache.org/jira/browse/SPARK-36747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently CollapseProject combines Project with Aggregate when the shared 
> attributes are deterministic. But if there are correlated scalar subqueries 
> in the project list that use the output of the aggregate, the two cannot be 
> combined; otherwise the plan after the rewrite is not valid:
> {code}
> select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) 
> s from t)
> == Optimized Logical Plan ==
> Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L]
> +- Project [sum(c2)#10L]
>+- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))
>   :- LocalRelation [c2#3]
>   +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
>  +- LocalRelation [c1#2, c2#3]
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> sum(input[0, int, false])
> {code}
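
For illustration, a small PySpark sketch that mirrors the query above; the 
table and column names follow the description, and this is an illustrative 
reproduction rather than a confirmed test case:
{code}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.master("local[1]").appName("spark-36747-sketch").getOrCreate()
)

# A tiny table t(c1 int, c2 int), matching the description above.
spark.createDataFrame([(1, 10), (1, 20), (2, 30)], ["c1", "c2"]).createOrReplaceTempView("t")

# The outer Project holds a correlated scalar subquery over the Aggregate's
# output, so CollapseProject must not merge that Project into the Aggregate.
spark.sql("""
    SELECT (SELECT sum(c2) FROM t WHERE c1 = CAST(s AS INT))
    FROM (SELECT sum(c2) AS s FROM t)
""").explain(extended=True)
{code}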



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-13 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-36747:
-
Description: 
Currently CollapseProject combines Project with Aggregate when the shared 
attributes are deterministic. But if there are correlated scalar subqueries in 
the project list that use the output of the aggregate, the two cannot be 
combined; otherwise the plan after the rewrite is not valid:

{code}
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s 
from t)

== Optimized Logical Plan ==
Aggregate [sum(b)#28L AS scalarsubquery(s)#29L]
+- Project [sum(b)#28L]
   +- Join LeftOuter, (a#20 = cast(sum(b#21) as int))
  :- LocalRelation [b#21]
  +- Aggregate [a#20], [sum(b#21) AS sum(b)#28L, a#20]
 +- LocalRelation [a#20, b#21]

java.lang.UnsupportedOperationException: Cannot generate code for expression: 
sum(input[0, int, false])
{code}

  was:
Currently CollapseProject combines Project with Aggregate when the shared 
attributes are deterministic. But if there are correlated scalar subqueries in 
the project list that use the output of the aggregate, the two cannot be 
combined; otherwise the plan after the rewrite is not valid:
```
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s 
from t)

== Optimized Logical Plan ==
Aggregate [sum(b)#28L AS scalarsubquery(s)#29L]
+- Project [sum(b)#28L]
   +- Join LeftOuter, (a#20 = cast(sum(b#21) as int))
  :- LocalRelation [b#21]
  +- Aggregate [a#20], [sum(b#21) AS sum(b)#28L, a#20]
 +- LocalRelation [a#20, b#21]

java.lang.UnsupportedOperationException: Cannot generate code for expression: 
sum(input[0, int, false])
```


> Do not collapse Project with Aggregate when correlated subqueries are present 
> in the project list
> -
>
> Key: SPARK-36747
> URL: https://issues.apache.org/jira/browse/SPARK-36747
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently CollapseProject combines Project with Aggregate when the shared 
> attributes are deterministic. But if there are correlated scalar subqueries 
> in the project list that use the output of the aggregate, the two cannot be 
> combined; otherwise the plan after the rewrite is not valid:
> {code}
> select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) 
> s from t)
> == Optimized Logical Plan ==
> Aggregate [sum(b)#28L AS scalarsubquery(s)#29L]
> +- Project [sum(b)#28L]
>+- Join LeftOuter, (a#20 = cast(sum(b#21) as int))
>   :- LocalRelation [b#21]
>   +- Aggregate [a#20], [sum(b#21) AS sum(b)#28L, a#20]
>  +- LocalRelation [a#20, b#21]
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> sum(input[0, int, false])
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36747) Do not collapse Project with Aggregate when correlated subqueries are present in the project list

2021-09-13 Thread Allison Wang (Jira)
Allison Wang created SPARK-36747:


 Summary: Do not collapse Project with Aggregate when correlated 
subqueries are present in the project list
 Key: SPARK-36747
 URL: https://issues.apache.org/jira/browse/SPARK-36747
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Allison Wang


Currently CollapseProject combines Project with Aggregate when the shared 
attributes are deterministic. But if there are correlated scalar subqueries in 
the project list that use the output of the aggregate, the two cannot be 
combined; otherwise the plan after the rewrite is not valid:
```
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s 
from t)

== Optimized Logical Plan ==
Aggregate [sum(b)#28L AS scalarsubquery(s)#29L]
+- Project [sum(b)#28L]
   +- Join LeftOuter, (a#20 = cast(sum(b#21) as int))
  :- LocalRelation [b#21]
  +- Aggregate [a#20], [sum(b#21) AS sum(b)#28L, a#20]
 +- LocalRelation [a#20, b#21]

java.lang.UnsupportedOperationException: Cannot generate code for expression: 
sum(input[0, int, false])
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34530) logError for interrupting block migrations is too high

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414565#comment-17414565
 ] 

Holden Karau commented on SPARK-34530:
--

My bad on not describing this enough; I've honestly forgotten. I'll dig back 
into this and update the description this week.

> logError for interrupting block migrations is too high
> --
>
> Key: SPARK-34530
> URL: https://issues.apache.org/jira/browse/SPARK-34530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.1.1, 3.2.0
>Reporter: Holden Karau
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34943) Upgrade flake8 to 3.8.0 or above in Jenkins

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34943:
-
Issue Type: Improvement  (was: Bug)

> Upgrade flake8 to 3.8.0 or above in Jenkins
> ---
>
> Key: SPARK-34943
> URL: https://issues.apache.org/jira/browse/SPARK-34943
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Shane Knapp
>Priority: Major
>
> In flake8 < 3.8.0, an F401 error is raised for imports placed under *if* 
> TYPE_CHECKING blocks. However, TYPE_CHECKING is always False at runtime, so 
> there is no need to treat such imports as errors in static analysis.
> Since this behavior is fixed in flake8 >= 3.8.0, we should upgrade the flake8 
> installed in Jenkins to 3.8.0 or above. Otherwise, F401 errors are raised for 
> several lines in pandas-on-PySpark that use TYPE_CHECKING.
> We should also update {{MINIMUM_FLAKE8}} in the {{lint-python}} script from 
> 3.5.0 to 3.8.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414563#comment-17414563
 ] 

Holden Karau commented on SPARK-35531:
--

Did this use to work?

> Can not insert into hive bucket table if create table with upper case schema
> 
>
> Key: SPARK-35531
> URL: https://issues.apache.org/jira/browse/SPARK-35531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Hongyi Zhang
>Priority: Major
>
>  
>  
> create table TEST1(
>  V1 BIGINT,
>  S1 INT)
>  partitioned by (PK BIGINT)
>  clustered by (V1)
>  sorted by (S1)
>  into 200 buckets
>  STORED AS PARQUET;
>  
> insert into test1
>  select
>  * from values(1,1,1);
>  
>  
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35834) Use the same cleanup logic as Py4J in inheritable thread API

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414562#comment-17414562
 ] 

Holden Karau commented on SPARK-35834:
--

Is this a test-only issue or a regression for Python users?

> Use the same cleanup logic as Py4J in inheritable thread API
> 
>
> Key: SPARK-35834
> URL: https://issues.apache.org/jira/browse/SPARK-35834
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> After 
> https://github.com/apache/spark/commit/6d309914df422d9f0c96edfd37924ecb8f29e3a9,
>  the test became flaky:
> {code}
> ==
> ERROR [71.813s]: test_save_load_pipeline_estimator 
> (pyspark.ml.tests.test_tuning.CrossValidatorTests)
> --
> Traceback (most recent call last):
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, 
> in test_save_load_pipeline_estimator
> self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
>   File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, 
> in _run_test_save_load_pipeline_estimator
> cvModel2 = crossval2.fit(training)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
> bestModel = est.fit(dataset, epm[bestIndex])
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
> return self.copy(params)._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
> model = stage.fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
> return self._fit(dataset)
>   File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in 
> _fit
> models = pool.map(inheritable_thread_target(trainSingleClass), 
> range(numClasses))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 266, in map
> return self._map_async(func, iterable, mapstar, chunksize).get()
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 644, in get
> raise self._value
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 119, in worker
> result = (True, func(*args, **kwds))
>   File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 
> 44, in mapstar
> return list(map(*args))
>   File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
> InheritableThread._clean_py4j_conn_for_current_thread()
>   File "/__w/spark/spark/python/pyspark/util.py", line 389, in 
> _clean_py4j_conn_for_current_thread
> del connections[i]
> IndexError: deque index out of range
> --
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35930) Upgrade kinesis-client to 1.14.4

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414560#comment-17414560
 ] 

Holden Karau commented on SPARK-35930:
--

So to be clear, is it minor because we don't normally launch multiple clients 
in the same JVM? Otherwise, this does seem like a potential correctness issue.

> Upgrade kinesis-client to 1.14.4
> 
>
> Key: SPARK-35930
> URL: https://issues.apache.org/jira/browse/SPARK-35930
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Due to a bug in kinesis-client 1.14.0, the community recommends that its 
> users upgrade to 1.14.1 or newer.
> https://github.com/awslabs/amazon-kinesis-client/tree/master#recommended-upgrade-for-all-users-of-the-1x-amazon-kinesis-client
> https://github.com/awslabs/amazon-kinesis-client/issues/778



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36238) Spark UI load event timeline too slow for huge stage

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414557#comment-17414557
 ] 

Holden Karau commented on SPARK-36238:
--

How's it going, [~angerszhuuu]?

> Spark UI  load event timeline too slow for huge stage
> -
>
> Key: SPARK-36238
> URL: https://issues.apache.org/jira/browse/SPARK-36238
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36251) Cover GitHub Actions runs without SHA in testing script

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414556#comment-17414556
 ] 

Holden Karau commented on SPARK-36251:
--

Is this a blocker for 3.2 since it might affect release correctness?

> Cover GitHub Actions runs without SHA in testing script
> ---
>
> Key: SPARK-36251
> URL: https://issues.apache.org/jira/browse/SPARK-36251
> Project: Spark
>  Issue Type: Test
>  Components: Project Infra
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-36204 added the periodic jobs for branch-3.2 as well, but those jobs 
> run without a SHA being set.
> The test script should be able to handle this case.
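
For illustration, a minimal sketch of the kind of guard the testing script 
could add; the environment variable name and the fallback below are 
assumptions for illustration, not the script's actual logic:
{code}
import os
import subprocess

# Periodic (cron-triggered) runs have no pull-request SHA set, so fall back
# to the currently checked-out commit instead of failing.
sha = os.environ.get("GITHUB_SHA")
if not sha:
    sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
print("Running tests against commit %s" % sha)
{code}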



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36433) Logs should show correct URL of where HistoryServer is started

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-36433:
-
Priority: Blocker  (was: Major)

> Logs should show correct URL of where HistoryServer is started
> --
>
> Key: SPARK-36433
> URL: https://issues.apache.org/jira/browse/SPARK-36433
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Thejdeep Gudivada
>Priority: Blocker
>
> Due to a recent refactoring of the WebUI bind() code, the log message that 
> prints the bound host and port was moved, and as a result the printed info is 
> incorrect.
>  
> Example log - 21/08/05 10:47:38 INFO HistoryServer: Bound HistoryServer to 
> 0.0.0.0, and started at :-1
>  
> Notice above that the port is incorrect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36433) Logs should show correct URL of where HistoryServer is started

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414555#comment-17414555
 ] 

Holden Karau commented on SPARK-36433:
--

I think if this is a regression we should make it a blocker, since finding the 
History Server is an important part of how people debug their applications.

> Logs should show correct URL of where HistoryServer is started
> --
>
> Key: SPARK-36433
> URL: https://issues.apache.org/jira/browse/SPARK-36433
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.2.0
>Reporter: Thejdeep Gudivada
>Priority: Major
>
> Due to a recent refactoring of the WebUI bind() code, the log message that 
> prints the bound host and port was moved, and as a result the printed info is 
> incorrect.
>  
> Example log - 21/08/05 10:47:38 INFO HistoryServer: Bound HistoryServer to 
> 0.0.0.0, and started at :-1
>  
> Notice above that the port is incorrect.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers

2021-09-13 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17414554#comment-17414554
 ] 

Holden Karau commented on SPARK-36462:
--

I'll probably pick this up this week.

> Allow Spark on Kube to operate without polling or watchers
> --
>
> Key: SPARK-36462
> URL: https://issues.apache.org/jira/browse/SPARK-36462
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Minor
>
> Add an option to Spark on Kube to not track the individual executor pods and 
> just assume K8s is doing what it's asked. This would be a developer feature 
> intended to minimize load on etcd & the driver.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36543) Decommission logs too frequent when waiting migration to finish

2021-09-13 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-36543:
-
Shepherd: Holden Karau

> Decommission logs too frequent when waiting migration to finish
> ---
>
> Key: SPARK-36543
> URL: https://issues.apache.org/jira/browse/SPARK-36543
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.3.0
>Reporter: wuyi
>Priority: Major
>
> {code:java}
> 21/08/18 08:14:31 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:31 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:31 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:32 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:32 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:32 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:33 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:33 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:33 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:34 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:34 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:34 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:35 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:35 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:35 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:36 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:36 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:36 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:37 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:37 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:37 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:38 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:38 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:38 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:39 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:39 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:39 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:40 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:40 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:40 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:41 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:41 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:41 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:42 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:42 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:42 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:43 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:43 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:43 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> 21/08/18 08:14:44 INFO CoarseGrainedExecutorBackend: Checking to see if we 
> can shutdown.
> 21/08/18 08:14:44 INFO CoarseGrainedExecutorBackend: No running tasks, 
> checking migrations
> 21/08/18 08:14:44 INFO CoarseGrainedExecutorBackend: All blocks not yet 
> migrated.
> ...{code}
> It takes some time to migrate data (shuffle or RDD blocks), so logging every 
> second is too frequent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


