[jira] [Created] (SPARK-37266) Optimize the analysis for view text of persist view and fix security vulnerabilities caused by sql tampering
jiaan.geng created SPARK-37266:
----------------------------------

             Summary: Optimize the analysis for view text of persist view and fix security vulnerabilities caused by sql tampering
                 Key: SPARK-37266
                 URL: https://issues.apache.org/jira/browse/SPARK-37266
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: jiaan.geng


The current implementation of persist view is to create a Hive table with the view text. The view text is just a query string, so attackers may tamper with it through various means. For example:
{code:java}
select * from tab1
{code}
could be tampered with to become
{code:java}
drop table tab1
{code}
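For illustration, a minimal sketch of how the stored view text can be inspected (a hedged example, not the proposed fix; it assumes a running Spark session, and the table {{tab1}} and view name {{v}} are placeholders):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("view-text-demo").getOrCreate()

// A persistent view stores its SQL text in the catalog (backed by the Hive metastore).
spark.sql("CREATE TABLE tab1 (id INT) USING parquet")
spark.sql("CREATE VIEW v AS SELECT * FROM tab1")

// The stored text appears as "View Text" in the extended description; this
// plain string is what gets re-analyzed whenever the view is queried.
spark.sql("DESCRIBE TABLE EXTENDED v").show(truncate = false)
{code}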
[jira] [Updated] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
[ https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng updated SPARK-37266:
-------------------------------
    Description:
The current implementation of persistent view is to create a Hive table with the view text. The view text is just a query string, so attackers may tamper with it through various means. For example:
{code:java}
select * from tab1
{code}
could be tampered with to become
{code:java}
drop table tab1
{code}

  was:
The current implementation of persist view is to create a Hive table with the view text. The view text is just a query string, so attackers may tamper with it through various means. For example:
{code:java}
select * from tab1
{code}
could be tampered with to become
{code:java}
drop table tab1
{code}


> Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37266
>                 URL: https://issues.apache.org/jira/browse/SPARK-37266
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: jiaan.geng
>            Priority: Major
>
> The current implementation of persistent view is to create a Hive table with the view text.
> The view text is just a query string, so attackers may tamper with it through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
> could be tampered with to become
> {code:java}
> drop table tab1
> {code}
[jira] [Updated] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
[ https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jiaan.geng updated SPARK-37266:
-------------------------------
    Summary: Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering  (was: Optimize the analysis for view text of persist view and fix security vulnerabilities caused by sql tampering)

> Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37266
>                 URL: https://issues.apache.org/jira/browse/SPARK-37266
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: jiaan.geng
>            Priority: Major
>
> The current implementation of persist view is to create a Hive table with the view text.
> The view text is just a query string, so attackers may tamper with it through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
> could be tampered with to become
> {code:java}
> drop table tab1
> {code}
[jira] [Created] (SPARK-37267) OptimizeSkewInRebalancePartitions should support optimizing non-root nodes
XiDuo You created SPARK-37267:
---------------------------------

             Summary: OptimizeSkewInRebalancePartitions should support optimizing non-root nodes
                 Key: SPARK-37267
                 URL: https://issues.apache.org/jira/browse/SPARK-37267
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: XiDuo You


`OptimizeSkewInRebalancePartitions` is currently applied only when `RebalancePartitions` is the root node, but sometimes we expect a local sort after `RebalancePartitions`, which can improve the compression ratio.

After SPARK-36184, it is easy to validate whether the rule introduces an extra shuffle, and the output partitioning is ensured by `AQEShuffleReadExec.outputPartitioning`.

So it is safe to make `OptimizeSkewInRebalancePartitions` support optimizing non-root nodes.
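To illustrate the shape of query this targets, a hedged sketch of a rebalance followed by a local sort (the table name {{t}}, column {{key}}, and output path are placeholders):
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rebalance-sort-demo").getOrCreate()

// RebalancePartitions is introduced by the REBALANCE hint; here it is not the
// root node, because a local sort (and the write) sit on top of it.
spark.sql("SELECT /*+ REBALANCE */ * FROM t")
  .sortWithinPartitions("key")              // local sort to improve the compression ratio
  .write.mode("overwrite").parquet("/tmp/t_out")
{code}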
[jira] [Commented] (SPARK-37267) OptimizeSkewInRebalancePartitions should support optimizing non-root nodes
[ https://issues.apache.org/jira/browse/SPARK-37267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441578#comment-17441578 ]

Apache Spark commented on SPARK-37267:
--------------------------------------

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34542

> OptimizeSkewInRebalancePartitions should support optimizing non-root nodes
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-37267
>                 URL: https://issues.apache.org/jira/browse/SPARK-37267
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: XiDuo You
>            Priority: Major
>
> `OptimizeSkewInRebalancePartitions` is currently applied only when `RebalancePartitions` is the root node, but sometimes we expect a local sort after `RebalancePartitions`, which can improve the compression ratio.
> After SPARK-36184, it is easy to validate whether the rule introduces an extra shuffle, and the output partitioning is ensured by `AQEShuffleReadExec.outputPartitioning`.
> So it is safe to make `OptimizeSkewInRebalancePartitions` support optimizing non-root nodes.
[jira] [Assigned] (SPARK-37267) OptimizeSkewInRebalancePartitions should support optimizing non-root nodes
[ https://issues.apache.org/jira/browse/SPARK-37267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37267:
------------------------------------

    Assignee:     (was: Apache Spark)

> OptimizeSkewInRebalancePartitions should support optimizing non-root nodes
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-37267
>                 URL: https://issues.apache.org/jira/browse/SPARK-37267
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: XiDuo You
>            Priority: Major
>
> `OptimizeSkewInRebalancePartitions` is currently applied only when `RebalancePartitions` is the root node, but sometimes we expect a local sort after `RebalancePartitions`, which can improve the compression ratio.
> After SPARK-36184, it is easy to validate whether the rule introduces an extra shuffle, and the output partitioning is ensured by `AQEShuffleReadExec.outputPartitioning`.
> So it is safe to make `OptimizeSkewInRebalancePartitions` support optimizing non-root nodes.
[jira] [Assigned] (SPARK-37267) OptimizeSkewInRebalancePartitions should support optimizing non-root nodes
[ https://issues.apache.org/jira/browse/SPARK-37267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37267:
------------------------------------

    Assignee: Apache Spark

> OptimizeSkewInRebalancePartitions should support optimizing non-root nodes
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-37267
>                 URL: https://issues.apache.org/jira/browse/SPARK-37267
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: XiDuo You
>            Assignee: Apache Spark
>            Priority: Major
>
> `OptimizeSkewInRebalancePartitions` is currently applied only when `RebalancePartitions` is the root node, but sometimes we expect a local sort after `RebalancePartitions`, which can improve the compression ratio.
> After SPARK-36184, it is easy to validate whether the rule introduces an extra shuffle, and the output partitioning is ensured by `AQEShuffleReadExec.outputPartitioning`.
> So it is safe to make `OptimizeSkewInRebalancePartitions` support optimizing non-root nodes.
[jira] [Commented] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
[ https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441580#comment-17441580 ]

Apache Spark commented on SPARK-37266:
--------------------------------------

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34543

> Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37266
>                 URL: https://issues.apache.org/jira/browse/SPARK-37266
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: jiaan.geng
>            Priority: Major
>
> The current implementation of persistent view is to create a Hive table with the view text.
> The view text is just a query string, so attackers may tamper with it through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
> could be tampered with to become
> {code:java}
> drop table tab1
> {code}
[jira] [Assigned] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
[ https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37266:
------------------------------------

    Assignee: Apache Spark

> Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37266
>                 URL: https://issues.apache.org/jira/browse/SPARK-37266
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: jiaan.geng
>            Assignee: Apache Spark
>            Priority: Major
>
> The current implementation of persistent view is to create a Hive table with the view text.
> The view text is just a query string, so attackers may tamper with it through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
> could be tampered with to become
> {code:java}
> drop table tab1
> {code}
[jira] [Assigned] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
[ https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37266:
------------------------------------

    Assignee:     (was: Apache Spark)

> Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37266
>                 URL: https://issues.apache.org/jira/browse/SPARK-37266
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: jiaan.geng
>            Priority: Major
>
> The current implementation of persistent view is to create a Hive table with the view text.
> The view text is just a query string, so attackers may tamper with it through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
> could be tampered with to become
> {code:java}
> drop table tab1
> {code}
[jira] [Assigned] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-37022:
------------------------------------

    Assignee: Maciej Szymkiewicz

> Use black as a formatter for the whole PySpark codebase.
> ---------------------------------------------------------
>
>                 Key: SPARK-37022
>                 URL: https://issues.apache.org/jira/browse/SPARK-37022
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Maciej Szymkiewicz
>            Assignee: Maciej Szymkiewicz
>            Priority: Major
>         Attachments: black-diff-stats.txt, pyproject.toml
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. It is used by a number of projects, both small and large, including prominent ones like pandas, scikit-learn, Django, or SQLAlchemy. Black is already used to format {{pyspark.pandas}} and (though not enforced) stub files.
> We should consider using black to enforce formatting of all PySpark files. There are multiple reasons to do that:
> - Consistency: black is already used across the existing codebase, and black-formatted chunks of code are already added to modules other than pyspark.pandas as a result of type hints inlining (SPARK-36845).
> - Lower cost of contributing and reviewing: formatting can be automatically enforced and applied.
> - Simpler reviews: in general, black-formatted code produces small and highly readable diffs.
> - Reduced effort to maintain patched forks: smaller diffs + predictable formatting.
> Risks:
> - Initial reformatting requires quite significant changes.
> - Applying black will break blame in the GitHub UI (for git in general, see [Avoiding ruining git blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
> - To simplify backporting, black will have to be applied to all active branches.
[jira] [Resolved] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-37022.
----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 34297
[https://github.com/apache/spark/pull/34297]

> Use black as a formatter for the whole PySpark codebase.
> ---------------------------------------------------------
>
>                 Key: SPARK-37022
>                 URL: https://issues.apache.org/jira/browse/SPARK-37022
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Maciej Szymkiewicz
>            Assignee: Maciej Szymkiewicz
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: black-diff-stats.txt, pyproject.toml
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. It is used by a number of projects, both small and large, including prominent ones like pandas, scikit-learn, Django, or SQLAlchemy. Black is already used to format {{pyspark.pandas}} and (though not enforced) stub files.
> We should consider using black to enforce formatting of all PySpark files. There are multiple reasons to do that:
> - Consistency: black is already used across the existing codebase, and black-formatted chunks of code are already added to modules other than pyspark.pandas as a result of type hints inlining (SPARK-36845).
> - Lower cost of contributing and reviewing: formatting can be automatically enforced and applied.
> - Simpler reviews: in general, black-formatted code produces small and highly readable diffs.
> - Reduced effort to maintain patched forks: smaller diffs + predictable formatting.
> Risks:
> - Initial reformatting requires quite significant changes.
> - Applying black will break blame in the GitHub UI (for git in general, see [Avoiding ruining git blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
> - To simplify backporting, black will have to be applied to all active branches.
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441610#comment-17441610 ]

Apache Spark commented on SPARK-37022:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34544

> Use black as a formatter for the whole PySpark codebase.
> ---------------------------------------------------------
>
>                 Key: SPARK-37022
>                 URL: https://issues.apache.org/jira/browse/SPARK-37022
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Maciej Szymkiewicz
>            Assignee: Maciej Szymkiewicz
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: black-diff-stats.txt, pyproject.toml
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. It is used by a number of projects, both small and large, including prominent ones like pandas, scikit-learn, Django, or SQLAlchemy. Black is already used to format {{pyspark.pandas}} and (though not enforced) stub files.
> We should consider using black to enforce formatting of all PySpark files. There are multiple reasons to do that:
> - Consistency: black is already used across the existing codebase, and black-formatted chunks of code are already added to modules other than pyspark.pandas as a result of type hints inlining (SPARK-36845).
> - Lower cost of contributing and reviewing: formatting can be automatically enforced and applied.
> - Simpler reviews: in general, black-formatted code produces small and highly readable diffs.
> - Reduced effort to maintain patched forks: smaller diffs + predictable formatting.
> Risks:
> - Initial reformatting requires quite significant changes.
> - Applying black will break blame in the GitHub UI (for git in general, see [Avoiding ruining git blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
> - To simplify backporting, black will have to be applied to all active branches.
[jira] [Commented] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.
[ https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441612#comment-17441612 ]

Apache Spark commented on SPARK-37022:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34544

> Use black as a formatter for the whole PySpark codebase.
> ---------------------------------------------------------
>
>                 Key: SPARK-37022
>                 URL: https://issues.apache.org/jira/browse/SPARK-37022
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Maciej Szymkiewicz
>            Assignee: Maciej Szymkiewicz
>            Priority: Major
>             Fix For: 3.3.0
>
>         Attachments: black-diff-stats.txt, pyproject.toml
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. It is used by a number of projects, both small and large, including prominent ones like pandas, scikit-learn, Django, or SQLAlchemy. Black is already used to format {{pyspark.pandas}} and (though not enforced) stub files.
> We should consider using black to enforce formatting of all PySpark files. There are multiple reasons to do that:
> - Consistency: black is already used across the existing codebase, and black-formatted chunks of code are already added to modules other than pyspark.pandas as a result of type hints inlining (SPARK-36845).
> - Lower cost of contributing and reviewing: formatting can be automatically enforced and applied.
> - Simpler reviews: in general, black-formatted code produces small and highly readable diffs.
> - Reduced effort to maintain patched forks: smaller diffs + predictable formatting.
> Risks:
> - Initial reformatting requires quite significant changes.
> - Applying black will break blame in the GitHub UI (for git in general, see [Avoiding ruining git blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
> - To simplify backporting, black will have to be applied to all active branches.
[jira] [Created] (SPARK-37268) Remove unused method call in FileScanRDD
Junfan Zhang created SPARK-37268:
------------------------------------

             Summary: Remove unused method call in FileScanRDD
                 Key: SPARK-37268
                 URL: https://issues.apache.org/jira/browse/SPARK-37268
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.2.0
            Reporter: Junfan Zhang
[jira] [Assigned] (SPARK-37268) Remove unused method call in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37268:
------------------------------------

    Assignee:     (was: Apache Spark)

> Remove unused method call in FileScanRDD
> -----------------------------------------
>
>                 Key: SPARK-37268
>                 URL: https://issues.apache.org/jira/browse/SPARK-37268
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Junfan Zhang
>            Priority: Major
[jira] [Assigned] (SPARK-37268) Remove unused method call in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37268:
------------------------------------

    Assignee: Apache Spark

> Remove unused method call in FileScanRDD
> -----------------------------------------
>
>                 Key: SPARK-37268
>                 URL: https://issues.apache.org/jira/browse/SPARK-37268
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Junfan Zhang
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Commented] (SPARK-37268) Remove unused method call in FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-37268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441669#comment-17441669 ]

Apache Spark commented on SPARK-37268:
--------------------------------------

User 'zuston' has created a pull request for this issue:
https://github.com/apache/spark/pull/34545

> Remove unused method call in FileScanRDD
> -----------------------------------------
>
>                 Key: SPARK-37268
>                 URL: https://issues.apache.org/jira/browse/SPARK-37268
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Junfan Zhang
>            Priority: Major
[jira] [Created] (SPARK-37269) The partitionOverwriteMode option is not respected when using insertInto
David Szakallas created SPARK-37269:
---------------------------------------

             Summary: The partitionOverwriteMode option is not respected when using insertInto
                 Key: SPARK-37269
                 URL: https://issues.apache.org/jira/browse/SPARK-37269
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: David Szakallas


From the documentation of the {{spark.sql.sources.partitionOverwriteMode}} configuration option:
{quote}This can also be set as an output option for a data source using key partitionOverwriteMode (which takes precedence over this setting), e.g. dataframe.write.option("partitionOverwriteMode", "dynamic").save(path).
{quote}
This is true when using .save(); however, .insertInto() does not respect the output option.
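A hedged sketch of the two write paths being compared ({{df}}, the partition column {{dt}}, the table name {{tbl}}, and the output path are placeholders; per the report, only the first respects the per-write option):
{code:java}
// Path-based write: the per-write option takes precedence, as documented.
df.write
  .option("partitionOverwriteMode", "dynamic")
  .mode("overwrite")
  .partitionBy("dt")
  .parquet("/tmp/out")

// Table-based write: per the report, insertInto() ignores the same option and
// falls back to the spark.sql.sources.partitionOverwriteMode session setting.
df.write
  .option("partitionOverwriteMode", "dynamic")
  .mode("overwrite")
  .insertInto("tbl")
{code}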
[jira] [Created] (SPARK-37270) Incorrect result of filter using isNull condition
Tomasz Kus created SPARK-37270:
----------------------------------

             Summary: Incorrect result of filter using isNull condition
                 Key: SPARK-37270
                 URL: https://issues.apache.org/jira/browse/SPARK-37270
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.2.0
            Reporter: Tomasz Kus


Simple code that reproduces this issue:
{code:java}
val frame = Seq((false, 1)).toDF("bool", "number")
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null"))
  .filter(col("conditions").isNull)
  .show(false)
{code}
Although the "conditions" column is null:
{code:java}
+-----+------+----------+
|bool |number|conditions|
+-----+------+----------+
|false|1     |null      |
+-----+------+----------+
{code}
an empty result is shown.

Execution plans:
{code:java}
== Parsed Logical Plan ==
'Filter isnull('conditions)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
   +- LogicalRDD [bool#124, number#125], false

== Analyzed Logical Plan ==
bool: boolean, number: int, conditions: string
Filter isnull(conditions#252)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
   +- LogicalRDD [bool#124, number#125], false

== Optimized Logical Plan ==
LocalRelation <empty>, [bool#124, number#125, conditions#252]

== Physical Plan ==
LocalTableScan <empty>, [bool#124, number#125, conditions#252]
{code}
After removing the checkpoint, the proper result is returned and the execution plans are as follows:
{code:java}
== Parsed Logical Plan ==
'Filter isnull('conditions)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
   +- Project [_1#119 AS bool#124, _2#120 AS number#125]
      +- LocalRelation [_1#119, _2#120]

== Analyzed Logical Plan ==
bool: boolean, number: int, conditions: string
Filter isnull(conditions#256)
+- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
   +- Project [_1#119 AS bool#124, _2#120 AS number#125]
      +- LocalRelation [_1#119, _2#120]

== Optimized Logical Plan ==
LocalRelation [bool#124, number#125, conditions#256]

== Physical Plan ==
LocalTableScan [bool#124, number#125, conditions#256]
{code}
It seems that the most important difference is LogicalRDD -> LocalRelation.

The following workarounds retrieve the correct result:
1) remove the checkpoint
2) add an explicit .otherwise(null) to the when
3) add checkpoint() or cache() just before the filter
4) downgrade to Spark 3.1.2
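For concreteness, a sketch of workaround (2) above, using the same {{frame}} as in the reproduction:
{code:java}
import org.apache.spark.sql.functions.{col, lit, when}

// Adding an explicit otherwise(null) branch makes the isNull filter
// return the expected row even with the checkpoint in place.
frame
  .checkpoint()
  .withColumn("conditions", when(col("bool"), "I am not null").otherwise(lit(null)))
  .filter(col("conditions").isNull)
  .show(false)
{code}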
[jira] [Resolved] (SPARK-37261) Check adding partitions with ANSI intervals
[ https://issues.apache.org/jira/browse/SPARK-37261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk resolved SPARK-37261.
------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 34537
[https://github.com/apache/spark/pull/34537]

> Check adding partitions with ANSI intervals
> --------------------------------------------
>
>                 Key: SPARK-37261
>                 URL: https://issues.apache.org/jira/browse/SPARK-37261
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Assignee: Max Gekk
>            Priority: Major
>             Fix For: 3.3.0
>
> Add tests that check adding partitions with ANSI intervals via the ALTER TABLE command.
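A sketch of the kind of statement such tests would exercise (hedged: the table and column names are illustrative, and the exact partition-literal syntax is an assumption on my part):
{code:java}
// A partitioned table whose partition column is an ANSI year-month interval.
spark.sql("CREATE TABLE t (data INT, part INTERVAL YEAR) USING parquet PARTITIONED BY (part)")
// Add a partition whose value is an ANSI interval literal.
spark.sql("ALTER TABLE t ADD PARTITION (part = INTERVAL '1' YEAR)")
{code}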
[jira] [Updated] (SPARK-37270) Incorrect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomasz Kus updated SPARK-37270:
-------------------------------
    Component/s: SQL

> Incorrect result of filter using isNull condition
> --------------------------------------------------
>
>                 Key: SPARK-37270
>                 URL: https://issues.apache.org/jira/browse/SPARK-37270
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 3.2.0
>            Reporter: Tomasz Kus
>            Priority: Major
>
> Simple code that reproduces this issue:
> {code:java}
> val frame = Seq((false, 1)).toDF("bool", "number")
> frame
>   .checkpoint()
>   .withColumn("conditions", when(col("bool"), "I am not null"))
>   .filter(col("conditions").isNull)
>   .show(false)
> {code}
> Although the "conditions" column is null:
> {code:java}
> +-----+------+----------+
> |bool |number|conditions|
> +-----+------+----------+
> |false|1     |null      |
> +-----+------+----------+
> {code}
> an empty result is shown.
> Execution plans:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
>
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#252)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#252]
>    +- LogicalRDD [bool#124, number#125], false
>
> == Optimized Logical Plan ==
> LocalRelation <empty>, [bool#124, number#125, conditions#252]
>
> == Physical Plan ==
> LocalTableScan <empty>, [bool#124, number#125, conditions#252]
> {code}
> After removing the checkpoint, the proper result is returned and the execution plans are as follows:
> {code:java}
> == Parsed Logical Plan ==
> 'Filter isnull('conditions)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
>
> == Analyzed Logical Plan ==
> bool: boolean, number: int, conditions: string
> Filter isnull(conditions#256)
> +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END AS conditions#256]
>    +- Project [_1#119 AS bool#124, _2#120 AS number#125]
>       +- LocalRelation [_1#119, _2#120]
>
> == Optimized Logical Plan ==
> LocalRelation [bool#124, number#125, conditions#256]
>
> == Physical Plan ==
> LocalTableScan [bool#124, number#125, conditions#256]
> {code}
> It seems that the most important difference is LogicalRDD -> LocalRelation.
> The following workarounds retrieve the correct result:
> 1) remove the checkpoint
> 2) add an explicit .otherwise(null) to the when
> 3) add checkpoint() or cache() just before the filter
> 4) downgrade to Spark 3.1.2
[jira] [Resolved] (SPARK-37236) Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
[ https://issues.apache.org/jira/browse/SPARK-37236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz resolved SPARK-37236.
----------------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 34510
[https://github.com/apache/spark/pull/34510]

> Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-37236
>                 URL: https://issues.apache.org/jira/browse/SPARK-37236
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: dch nguyen
>            Assignee: dch nguyen
>            Priority: Major
>             Fix For: 3.3.0
[jira] [Assigned] (SPARK-37236) Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
[ https://issues.apache.org/jira/browse/SPARK-37236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maciej Szymkiewicz reassigned SPARK-37236:
------------------------------------------

    Assignee: dch nguyen

> Inline type hints for KernelDensity.pyi, test.py in python/pyspark/mllib/stat/
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-37236
>                 URL: https://issues.apache.org/jira/browse/SPARK-37236
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: dch nguyen
>            Assignee: dch nguyen
>            Priority: Major
[jira] [Commented] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
[ https://issues.apache.org/jira/browse/SPARK-37045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441755#comment-17441755 ]

Max Gekk commented on SPARK-37045:
----------------------------------

I am working on this.

> Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
> -------------------------------------------------
>
>                 Key: SPARK-37045
>                 URL: https://issues.apache.org/jira/browse/SPARK-37045
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Terry Kim
>            Priority: Major
>
> Extract the ALTER TABLE .. ADD COLUMNS tests to a common place so they run for both v1 and v2 datasources. Some tests can be placed in v1- and v2-specific test suites.
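For reference, a hedged sketch of the command under test; a unified suite would run statements like these (illustrative table and column names) against both v1 and v2 catalogs:
{code:java}
spark.sql("CREATE TABLE t (id INT) USING parquet")
// The statement family that the unified tests would exercise.
spark.sql("ALTER TABLE t ADD COLUMNS (value STRING, ts TIMESTAMP)")
spark.sql("DESCRIBE TABLE t").show()
{code}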
[jira] [Assigned] (SPARK-37045) Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
[ https://issues.apache.org/jira/browse/SPARK-37045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Max Gekk reassigned SPARK-37045:
--------------------------------

    Assignee: Max Gekk

> Unify v1 and v2 ALTER TABLE .. ADD COLUMNS tests
> -------------------------------------------------
>
>                 Key: SPARK-37045
>                 URL: https://issues.apache.org/jira/browse/SPARK-37045
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Terry Kim
>            Assignee: Max Gekk
>            Priority: Major
>
> Extract the ALTER TABLE .. ADD COLUMNS tests to a common place so they run for both v1 and v2 datasources. Some tests can be placed in v1- and v2-specific test suites.
[jira] [Commented] (SPARK-36575) Executor lost may cause spark stage to hang
[ https://issues.apache.org/jira/browse/SPARK-36575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441796#comment-17441796 ]

wuyi commented on SPARK-36575:
------------------------------

FYI: the fix was reverted due to test issues.

> Executor lost may cause spark stage to hang
> --------------------------------------------
>
>                 Key: SPARK-36575
>                 URL: https://issues.apache.org/jira/browse/SPARK-36575
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>    Affects Versions: 2.3.3
>            Reporter: hujiahua
>            Assignee: hujiahua
>            Priority: Major
>             Fix For: 3.3.0
>
> When an executor finishes a task of some stage, the driver receives a `StatusUpdate` event to handle it. At the same time, the driver may find that the executor's heartbeat timed out, so the driver also needs to handle an ExecutorLost event simultaneously. There is a race condition here, which can leave the task never rescheduled and the stage hanging forever.
> The problem is that `TaskResultGetter.enqueueSuccessfulTask` uses an asynchronous thread to handle the successful task, which means the synchronized lock of `TaskSchedulerImpl` is released prematurely midway: https://github.com/apache/spark/blob/branch-2.3/core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala#L61. So `TaskSchedulerImpl` may handle executorLost first, and the asynchronous thread then goes on to handle the successful task, leaving `TaskSetManager.successful` and `TaskSetManager.tasksSuccessful` with wrong results.
> Then `HeartbeatReceiver.expireDeadHosts` executed `killAndReplaceExecutor`, which made `TaskSchedulerImpl.executorLost` execute twice. `copiesRunning(index) -= 1` is processed in `executorLost`, so running it twice drove `copiesRunning(index)` to -1, which led the stage to hang.
> Related log when the issue occurred:
> 21/08/05 02:58:14,784 INFO [dispatcher-event-loop-8] TaskSetManager: Starting task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724, partition 4004, ANY, 7994 bytes)
> 21/08/05 03:00:24,126 ERROR [dispatcher-event-loop-4] TaskSchedulerImpl: Lost executor 366724 on 10.109.89.3: Executor heartbeat timed out after 140830 ms
> 21/08/05 03:00:24,218 WARN [dispatcher-event-loop-4] TaskSetManager: Lost task 4004.0 in stage 1328625.0 (TID 347212402, 10.109.89.3, executor 366724): ExecutorLostFailure (executor 366724 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 140830 ms
> 21/08/05 03:00:24,542 INFO [task-result-getter-2] TaskSetManager: Finished task 4004.0 in stage 1328625.0 (TID 347212402) in 129758 ms on 10.109.89.3 (executor 366724) (3047/5400)
> 21/08/05 03:00:34,621 INFO [dispatcher-event-loop-8] TaskSchedulerImpl: Executor 366724 on 10.109.89.3 killed by driver.
> 21/08/05 03:00:34,771 INFO [spark-listener-group-executorManagement] ExecutorMonitor: Executor 366724 removed (new total is 793)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Executor lost: 366724 (epoch 417416)
> 21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] BlockManagerMasterEndpoint: Trying to remove executor 366724 from BlockManagerMaster.
> 21/08/05 03:00:42,360 INFO [dispatcher-event-loop-14] BlockManagerMasterEndpoint: Removing block manager BlockManagerId(366724, 10.109.89.3, 43402, None)
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] BlockManagerMaster: Removed 366724 successfully in removeExecutor
> 21/08/05 03:00:42,360 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle files lost for executor: 366724 (epoch 417416)
> 21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Executor lost: 366724 (epoch 417473)
> 21/08/05 03:00:44,584 INFO [dispatcher-event-loop-15] BlockManagerMasterEndpoint: Trying to remove executor 366724 from BlockManagerMaster.
> 21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] BlockManagerMaster: Removed 366724 successfully in removeExecutor
> 21/08/05 03:00:44,584 INFO [dag-scheduler-event-loop] DAGScheduler: Shuffle files lost for executor: 366724 (epoch 417473)
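A simplified, hypothetical sketch of the hazard described above (not Spark's actual classes; names are analogues only). The point is that the lock is released when the enqueueing block ends, so executor-loss handling can interleave with the asynchronous success handling:
{code:java}
import java.util.concurrent.Executors

object RaceSketch {
  private val pool = Executors.newFixedThreadPool(4)
  private val lock = new Object
  private var copiesRunning = 1

  // Analogue of enqueueSuccessfulTask: the result is handled on a pool thread,
  // so the caller's lock is released before the state update actually runs.
  def handleSuccessfulTask(): Unit = lock.synchronized {
    pool.execute(() => lock.synchronized { copiesRunning -= 1 })
  } // lock released here, before the pool thread runs

  // Analogue of executorLost: if it also decrements around the async update,
  // copiesRunning can go negative, mirroring the -1 hang described above.
  def executorLost(): Unit = lock.synchronized { copiesRunning -= 1 }
}
{code}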
[jira] [Created] (SPARK-37271) Spark OOM issue
M Shadab created SPARK-37271:
--------------------------------

             Summary: Spark OOM issue
                 Key: SPARK-37271
                 URL: https://issues.apache.org/jira/browse/SPARK-37271
             Project: Spark
          Issue Type: Bug
          Components: Spark Submit
    Affects Versions: 3.1.0
            Reporter: M Shadab
[jira] [Updated] (SPARK-37271) Spark OOM issue
[ https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

M Shadab updated SPARK-37271:
-----------------------------
    Shepherd: M Shadab

> Spark OOM issue
> ---------------
>
>                 Key: SPARK-37271
>                 URL: https://issues.apache.org/jira/browse/SPARK-37271
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 3.1.0
>            Reporter: M Shadab
>            Priority: Critical
[jira] [Commented] (SPARK-37271) Spark OOM issue
[ https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441805#comment-17441805 ]

M Shadab commented on SPARK-37271:
----------------------------------

Memory increased for the container

> Spark OOM issue
> ---------------
>
>                 Key: SPARK-37271
>                 URL: https://issues.apache.org/jira/browse/SPARK-37271
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 3.1.0
>            Reporter: M Shadab
>            Priority: Critical
[jira] [Resolved] (SPARK-37271) Spark OOM issue
[ https://issues.apache.org/jira/browse/SPARK-37271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

M Shadab resolved SPARK-37271.
------------------------------
    Resolution: Fixed

done

> Spark OOM issue
> ---------------
>
>                 Key: SPARK-37271
>                 URL: https://issues.apache.org/jira/browse/SPARK-37271
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 3.1.0
>            Reporter: M Shadab
>            Priority: Critical
[jira] [Resolved] (SPARK-37265) Support Java 17 in `dev/test-dependencies.sh`
[ https://issues.apache.org/jira/browse/SPARK-37265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-37265.
-----------------------------------
    Resolution: Invalid

Let me close this as Invalid.

> Support Java 17 in `dev/test-dependencies.sh`
> ----------------------------------------------
>
>                 Key: SPARK-37265
>                 URL: https://issues.apache.org/jira/browse/SPARK-37265
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Tests
>    Affects Versions: 3.3.0
>            Reporter: Kousuke Saruta
>            Priority: Major
[jira] [Resolved] (SPARK-35557) Adapt uses of JDK 17 Internal APIs
[ https://issues.apache.org/jira/browse/SPARK-35557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-35557.
-----------------------------------
    Resolution: Duplicate

This is superseded by SPARK-36796 via adding `--add-opens` options.

> Adapt uses of JDK 17 Internal APIs
> ----------------------------------
>
>                 Key: SPARK-35557
>                 URL: https://issues.apache.org/jira/browse/SPARK-35557
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 3.2.0
>            Reporter: Ismaël Mejía
>            Priority: Major
>
> I tried to run a Spark pipeline using the most recent 3.2.0-SNAPSHOT with Scala 2.12.4 on Java 17 and I found this exception:
> {code:java}
> java.lang.ExceptionInInitializerError
>   at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:54)
>   at org.apache.spark.internal.config.package$.<clinit>(package.scala:1149)
>   at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala:654)
>   at org.apache.spark.SparkConf.contains(SparkConf.scala:455)
>   ...
> Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module @110df513
>   at java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:357)
>   at java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
>   at java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
>   at java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
>   at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:56)
>   at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:54)
>   at org.apache.spark.internal.config.package$.<clinit>(package.scala:1149)
>   at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala:654)
>   at org.apache.spark.SparkConf.contains(SparkConf.scala:455)
> {code}
> It seems that Java 17 will be more strict about uses of JDK internals: [https://openjdk.java.net/jeps/403]
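A minimal sketch reproducing the encapsulation failure quoted above (hedged: assumes JDK 17 without --add-opens java.base/java.nio=ALL-UNNAMED, and mirrors the reflective access done in Platform.java):
{code:java}
// Reflective access to the private DirectByteBuffer(long, int) constructor,
// as org.apache.spark.unsafe.Platform does during class initialization.
val ctor = Class.forName("java.nio.DirectByteBuffer")
  .getDeclaredConstructor(classOf[Long], classOf[Int])

// On JDK 17 this throws InaccessibleObjectException unless java.nio is opened
// to the unnamed module, which is what the --add-opens options address.
ctor.setAccessible(true)
{code}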
[jira] [Comment Edited] (SPARK-33502) Large number of SELECT columns causes StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-33502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236434#comment-17236434 ]

Arwin S Tio edited comment on SPARK-33502 at 11/10/21, 7:22 PM:
----------------------------------------------------------------

Note, running my program with "-Xss3072k" fixed it. Giving Spark a bigger stack lets you hold more columns in memory.

was (Author: cozos):
Note, running my program with "-Xss3072k" fixed it

> Large number of SELECT columns causes StackOverflowError
> ---------------------------------------------------------
>
>                 Key: SPARK-33502
>                 URL: https://issues.apache.org/jira/browse/SPARK-33502
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.7
>            Reporter: Arwin S Tio
>            Priority: Minor
>
> On Spark 2.4.7 Standalone Mode on my laptop (Macbook Pro 2015), I ran the following:
> {code:java}
> public class TestSparkStackOverflow {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession
>       .builder()
>       .config("spark.master", "local[8]")
>       .appName(TestSparkStackOverflow.class.getSimpleName())
>       .getOrCreate();
>
>     StructType inputSchema = new StructType();
>     inputSchema = inputSchema.add("foo", DataTypes.StringType);
>
>     Dataset<Row> inputDf = spark.createDataFrame(
>       Arrays.asList(
>         RowFactory.create("1"),
>         RowFactory.create("2"),
>         RowFactory.create("3")
>       ),
>       inputSchema
>     );
>
>     List<Column> lotsOfColumns = new ArrayList<>();
>     for (int i = 0; i < 3000; i++) {
>       lotsOfColumns.add(lit("").as("field" + i).cast(DataTypes.StringType));
>     }
>     lotsOfColumns.add(new Column("foo"));
>
>     inputDf
>       .select(JavaConverters.collectionAsScalaIterableConverter(lotsOfColumns).asScala().toSeq())
>       .write()
>       .format("csv")
>       .mode(SaveMode.Append)
>       .save("file:///tmp/testoutput");
>   }
> }
> {code}
> And I get a StackOverflowError:
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted.
>   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
>   at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
>   at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
>   at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
>   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
>   at udp.task.TestSparkStackOverflow.main(TestSparkStackOverflow.java:52)
> Caused by: java.lang.StackOverflowError
>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1522)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.j
[jira] [Created] (SPARK-37272) Add ExtendedRocksDBTest
Dongjoon Hyun created SPARK-37272:
-------------------------------------

             Summary: Add ExtendedRocksDBTest
                 Key: SPARK-37272
                 URL: https://issues.apache.org/jira/browse/SPARK-37272
             Project: Spark
          Issue Type: Improvement
          Components: SQL, Tests
    Affects Versions: 3.3.0
            Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-37273) Hidden File Metadata Support for Spark SQL
Yaohua Zhao created SPARK-37273:
-----------------------------------

             Summary: Hidden File Metadata Support for Spark SQL
                 Key: SPARK-37273
                 URL: https://issues.apache.org/jira/browse/SPARK-37273
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Yaohua Zhao


Provide a new interface in Spark SQL that allows users to query the metadata of the input files for all file formats, and expose it as *built-in hidden columns*, meaning *users can only see them when they explicitly reference them* (e.g. file path, file name).
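Today the closest built-in is the {{input_file_name()}} function; a hedged sketch of it for contrast with the proposed hidden columns (the path {{/tmp/data}} is a placeholder):
{code:java}
import org.apache.spark.sql.functions.input_file_name

// Existing function: exposes only the file path, and only when selected
// explicitly. The proposal generalizes this idea into hidden metadata
// columns (e.g. path, name) that stay invisible unless referenced.
spark.read.parquet("/tmp/data")
  .select(input_file_name().as("file_path"))
  .show(truncate = false)
{code}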
[jira] [Commented] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442020#comment-17442020 ]

Apache Spark commented on SPARK-37272:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34547

> Add ExtendedRocksDBTest
> ------------------------
>
>                 Key: SPARK-37272
>                 URL: https://issues.apache.org/jira/browse/SPARK-37272
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Tests
>    Affects Versions: 3.3.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37272:
------------------------------------

    Assignee:     (was: Apache Spark)

> Add ExtendedRocksDBTest
> ------------------------
>
>                 Key: SPARK-37272
>                 URL: https://issues.apache.org/jira/browse/SPARK-37272
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Tests
>    Affects Versions: 3.3.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37272:
------------------------------------

    Assignee: Apache Spark

> Add ExtendedRocksDBTest
> ------------------------
>
>                 Key: SPARK-37272
>                 URL: https://issues.apache.org/jira/browse/SPARK-37272
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, Tests
>    Affects Versions: 3.3.0
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Commented] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid
[ https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442044#comment-17442044 ]

Hyukjin Kwon commented on SPARK-37260:
--------------------------------------

Oh yeah, that's fixed via #34475. There are some more ongoing issues on the docs. I will fix them up, and then we could probably initiate Spark 3.2.1.

> PYSPARK Arrow 3.2.0 docs link invalid
> --------------------------------------
>
>                 Key: SPARK-37260
>                 URL: https://issues.apache.org/jira/browse/SPARK-37260
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 3.2.0
>            Reporter: Thomas Graves
>            Priority: Major
>
> [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html] links to:
> [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]
> which links to:
> [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]
> But that is an invalid link.
> I assume it is supposed to point to:
> https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html
[jira] [Resolved] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid
[ https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-37260.
----------------------------------
    Resolution: Fixed

> PYSPARK Arrow 3.2.0 docs link invalid
> --------------------------------------
>
>                 Key: SPARK-37260
>                 URL: https://issues.apache.org/jira/browse/SPARK-37260
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 3.2.0
>            Reporter: Thomas Graves
>            Priority: Major
>
> [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html] links to:
> [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]
> which links to:
> [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]
> But that is an invalid link.
> I assume it is supposed to point to:
> https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html
[jira] [Updated] (SPARK-37260) PYSPARK Arrow 3.2.0 docs link invalid
[ https://issues.apache.org/jira/browse/SPARK-37260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-37260:
---------------------------------
    Fix Version/s: 3.2.1

> PYSPARK Arrow 3.2.0 docs link invalid
> --------------------------------------
>
>                 Key: SPARK-37260
>                 URL: https://issues.apache.org/jira/browse/SPARK-37260
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation
>    Affects Versions: 3.2.0
>            Reporter: Thomas Graves
>            Priority: Major
>             Fix For: 3.2.1
>
> [http://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html] links to:
> [https://spark.apache.org/docs/latest/api/python/user_guide/arrow_pandas.html]
> which links to:
> [https://spark.apache.org/docs/latest/api/python/sql/arrow_pandas.rst]
> But that is an invalid link.
> I assume it is supposed to point to:
> https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html
[jira] [Assigned] (SPARK-37233) Inline type hints for files in python/pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-37233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-37233:
------------------------------------

    Assignee: dch nguyen

> Inline type hints for files in python/pyspark/mllib
> ----------------------------------------------------
>
>                 Key: SPARK-37233
>                 URL: https://issues.apache.org/jira/browse/SPARK-37233
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: dch nguyen
>            Assignee: dch nguyen
>            Priority: Major
[jira] [Commented] (SPARK-37254) 100% CPU usage on Spark Thrift Server.
[ https://issues.apache.org/jira/browse/SPARK-37254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442046#comment-17442046 ] Hyukjin Kwon commented on SPARK-37254: -- it would be much easier to investigate the issue if there are reproducible steps. > 100% CPU usage on Spark Thrift Server. > -- > > Key: SPARK-37254 > URL: https://issues.apache.org/jira/browse/SPARK-37254 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: ramakrishna chilaka >Priority: Major > > We are trying to use Spark Thrift Server as a distributed SQL query engine. > The queries work when the resident memory occupied by Spark Thrift Server > (as seen in htop) is comparatively less than the driver memory. The > same queries result in 100% CPU usage when the resident memory occupied by > Spark Thrift Server is greater than the configured driver memory, and they keep > running at 100% CPU usage. I am using incremental collect as false, as I need > faster responses for exploratory queries. I am trying to understand the > following points: > * Why isn't Spark Thrift Server releasing the memory back when there are no > queries? > * What is causing Spark Thrift Server to go to 100% CPU usage on all the > cores when its memory is greater than the driver memory > (by 10% usually), and why are queries just stuck? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37270) Incorrect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37270: - Labels: correctness (was: ) > Incorrect result of filter using isNull condition > > > Key: SPARK-37270 > URL: https://issues.apache.org/jira/browse/SPARK-37270 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Tomasz Kus >Priority: Major > Labels: correctness > > Simple code that reproduces this issue: > {code:java} > val frame = Seq((false, 1)).toDF("bool", "number") > frame > .checkpoint() > .withColumn("conditions", when(col("bool"), "I am not null")) > .filter(col("conditions").isNull) > .show(false){code} > Although the "conditions" column is null > {code:java} > +-----+------+----------+ > |bool |number|conditions| > +-----+------+----------+ > |false|1     |null      | > +-----+------+----------+{code} > an empty result is shown. > Execution plans: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#252) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Optimized Logical Plan == > LocalRelation , [bool#124, number#125, conditions#252] > == Physical Plan == > LocalTableScan , [bool#124, number#125, conditions#252] > {code} > After removing the checkpoint, the proper result is returned and the execution plans are > as follows: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#256) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Optimized Logical Plan == > LocalRelation [bool#124, number#125, conditions#256] > == Physical Plan == > LocalTableScan [bool#124, number#125, conditions#256] > {code} > It seems that the most important difference is LogicalRDD -> LocalRelation > The following workarounds retrieve the correct result: > 1) remove checkpoint > 2) add explicit .otherwise(null) to when > 3) add checkpoint() or cache() just before filter > 4) downgrade to Spark 3.1.2 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
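[Editor's note] A minimal PySpark rendering of workaround 2 from SPARK-37270 above (adding an explicit `.otherwise`). It assumes an active `spark` session; the checkpoint directory path is an arbitrary choice for illustration:
{code:python}
from pyspark.sql import functions as F

spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

frame = spark.createDataFrame([(False, 1)], ["bool", "number"])
(frame
    .checkpoint()
    # Workaround 2: spell out the otherwise-null branch explicitly so the
    # filter on isNull returns the row as expected.
    .withColumn("conditions",
                F.when(F.col("bool"), "I am not null").otherwise(F.lit(None)))
    .filter(F.col("conditions").isNull())
    .show(truncate=False))
{code}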
[jira] [Assigned] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-37272: - Assignee: Dongjoon Hyun > Add ExtendedRocksDBTest > --- > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37272: -- Summary: Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon (was: Add ExtendedRocksDBTest) > Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon > > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37272) Add ExtendedRocksDBTest
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-37272. --- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34547 [https://github.com/apache/spark/pull/34547] > Add ExtendedRocksDBTest > --- > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37272: -- Description: Javava 17 officially supports Apple Silicon - JEP 391: macOS/AArch64 Port - https://bugs.openjdk.java.net/browse/JDK-8251280 Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon natively. {code} /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable arm64 {code} Since RocksDBJNI still doesn't support Apple Silicon natively, the following failures occur on M1. {code} $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" ... [info] Run completed in 23 seconds, 281 milliseconds. [info] Total number of tests run: 32 [info] Suites: completed 2, aborted 2 [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 [info] *** 2 SUITES ABORTED *** [info] *** 10 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite [error] Error during tests: [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM {code} This issue aims to add ExtendedRocksDBTest to disable RocksDB tests selectively on Apple Silicon. > Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon > > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > Javava 17 officially supports Apple Silicon > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since RocksDBJNI still doesn't support Apple Silicon natively, the following > failures occur on M1. > {code} > $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" > ... > [info] Run completed in 23 seconds, 281 milliseconds. > [info] Total number of tests run: 32 > [info] Suites: completed 2, aborted 2 > [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 > [info] *** 2 SUITES ABORTED *** > [info] *** 10 TESTS FAILED *** > [error] Failed tests: > [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite > [error] Error during tests: > [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite > [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM > {code} > This issue aims to add ExtendedRocksDBTest to disable RocksDB tests selectively on > Apple Silicon. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37272: -- Parent: SPARK-33772 Issue Type: Sub-task (was: Improvement) > Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon > > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > Javava 17 officially supports Apple Silicon > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since RocksDBJNI still doesn't support Apple Silicon natively, the following > failures occur on M1. > {code} > $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" > ... > [info] Run completed in 23 seconds, 281 milliseconds. > [info] Total number of tests run: 32 > [info] Suites: completed 2, aborted 2 > [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 > [info] *** 2 SUITES ABORTED *** > [info] *** 10 TESTS FAILED *** > [error] Failed tests: > [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite > [error] Error during tests: > [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite > [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful > [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM > {code} > This issue aims to add ExtendedRocksDBTest to disable RocksDB tests selectively on > Apple Silicon. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37272) Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37272: -- Description: Java 17 officially supports Apple Silicon - JEP 391: macOS/AArch64 Port - https://bugs.openjdk.java.net/browse/JDK-8251280 Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon natively. {code} /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable arm64 {code} Since RocksDBJNI still doesn't support Apple Silicon natively, the following failures occur on M1. {code} $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" ... [info] Run completed in 23 seconds, 281 milliseconds. [info] Total number of tests run: 32 [info] Suites: completed 2, aborted 2 [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 [info] *** 2 SUITES ABORTED *** [info] *** 10 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite [error] Error during tests: [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM {code} This issue aims to add ExtendedRocksDBTest to disable RocksDB tests selectively on Apple Silicon. was: Javava 17 officially supports Apple Silicon - JEP 391: macOS/AArch64 Port - https://bugs.openjdk.java.net/browse/JDK-8251280 Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon natively. {code} /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable arm64 {code} Since RocksDBJNI still doesn't support Apple Silicon natively, the following failures occur on M1. {code} $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" ... [info] Run completed in 23 seconds, 281 milliseconds. [info] Total number of tests run: 32 [info] Suites: completed 2, aborted 2 [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 [info] *** 2 SUITES ABORTED *** [info] *** 10 TESTS FAILED *** [error] Failed tests: [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite [error] Error during tests: [error] org.apache.spark.sql.execution.streaming.state.RocksDBSuite [error] org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreSuite [error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful [error] Total time: 43 s, completed Nov 10, 2021 4:29:50 PM {code} This issue aims to add ExtendedRocksDBTest to disable RocksDB tests selectively on Apple Silicon. > Add `ExtendedRocksDBTest` and disable RocksDB tests on Apple Silicon > > > Key: SPARK-37272 > URL: https://issues.apache.org/jira/browse/SPARK-37272 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > > Java 17 officially supports Apple Silicon > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since RocksDBJNI still doesn't support Apple Silicon natively, the following > failures occur on M1. > {code} > $ build/sbt "sql/testOnly *RocksDB* *.StreamingSessionWindowSuite" > ... > [info] Run completed in 23 seconds, 281 milliseconds. > [info] Total number of tests run: 32 > [info] Suites: completed 2, aborted 2 > [info] Tests: succeeded 22, failed 10, canceled 0, ignored 0, pending 0 > [info] *** 2 SUITES ABORTED *** > [info] *** 10 TESTS FAILED *** > [error] Failed tests: > [error] org.apache.spark.sql.streaming.StreamingSessionWindowSuite > [error] > org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreIntegrationSuite > [error] Error during tests: > [err
[jira] [Closed] (SPARK-37109) Install Java 17 on all of the Jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-37109. - > Install Java 17 on all of the Jenkins workers > - > > Key: SPARK-37109 > URL: https://issues.apache.org/jira/browse/SPARK-37109 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17
[ https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-36900: - Assignee: Yang Jie > "SPARK-36464: size returns correct positive number even with over 2GB data" > will oom with JDK17 > > > Key: SPARK-36900 > URL: https://issues.apache.org/jira/browse/SPARK-36900 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > Execute > > {code:java} > build/mvn clean install -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite > {code} > with JDK 17, > {code:java} > ChunkedByteBufferOutputStreamSuite: > - empty output > - write a single byte > - write a single near boundary > - write a single at boundary > - single chunk output > - single chunk output at boundary size > - multiple chunk output > - multiple chunk output at boundary size > *** RUN ABORTED *** > java.lang.OutOfMemoryError: Java heap space > at java.base/java.lang.Integer.valueOf(Integer.java:1081) > at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75) > at java.base/java.io.OutputStream.write(OutputStream.java:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127) > at > org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown > Source) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37109) Install Java 17 on all of the Jenkins workers
[ https://issues.apache.org/jira/browse/SPARK-37109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37109: -- Parent: (was: SPARK-33772) Issue Type: Bug (was: Sub-task) > Install Java 17 on all of the Jenkins workers > - > > Key: SPARK-37109 > URL: https://issues.apache.org/jira/browse/SPARK-37109 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37264) Exclude hadoop-client-api transitive dependency from orc-core
[ https://issues.apache.org/jira/browse/SPARK-37264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-37264: -- Summary: Exclude hadoop-client-api transitive dependency from orc-core (was: [SPARK-37264][BUILD] Exclude hadoop-client-api transitive dependency from orc-core) > Exclude hadoop-client-api transitive dependency from orc-core > - > > Key: SPARK-37264 > URL: https://issues.apache.org/jira/browse/SPARK-37264 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0 > > > Like hadoop-common and hadoop-hdfs, this PR proposes to exclude > hadoop-client-api transitive dependency from orc-core. > Why are the changes needed? > Since Apache Hadoop 2.7 doesn't work on Java 17, Apache ORC has a dependency > on Hadoop 3.3.1. > This causes test-dependencies.sh failure on Java 17. As a result, > run-tests.py also fails. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37273) Hidden File Metadata Support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-37273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37273. -- Resolution: Duplicate > Hidden File Metadata Support for Spark SQL > -- > > Key: SPARK-37273 > URL: https://issues.apache.org/jira/browse/SPARK-37273 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yaohua Zhao >Priority: Major > > Provide a new interface in Spark SQL that allows users to query the metadata > of the input files for all file formats, exposing them as *built-in hidden > columns*, meaning *users can only see them when they explicitly reference > them* (e.g. file path, file name) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37273) Hidden File Metadata Support for Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-37273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442057#comment-17442057 ] Hyukjin Kwon commented on SPARK-37273: -- Don't we already have this in DSv2? e.g.) SPARK-31255 > Hidden File Metadata Support for Spark SQL > -- > > Key: SPARK-37273 > URL: https://issues.apache.org/jira/browse/SPARK-37273 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yaohua Zhao >Priority: Major > > Provide a new interface in Spark SQL that allows users to query the metadata > of the input files for all file formats, exposing them as *built-in hidden > columns*, meaning *users can only see them when they explicitly reference > them* (e.g. file path, file name) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
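[Editor's note] A rough sketch of the behavior SPARK-37273 proposes. This is not a confirmed API at the time of these messages; the `_metadata` column name and its fields are assumptions for illustration only:
{code:python}
# Hypothetical: hidden metadata columns stay out of SELECT * but can be
# pulled in by explicitly referencing them.
df = spark.read.format("parquet").load("/data/events")

df.select("*").show()                    # hidden columns do not appear here
df.select("_metadata.file_path",         # visible only when referenced
          "_metadata.file_name").show()  # (names assumed, per the proposal)
{code}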
[jira] [Resolved] (SPARK-37255) When Used with PyHive (by dropbox) query timeout doesn't result in propagation to the UI
[ https://issues.apache.org/jira/browse/SPARK-37255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-37255. -- Resolution: Invalid > When Used with PyHive (by dropbox) query timeout doesn't result in > propagation to the UI > > > Key: SPARK-37255 > URL: https://issues.apache.org/jira/browse/SPARK-37255 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: ramakrishna chilaka >Priority: Major > > When we run a large query and it is timed out and cancelled by Spark Thrift > Server, PyHive doesn't show that the query was cancelled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37255) When Used with PyHive (by dropbox) query timeout doesn't result in propagation to the UI
[ https://issues.apache.org/jira/browse/SPARK-37255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442058#comment-17442058 ] Hyukjin Kwon commented on SPARK-37255: -- That's very likely an issue in PyHive. > When Used with PyHive (by dropbox) query timeout doesn't result in > propagation to the UI > > > Key: SPARK-37255 > URL: https://issues.apache.org/jira/browse/SPARK-37255 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: ramakrishna chilaka >Priority: Major > > When we run a large query and it is timed out and cancelled by Spark Thrift > Server, PyHive doesn't show that the query was cancelled. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37274) These parameters should be of type long, not int
hao created SPARK-37274: --- Summary: These parameters should be of type long, not int Key: SPARK-37274 URL: https://issues.apache.org/jira/browse/SPARK-37274 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: hao These parameters [spark.sql.orc.columnarReaderBatchSize], [spark.sql.inMemoryColumnarStorage.batchSize], [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of type int. When the user sets the value to be greater than the maximum value of type int, an error will be thrown -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
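[Editor's note] A quick sketch of the failure mode being reported, assuming an active `spark` session; the path is arbitrary and the exact exception and message depend on the Spark version:
{code:python}
# These configs are parsed as int internally, so a value above 2**31 - 1
# cannot be parsed and the query fails instead of using the larger batch.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 2**31)
spark.read.parquet("/tmp/some_table").count()  # fails when the conf is resolved
{code}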
[jira] [Created] (SPARK-37275) Support ANSI intervals in PySpark
Hyukjin Kwon created SPARK-37275: Summary: Support ANSI intervals in PySpark Key: SPARK-37275 URL: https://issues.apache.org/jira/browse/SPARK-37275 Project: Spark Issue Type: Umbrella Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon This JIRA aims to implement ANSI interval types in PySpark: - https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala - https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/YearMonthIntervalType.scala -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37276) Support YearMonthIntervalType in Arrow
Hyukjin Kwon created SPARK-37276: Summary: Support YearMonthIntervalType in Arrow Key: SPARK-37276 URL: https://issues.apache.org/jira/browse/SPARK-37276 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37277) Support DayTimeIntervalType in Arrow
Hyukjin Kwon created SPARK-37277: Summary: Support DayTimeIntervalType in Arrow Key: SPARK-37277 URL: https://issues.apache.org/jira/browse/SPARK-37277 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Implements the support of DayTimeIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs - createDataFrame/toPandas when Arrow is enabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
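[Editor's note] To make the two Arrow sub-tasks above (SPARK-37276 and SPARK-37277) concrete, a sketch of what they would enable once implemented. The DDL type string for the return type is an assumption, and the interval column is assumed to arrive as a pandas timedelta series; year-month intervals would follow the same shape:
{code:python}
import pandas as pd
from pyspark.sql.functions import pandas_udf

df = spark.sql("SELECT INTERVAL '1 02:03:04' DAY TO SECOND AS dt")

@pandas_udf("interval day to second")  # assumed DDL string for the type
def double_interval(s: pd.Series) -> pd.Series:
    # under the proposal, s is a pandas timedelta series
    return s * 2

df.select(double_interval("dt")).show()
df.toPandas()  # Arrow-enabled toPandas is part of the same code path
{code}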
[jira] [Created] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
Hyukjin Kwon created SPARK-37278: Summary: Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs Key: SPARK-37278 URL: https://issues.apache.org/jira/browse/SPARK-37278 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Implements the support of YearMonthIntervalType in Arrow code path: - Python UDFs - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37278: - Description: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas when Arrow is disabled was: Implements the support of YearMonthIntervalType in Arrow code path: - Python UDFs - createDataFrame/toPandas when Arrow is disabled > Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs > - > > Key: SPARK-37278 > URL: https://issues.apache.org/jira/browse/SPARK-37278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in: > - Python UDFs > - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37279) Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs
Hyukjin Kwon created SPARK-37279: Summary: Support DayTimeIntervalType in createDataFrame/toPandas and Python UDFs Key: SPARK-37279 URL: https://issues.apache.org/jira/browse/SPARK-37279 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon Implements the support of DayTimeIntervalType in: - Python UDFs - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
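[Editor's note] Likewise, a sketch of the non-Arrow path that SPARK-37278 and SPARK-37279 cover, under the assumption that day-time intervals surface as `datetime.timedelta` in plain Python UDFs (the exact mapping is part of what these sub-tasks define):
{code:python}
import datetime
from pyspark.sql.functions import col, udf

@udf("interval day to second")  # assumed DDL string for the return type
def add_hour(d):
    # plain (non-Arrow) Python UDF; d is expected as a datetime.timedelta
    return d + datetime.timedelta(hours=1)

df = spark.sql("SELECT INTERVAL '1 02:03:04' DAY TO SECOND AS dt")
df.select(add_hour(col("dt"))).show()
df.toPandas()  # with spark.sql.execution.arrow.pyspark.enabled=false
{code}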
[jira] [Created] (SPARK-37281) Support DayTimeIntervalType in Py4J
Hyukjin Kwon created SPARK-37281: Summary: Support DayTimeIntervalType in Py4J Key: SPARK-37281 URL: https://issues.apache.org/jira/browse/SPARK-37281 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37280) Support YearMonthIntervalType in Py4J
Hyukjin Kwon created SPARK-37280: Summary: Support YearMonthIntervalType in Py4J Key: SPARK-37280 URL: https://issues.apache.org/jira/browse/SPARK-37280 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 3.3.0 Reporter: Hyukjin Kwon This PR adds the support of YearMonthIntervalType in Py4J. For example, functions.lit(YearMonthIntervalType) should work. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37281) Support DayTimeIntervalType in Py4J
[ https://issues.apache.org/jira/browse/SPARK-37281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37281: - Description: This PR adds the support of DayTimeIntervalType in Py4J. For example, functions.lit(DayTimeIntervalType) should work. > Support DayTimeIntervalType in Py4J > --- > > Key: SPARK-37281 > URL: https://issues.apache.org/jira/browse/SPARK-37281 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This PR adds the support of DayTimeIntervalType in Py4J. For example, > functions.lit(DayTimeIntervalType) should work. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
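[Editor's note] A sketch of the Py4J-path behavior SPARK-37280/SPARK-37281 target, assuming (as the descriptions suggest) that a Python `datetime.timedelta` passed to `lit` becomes a day-time interval literal once supported:
{code:python}
import datetime
from pyspark.sql import functions as F

# Once supported, the timedelta literal should carry DayTimeIntervalType.
df = spark.range(1).select(
    F.lit(datetime.timedelta(days=1, hours=2)).alias("dt"))
df.printSchema()  # expected: dt of interval day to second
{code}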
[jira] [Commented] (SPARK-37275) Support ANSI intervals in PySpark
[ https://issues.apache.org/jira/browse/SPARK-37275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442066#comment-17442066 ] Hyukjin Kwon commented on SPARK-37275: -- cc [~maxgekk] FYI > Support ANSI intervals in PySpark > - > > Key: SPARK-37275 > URL: https://issues.apache.org/jira/browse/SPARK-37275 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This JIRA aims to implement ANSI interval types in PySpark: > - > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DayTimeIntervalType.scala > - > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types/YearMonthIntervalType.scala -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442068#comment-17442068 ] Hyukjin Kwon commented on SPARK-37278: -- I am working on this. > Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs > - > > Key: SPARK-37278 > URL: https://issues.apache.org/jira/browse/SPARK-37278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in: > - Python UDFs > - createDataFrame/toPandas when Arrow is disabled -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37270) Incorect result of filter using isNull condition
[ https://issues.apache.org/jira/browse/SPARK-37270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442074#comment-17442074 ] Hyukjin Kwon commented on SPARK-37270: -- Hm, I can't reproduce this locally. Are you able to reproduce this when running locally too? e.g.) {code} spark.sparkContext.setCheckpointDir("/tmp/checkpoints") val frame = Seq((false, 1)).toDF("bool", "number") frame .checkpoint() .withColumn("conditions", when(col("bool"), "I am not null")) .filter(col("conditions").isNull) .show(false) {code} > Incorrect result of filter using isNull condition > > > Key: SPARK-37270 > URL: https://issues.apache.org/jira/browse/SPARK-37270 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.2.0 >Reporter: Tomasz Kus >Priority: Major > Labels: correctness > > Simple code that reproduces this issue: > {code:java} > val frame = Seq((false, 1)).toDF("bool", "number") > frame > .checkpoint() > .withColumn("conditions", when(col("bool"), "I am not null")) > .filter(col("conditions").isNull) > .show(false){code} > Although the "conditions" column is null > {code:java} > +-----+------+----------+ > |bool |number|conditions| > +-----+------+----------+ > |false|1     |null      | > +-----+------+----------+{code} > an empty result is shown. > Execution plans: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#252) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#252] > +- LogicalRDD [bool#124, number#125], false > == Optimized Logical Plan == > LocalRelation , [bool#124, number#125, conditions#252] > == Physical Plan == > LocalTableScan , [bool#124, number#125, conditions#252] > {code} > After removing the checkpoint, the proper result is returned and the execution plans are > as follows: > {code:java} > == Parsed Logical Plan == > 'Filter isnull('conditions) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Analyzed Logical Plan == > bool: boolean, number: int, conditions: string > Filter isnull(conditions#256) > +- Project [bool#124, number#125, CASE WHEN bool#124 THEN I am not null END > AS conditions#256] > +- Project [_1#119 AS bool#124, _2#120 AS number#125] > +- LocalRelation [_1#119, _2#120] > == Optimized Logical Plan == > LocalRelation [bool#124, number#125, conditions#256] > == Physical Plan == > LocalTableScan [bool#124, number#125, conditions#256] > {code} > It seems that the most important difference is LogicalRDD -> LocalRelation > The following workarounds retrieve the correct result: > 1) remove checkpoint > 2) add explicit .otherwise(null) to when > 3) add checkpoint() or cache() just before filter > 4) downgrade to Spark 3.1.2 -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36799) Pass queryExecution name in CLI when only select query
[ https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36799. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34041 [https://github.com/apache/spark/pull/34041] > Pass queryExecution name in CLI when only select query > -- > > Key: SPARK-36799 > URL: https://issues.apache.org/jira/browse/SPARK-36799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > > Currently, in the spark-sql CLI, QueryExecutionListener can receive commands but > not select queries, because the queryExecution name is not passed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36799) Pass queryExecution name in CLI when only select query
[ https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36799: --- Assignee: dzcxzl > Pass queryExecution name in CLI when only select query > -- > > Key: SPARK-36799 > URL: https://issues.apache.org/jira/browse/SPARK-36799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > > Currently, in the spark-sql CLI, QueryExecutionListener can receive commands but > not select queries, because the queryExecution name is not passed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36799) Pass queryExecution name in CLI
[ https://issues.apache.org/jira/browse/SPARK-36799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dzcxzl updated SPARK-36799: --- Summary: Pass queryExecution name in CLI (was: Pass queryExecution name in CLI when only select query) > Pass queryExecution name in CLI > --- > > Key: SPARK-36799 > URL: https://issues.apache.org/jira/browse/SPARK-36799 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: dzcxzl >Assignee: dzcxzl >Priority: Trivial > Fix For: 3.3.0 > > > Currently, in the spark-sql CLI, QueryExecutionListener can receive commands but > not select queries, because the queryExecution name is not passed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36182) Support TimestampNTZ type in Parquet file source
[ https://issues.apache.org/jira/browse/SPARK-36182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36182. - Resolution: Fixed Issue resolved by pull request 34495 [https://github.com/apache/spark/pull/34495] > Support TimestampNTZ type in Parquet file source > > > Key: SPARK-36182 > URL: https://issues.apache.org/jira/browse/SPARK-36182 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > > As per > https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp, > Parquet supports both TIMESTAMP_NTZ and TIMESTAMP_LTZ (Spark's current > default timestamp type): > * A TIMESTAMP with isAdjustedToUTC=true => TIMESTAMP_LTZ > * A TIMESTAMP with isAdjustedToUTC=false => TIMESTAMP_NTZ > In Spark 3.1 or prior, the Parquet writer follows the definition and sets > the field `isAdjustedToUTC` as `true`, while the Parquet reader doesn’t > respect the `isAdjustedToUTC` flag and converts any Parquet timestamp type to > TIMESTAMP_LTZ. > Since 3.2, with the support of the timestamp without time zone type: > * Parquet writer follows the definition and sets the field `isAdjustedToUTC` > as `false` on writing TIMESTAMP_NTZ. > * Parquet reader > ** For schema inference, Spark converts the Parquet timestamp type to the > corresponding catalyst timestamp type according to the timestamp annotation > flag `isAdjustedToUTC`. > ** If merge schema is enabled in schema inference and some of the files are > inferred as TIMESTAMP_NTZ while the others are TIMESTAMP_LTZ, the result type > is TIMESTAMP_LTZ, which is considered the “wider” type. > ** If a column of a user-provided schema is TIMESTAMP_LTZ and the column was > written as TIMESTAMP_NTZ type, Spark allows the read operation. > ** If a column of a user-provided schema is TIMESTAMP_NTZ and the column was > written as TIMESTAMP_LTZ type, the read operation is not allowed since > TIMESTAMP_NTZ is considered narrower than TIMESTAMP_LTZ. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
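[Editor's note] A small sketch of the round trip described above; the `timestamp_ntz` DDL name and the output path are assumptions for illustration, and the exact syntax may differ by Spark version:
{code:python}
import datetime

# Write a TIMESTAMP_NTZ column; per the description, the Parquet file
# should record the timestamp annotation with isAdjustedToUTC=false.
df = spark.createDataFrame([(datetime.datetime(2021, 11, 11, 12, 0),)],
                           "ts timestamp_ntz")
df.write.mode("overwrite").parquet("/tmp/ntz_demo")

# Schema inference maps the annotation back to TIMESTAMP_NTZ on read.
spark.read.parquet("/tmp/ntz_demo").printSchema()
{code}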
[jira] [Assigned] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36073: --- Assignee: Peter Toth > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > > Currently `EquivalentExpressions` has 2 issues: > - identifying common expressions in conditional expressions is not correct in > all cases > - transparently canonicalized expressions (like `PromotePrecision`) are > considered common subexpressions -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36073) EquivalentExpressions fixes and improvements
[ https://issues.apache.org/jira/browse/SPARK-36073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36073. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33281 [https://github.com/apache/spark/pull/33281] > EquivalentExpressions fixes and improvements > > > Key: SPARK-36073 > URL: https://issues.apache.org/jira/browse/SPARK-36073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Peter Toth >Assignee: Peter Toth >Priority: Major > Fix For: 3.3.0 > > > Currently `EquivalentExpressions` has 2 issues: > - identifying common expressions in conditional expressions is not correct in > all cases > - transparently canonicalized expressions (like `PromotePrecision`) are > considered common subexpressions -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
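[Editor's note] For context on the first issue in SPARK-36073, an assumed example of its general shape (not taken from the PR): an expression repeated across conditional branches is only ever evaluated under its guard, so hoisting it as a common subexpression can change behavior:
{code:python}
# `1 / y` appears in two branches, but each occurrence is guarded by a
# condition that rules out y = 0. If subexpression elimination hoists
# `1 / y` and evaluates it unconditionally, the y = 0 row can hit a
# division error under ANSI mode that the original query never implied.
spark.sql("""
    SELECT CASE WHEN y > 1 THEN 1 / y
                WHEN y > 0 THEN (1 / y) + 1
                ELSE 0 END AS r
    FROM VALUES (0), (2) AS t(y)
""").show()
{code}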
[jira] [Created] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
Dongjoon Hyun created SPARK-37282: - Summary: Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon Key: SPARK-37282 URL: https://issues.apache.org/jira/browse/SPARK-37282 Project: Spark Issue Type: Sub-task Components: Spark Core, Tests Affects Versions: 3.3.0 Reporter: Dongjoon Hyun Java 17 officially supports Apple Silicon. - JEP 391: macOS/AArch64 Port - https://bugs.openjdk.java.net/browse/JDK-8251280 Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon natively. {code} /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable arm64 {code} Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442084#comment-17442084 ] Apache Spark commented on SPARK-37282: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/34548 > Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon > -- > > Key: SPARK-37282 > URL: https://issues.apache.org/jira/browse/SPARK-37282 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Java 17 officially supports Apple Silicon. > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases > fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37282: Assignee: Apache Spark > Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon > -- > > Key: SPARK-37282 > URL: https://issues.apache.org/jira/browse/SPARK-37282 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > Java 17 officially supports Apple Silicon. > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases > fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37282) Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon
[ https://issues.apache.org/jira/browse/SPARK-37282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37282: Assignee: (was: Apache Spark) > Add ExtendedLevelDBTest and disable LevelDB tests on Apple Silicon > -- > > Key: SPARK-37282 > URL: https://issues.apache.org/jira/browse/SPARK-37282 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Tests >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Java 17 officially supports Apple Silicon. > - JEP 391: macOS/AArch64 Port > - https://bugs.openjdk.java.net/browse/JDK-8251280 > Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 support Apple Silicon > natively. > {code} > /Users/dongjoon/.jenv/versions/oracle17/bin/java: Mach-O 64-bit executable > arm64 > /Users/dongjoon/.jenv/versions/zulu17/bin/java: Mach-O 64-bit executable arm64 > /Users/dongjoon/.jenv/versions/temurin17/bin/java: Mach-O 64-bit executable > arm64 > {code} > Since LevelDBJNI still doesn't support Apple Silicon natively, the test cases > fail on M1. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37278: - Description: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas was: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas when Arrow is disabled > Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs > - > > Key: SPARK-37278 > URL: https://issues.apache.org/jira/browse/SPARK-37278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in: > - Python UDFs > - createDataFrame/toPandas -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37276) Support YearMonthIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37276: - Description: Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs was: Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs - createDataFrame/toPandas when Arrow is enabled > Support YearMonthIntervalType in Arrow > -- > > Key: SPARK-37276 > URL: https://issues.apache.org/jira/browse/SPARK-37276 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37276) Support YearMonthIntervalType in Arrow
[ https://issues.apache.org/jira/browse/SPARK-37276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37276: - Description: Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs - createDataFrame/toPandas w/ Arrow was: Implements the support of YearMonthIntervalType in Arrow code path: - pandas UDFs - pandas functions APIs > Support YearMonthIntervalType in Arrow > -- > > Key: SPARK-37276 > URL: https://issues.apache.org/jira/browse/SPARK-37276 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in Arrow code path: > - pandas UDFs > - pandas functions APIs > - createDataFrame/toPandas w/ Arrow -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37278) Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-37278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37278: - Description: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas without Arrow was: Implements the support of YearMonthIntervalType in: - Python UDFs - createDataFrame/toPandas > Support YearMonthIntervalType in createDataFrame/toPandas and Python UDFs > - > > Key: SPARK-37278 > URL: https://issues.apache.org/jira/browse/SPARK-37278 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Implements the support of YearMonthIntervalType in: > - Python UDFs > - createDataFrame/toPandas without Arrow -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37263) Create an option to silence advice for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Summary: Create an option to silence advice for pandas API on Spark. (was: Reduce pandas-on-Spark warning for internal usage.) > Create an option to silence advice for pandas API on Spark. > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > https://github.com/apache/spark/pull/34389#discussion_r741733023. > The advice warning for pandas API on Spark for expensive APIs > (https://github.com/apache/spark/pull/34389#discussion_r741733023) > is now issuing too many warning messages, since it also issues the warning when > the APIs are used internally. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37263) Add an option to silence advice for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Summary: Add an option to silence advice for pandas API on Spark. (was: Create an option to silence advice for pandas API on Spark.) > Add an option to silence advice for pandas API on Spark. > > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > https://github.com/apache/spark/pull/34389#discussion_r741733023. > The advice warning for pandas API on Spark for expensive APIs > (https://github.com/apache/spark/pull/34389#discussion_r741733023) > is now issued too often, since it is also issued when > the APIs are used internally. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37263) Add an option to silence advice for pandas API on Spark.
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Description: Raised from comment https://github.com/apache/spark/pull/34389#discussion_r741733023. The advice warning for pandas API on Spark for expensive APIs (https://github.com/apache/spark/pull/34389#discussion_r741733023) is now issued too often, so it might be good to have an option to turn this message on/off. was: Raised from comment https://github.com/apache/spark/pull/34389#discussion_r741733023. The advice warning for pandas API on Spark for expensive APIs (https://github.com/apache/spark/pull/34389#discussion_r741733023) is now issued too often, since it is also issued when the APIs are used internally. > Add an option to silence advice for pandas API on Spark. > > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > https://github.com/apache/spark/pull/34389#discussion_r741733023. > The advice warning for pandas API on Spark for expensive APIs > (https://github.com/apache/spark/pull/34389#discussion_r741733023) > is now issued too often, so it might be good to have an option to > turn this message on/off. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37274) These parameters should be of type long, not int
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37274: Assignee: (was: Apache Spark) > These parameters should be of type long, not int > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Priority: Major > > The parameters [spark.sql.orc.columnarReaderBatchSize], > [spark.sql.inMemoryColumnarStorage.batchSize], and > [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of > type int. When the user sets a value greater than the maximum value > of type int, an error will be thrown. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37274) These parameters should be of type long, not int
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37274: Assignee: Apache Spark > These parameters should be of type long, not int > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Assignee: Apache Spark >Priority: Major > > The parameters [spark.sql.orc.columnarReaderBatchSize], > [spark.sql.inMemoryColumnarStorage.batchSize], and > [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of > type int. When the user sets a value greater than the maximum value > of type int, an error will be thrown. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37274) These parameters should be of type long, not int
[ https://issues.apache.org/jira/browse/SPARK-37274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442102#comment-17442102 ] Apache Spark commented on SPARK-37274: -- User 'dh20' has created a pull request for this issue: https://github.com/apache/spark/pull/34549 > These parameters should be of type long, not int > > > Key: SPARK-37274 > URL: https://issues.apache.org/jira/browse/SPARK-37274 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: hao >Priority: Major > > The parameters [spark.sql.orc.columnarReaderBatchSize], > [spark.sql.inMemoryColumnarStorage.batchSize], and > [spark.sql.parquet.columnarReaderBatchSize] should be of type long, not of > type int. When the user sets a value greater than the maximum value > of type int, an error will be thrown. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
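To make the reported limitation concrete, a hedged sketch (the path is a placeholder, and the exact exception depends on the Spark version, since the value is parsed by the int-typed conf machinery):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Int.MaxValue is 2147483647; one above it cannot be represented as an int.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", 2147483648)

# A subsequent Parquet scan reads the conf as an int and throws,
# which is the behavior the issue asks to lift to long.
spark.read.parquet("/tmp/example_parquet").show()
{code}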
[jira] [Created] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
Kousuke Saruta created SPARK-37283: -- Summary: Don't try to store a V1 table which contains ANSI intervals in Hive compatible format Key: SPARK-37283 URL: https://issues.apache.org/jira/browse/SPARK-37283 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta If a table being created contains a column of an ANSI interval type and the underlying file format has a corresponding Hive SerDe (e.g. Parquet), `HiveExternalCatalog` tries to store the table in a Hive compatible format. But, as ANSI interval types in Spark are not compatible with Hive's interval types (Hive only supports interval_year_month and interval_day_time), the following warning with a stack trace will be logged. {code} spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format. org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval year to month' but 'interval year to month' is found. at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) at org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) at org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$
[jira] [Updated] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-37283: --- Description: If a table being created contains a column of an ANSI interval type and the underlying file format has a corresponding Hive SerDe (e.g. Parquet), `HiveExternalCatalog` tries to store the table in a Hive compatible format. But, as ANSI interval types in Spark are not compatible with Hive's interval types (Hive only supports interval_year_month and interval_day_time), the following warning with a stack trace will be logged. {code} spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format. org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval year to month' but 'interval year to month' is found. at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) at org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) at org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWith
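For reference, a short PySpark sketch of the scenario in the description (table name illustrative; requires a Hive-enabled build). Before the proposed change, the CREATE statement logs the HiveExternalCatalog warning and stack trace quoted above; after it, the table should be persisted directly in the Spark SQL specific format:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Reproduces the CREATE TABLE from the description.
spark.sql("CREATE TABLE tbl_ym(a INTERVAL YEAR TO MONTH) USING parquet")

# The table metadata shows the Spark-specific provider either way;
# only the warning/stack trace on creation should disappear.
spark.sql("DESCRIBE TABLE EXTENDED tbl_ym").show(truncate=False)
{code}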
[jira] [Updated] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Summary: Add PandasAPIOnSparkAdviceWarning class (was: Add an option to silence advice for pandas API on Spark.) > Add PandasAPIOnSparkAdviceWarning class > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > https://github.com/apache/spark/pull/34389#discussion_r741733023. > The advice warning for pandas API on Spark for expensive APIs > (https://github.com/apache/spark/pull/34389#discussion_r741733023) > is now issued too often, so it might be good to have an option to > turn this message on/off. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-37263: Description: Raised from comment https://github.com/apache/spark/pull/34389#discussion_r741733023. The advice warning for pandas API on Spark for expensive APIs (https://github.com/apache/spark/pull/34389#discussion_r741733023) is now issued too often, so it might be good to have a pandas-on-Spark specific warning class so that users can manually turn it off by using warnings.simplefilter. was: Raised from comment https://github.com/apache/spark/pull/34389#discussion_r741733023. The advice warning for pandas API on Spark for expensive APIs (https://github.com/apache/spark/pull/34389#discussion_r741733023) is now issued too often, so it might be good to have an option to turn this message on/off. > Add PandasAPIOnSparkAdviceWarning class > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > https://github.com/apache/spark/pull/34389#discussion_r741733023. > The advice warning for pandas API on Spark for expensive APIs > (https://github.com/apache/spark/pull/34389#discussion_r741733023) > is now issued too often, so it might be good to have > a pandas-on-Spark specific warning class so that users can manually turn it off > by using warnings.simplefilter. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
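A minimal sketch of the intended usage, assuming the warning class lands under pyspark.pandas as the title suggests (the import path below is a guess until the patch merges):

{code:python}
import warnings

# Hypothetical import path; the final location may differ.
from pyspark.pandas.exceptions import PandasAPIOnSparkAdviceWarning

# Silence only the pandas-on-Spark advice warnings; other warnings
# keep their default behavior.
warnings.simplefilter("ignore", category=PandasAPIOnSparkAdviceWarning)
{code}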
[jira] [Commented] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17442111#comment-17442111 ] Apache Spark commented on SPARK-37283: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/34551 > Don't try to store a V1 table which contains ANSI intervals in Hive > compatible format > - > > Key: SPARK-37283 > URL: https://issues.apache.org/jira/browse/SPARK-37283 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > If a table being created contains a column of an ANSI interval type and the > underlying file format has a corresponding Hive SerDe (e.g. Parquet), > `HiveExternalCatalog` tries to store the table in a Hive compatible format. > But, as ANSI interval types in Spark are not compatible with Hive's interval > types (Hive only supports interval_year_month and interval_day_time), > the following warning with a stack trace will be logged. > {code} > spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; > 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist > `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore > in Spark SQL specific format. > org.apache.hadoop.hive.ql.metadata.HiveException: > java.lang.IllegalArgumentException: Error: type expected at the position 0 of > 'interval year to month' but 'interval year to month' is found. > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) > at > org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands
[jira] [Assigned] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37283: Assignee: Apache Spark (was: Kousuke Saruta) > Don't try to store a V1 table which contains ANSI intervals in Hive > compatible format > - > > Key: SPARK-37283 > URL: https://issues.apache.org/jira/browse/SPARK-37283 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > If a table being created contains a column of an ANSI interval type and the > underlying file format has a corresponding Hive SerDe (e.g. Parquet), > `HiveExternalCatalog` tries to store the table in a Hive compatible format. > But, as ANSI interval types in Spark are not compatible with Hive's interval > types (Hive only supports interval_year_month and interval_day_time), > the following warning with a stack trace will be logged. > {code} > spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; > 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist > `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore > in Spark SQL specific format. > org.apache.hadoop.hive.ql.metadata.HiveException: > java.lang.IllegalArgumentException: Error: type expected at the position 0 of > 'interval year to month' but 'interval year to month' is found. > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) > at > org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anon
[jira] [Assigned] (SPARK-37263) Add PandasAPIOnSparkAdviceWarning class
[ https://issues.apache.org/jira/browse/SPARK-37263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37263: Assignee: (was: Apache Spark) > Add PandasAPIOnSparkAdviceWarning class > --- > > Key: SPARK-37263 > URL: https://issues.apache.org/jira/browse/SPARK-37263 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > Raised from comment > https://github.com/apache/spark/pull/34389#discussion_r741733023. > The advice warning for pandas API on Spark for expensive APIs > (https://github.com/apache/spark/pull/34389#discussion_r741733023) > is now issued too often, so it might be good to have > a pandas-on-Spark specific warning class so that users can manually turn it off > by using warnings.simplefilter. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37283) Don't try to store a V1 table which contains ANSI intervals in Hive compatible format
[ https://issues.apache.org/jira/browse/SPARK-37283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37283: Assignee: Kousuke Saruta (was: Apache Spark) > Don't try to store a V1 table which contains ANSI intervals in Hive > compatible format > - > > Key: SPARK-37283 > URL: https://issues.apache.org/jira/browse/SPARK-37283 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > If a table being created contains a column of an ANSI interval type and the > underlying file format has a corresponding Hive SerDe (e.g. Parquet), > `HiveExternalCatalog` tries to store the table in a Hive compatible format. > But, as ANSI interval types in Spark are not compatible with Hive's interval > types (Hive only supports interval_year_month and interval_day_time), > the following warning with a stack trace will be logged. > {code} > spark-sql> CREATE TABLE tbl1(a INTERVAL YEAR TO MONTH) USING Parquet; > 21/11/11 14:39:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, > since hive.security.authorization.manager is set to instance of > HiveAuthorizerFactory. > 21/11/11 14:39:29 WARN HiveExternalCatalog: Could not persist > `default`.`tbl1` in a Hive compatible way. Persisting it into Hive metastore > in Spark SQL specific format. > org.apache.hadoop.hive.ql.metadata.HiveException: > java.lang.IllegalArgumentException: Error: type expected at the position 0 of > 'interval year to month' but 'interval year to month' is found. > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) > at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:553) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:551) > at > org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:499) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:397) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:102) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:376) > at > org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:120) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$an