[jira] [Resolved] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-37022.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34297
[https://github.com/apache/spark/pull/34297]

> Use black as a formatter for the whole PySpark codebase.
> 
>
> Key: SPARK-37022
> URL: https://issues.apache.org/jira/browse/SPARK-37022
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: black-diff-stats.txt, pyproject.toml
>
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. 
> It is used by a number of projects, both small and large, including prominent 
> ones such as pandas, scikit-learn, Django, and SQLAlchemy. Black is already used 
> to format {{pyspark.pandas}} and (though not enforced) stub files.
> We should consider using black to enforce formatting of all PySpark files. 
> There are multiple reasons to do that:
>  - Consistency: black is already used across the existing codebase, and 
> black-formatted chunks of code are already being added to modules other than 
> pyspark.pandas as a result of type hints inlining (SPARK-36845).
>  - Lower cost of contributing and reviewing: formatting can be automatically 
> enforced and applied.
>  - Simpler reviews: in general, black-formatted code produces small and 
> highly readable diffs.
>  - Reduced effort to maintain patched forks: smaller diffs and 
> predictable formatting.
> Risks:
>  - Initial reformatting requires quite significant changes.
>  - Applying black will break blame in the GitHub UI (for git in general, see 
> [Avoiding ruining git 
> blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
>  - To simplify backporting, black will have to be applied to all active 
> branches.
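
The point about small diffs can be illustrated with a minimal sketch. It compares two hand-written snippets: one in a hand-aligned style, and one that mimics the exploded layout with a trailing comma that black keeps stable. Neither snippet was produced by running black, and `changed_lines` and `compute` are hypothetical names used only for this illustration.

```python
import difflib


def changed_lines(before: str, after: str) -> int:
    """Count the added and removed lines in a unified diff of two snippets."""
    diff = list(
        difflib.unified_diff(
            before.splitlines(), after.splitlines(), lineterm="", n=0
        )
    )
    body = diff[2:]  # drop the '---' / '+++' file headers
    return sum(1 for line in body if line.startswith(("+", "-")))


# Hand-aligned style: the closing parenthesis sits on the last argument line.
aligned_before = (
    "result = compute(first_argument,\n"
    "                 second_argument)"
)
aligned_after = (
    "result = compute(first_argument,\n"
    "                 second_argument,\n"
    "                 third_argument)"
)

# Black-like style: one argument per line, each followed by a comma.
black_before = (
    "result = compute(\n"
    "    first_argument,\n"
    "    second_argument,\n"
    ")"
)
black_after = (
    "result = compute(\n"
    "    first_argument,\n"
    "    second_argument,\n"
    "    third_argument,\n"
    ")"
)

print(changed_lines(aligned_before, aligned_after))  # 3
print(changed_lines(black_before, black_after))  # 1
```

Adding one argument touches three lines in the aligned style (one removed, two added) but only one line in the black-like layout, which is the kind of difference that makes reviews and backports cheaper.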



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37022) Use black as a formatter for the whole PySpark codebase.

2021-11-10 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-37022:


Assignee: Maciej Szymkiewicz

> Use black as a formatter for the whole PySpark codebase.
> 
>
> Key: SPARK-37022
> URL: https://issues.apache.org/jira/browse/SPARK-37022
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Attachments: black-diff-stats.txt, pyproject.toml
>
>
> [{{black}}|https://github.com/psf/black] is a popular Python code formatter. 
> It is used by a number of projects, both small and large, including prominent 
> ones such as pandas, scikit-learn, Django, and SQLAlchemy. Black is already used 
> to format {{pyspark.pandas}} and (though not enforced) stub files.
> We should consider using black to enforce formatting of all PySpark files. 
> There are multiple reasons to do that:
>  - Consistency: black is already used across the existing codebase, and 
> black-formatted chunks of code are already being added to modules other than 
> pyspark.pandas as a result of type hints inlining (SPARK-36845).
>  - Lower cost of contributing and reviewing: formatting can be automatically 
> enforced and applied.
>  - Simpler reviews: in general, black-formatted code produces small and 
> highly readable diffs.
>  - Reduced effort to maintain patched forks: smaller diffs and 
> predictable formatting.
> Risks:
>  - Initial reformatting requires quite significant changes.
>  - Applying black will break blame in the GitHub UI (for git in general, see 
> [Avoiding ruining git 
> blame|https://black.readthedocs.io/en/stable/guides/introducing_black_to_your_project.html?highlight=blame#avoiding-ruining-git-blame]).
> Additional steps:
>  - To simplify backporting, black will have to be applied to all active 
> branches.






[jira] [Assigned] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37266:


Assignee: Apache Spark

> Optimize the analysis for view text of persistent view and fix security 
> vulnerabilities caused by sql tampering 
> 
>
> Key: SPARK-37266
> URL: https://issues.apache.org/jira/browse/SPARK-37266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> The current implementation of a persistent view creates a Hive table with the 
> view text.
> The view text is just a query string, so an attacker may tamper with it 
> through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
>  could be tampered with to become
>  
> {code:java}
> drop table tab1
> {code}
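
To make the risk concrete, here is a deliberately naive sketch of a guard that accepts stored view text only when it looks like a single query. This is an assumption made for illustration, not the approach taken in Spark or in the linked pull request: a real check would rely on the SQL parser and analyzer. `looks_like_query` and `_QUERY_KEYWORDS` are hypothetical names.

```python
import re

# Hypothetical allow-list of keywords that may start a plain query.
_QUERY_KEYWORDS = {"SELECT", "WITH"}


def looks_like_query(view_text: str) -> bool:
    """Return True if the text is one statement starting with a query keyword.

    Naive on purpose: a ';' inside a string literal is rejected as well, and
    statements are classified only by their first keyword.
    """
    text = view_text.strip().rstrip(";").strip()
    if ";" in text:  # reject stacked statements such as "select 1; drop table t"
        return False
    match = re.match(r"[A-Za-z]+", text)
    return bool(match) and match.group(0).upper() in _QUERY_KEYWORDS


print(looks_like_query("select * from tab1"))  # True
print(looks_like_query("drop table tab1"))  # False
```

With such a guard, the original view text from the example passes while the tampered `drop table tab1` text is rejected before it is executed.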






[jira] [Assigned] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37266:


Assignee: (was: Apache Spark)

> Optimize the analysis for view text of persistent view and fix security 
> vulnerabilities caused by sql tampering 
> 
>
> Key: SPARK-37266
> URL: https://issues.apache.org/jira/browse/SPARK-37266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of a persistent view creates a Hive table with the 
> view text.
> The view text is just a query string, so an attacker may tamper with it 
> through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
>  could be tampered with to become
>  
> {code:java}
> drop table tab1
> {code}






[jira] [Commented] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441580#comment-17441580
 ] 

Apache Spark commented on SPARK-37266:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/34543

> Optimize the analysis for view text of persistent view and fix security 
> vulnerabilities caused by sql tampering 
> 
>
> Key: SPARK-37266
> URL: https://issues.apache.org/jira/browse/SPARK-37266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of a persistent view creates a Hive table with the 
> view text.
> The view text is just a query string, so an attacker may tamper with it 
> through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
>  could be tampered with to become
>  
> {code:java}
> drop table tab1
> {code}






[jira] [Assigned] (SPARK-37267) OptimizeSkewInRebalancePartitions support optimize non-root node

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37267:


Assignee: Apache Spark

> OptimizeSkewInRebalancePartitions support optimize non-root node
> 
>
> Key: SPARK-37267
> URL: https://issues.apache.org/jira/browse/SPARK-37267
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> `OptimizeSkewInRebalancePartitions` is currently applied only if 
> `RebalancePartitions` is the root node, but sometimes we want a local sort 
> after `RebalancePartitions`, which can improve the compression ratio.
> Since SPARK-36184, it is easy to validate whether the rule introduces an 
> extra shuffle, and the output partitioning is ensured by 
> `AQEShuffleReadExec.outputPartitioning`.
> So it is safe to make `OptimizeSkewInRebalancePartitions` support optimizing 
> a non-root node.
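
As a toy illustration of the difference between a root-only rule and one that also matches non-root nodes, the sketch below models a plan as a chain of named nodes. This is not Spark's implementation: `Node`, `applies_root_only`, and `applies_anywhere` are hypothetical names, and real plans are trees rewritten by Catalyst rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A simplified plan node with at most one child."""

    name: str
    child: Optional["Node"] = None


def applies_root_only(plan: Node) -> bool:
    """Match only when the rebalance node is the root of the plan."""
    return plan.name == "RebalancePartitions"


def applies_anywhere(plan: Optional[Node]) -> bool:
    """Match when a rebalance node appears anywhere along the chain."""
    node = plan
    while node is not None:
        if node.name == "RebalancePartitions":
            return True
        node = node.child
    return False


# A local sort placed on top of the rebalance, as described in the issue.
plan = Node("Sort", Node("RebalancePartitions", Node("Scan")))

print(applies_root_only(plan))  # False
print(applies_anywhere(plan))  # True
```

In this model, a plan with a sort above the rebalance is skipped by the root-only check and picked up once non-root nodes are considered.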






[jira] [Assigned] (SPARK-37267) OptimizeSkewInRebalancePartitions support optimize non-root node

2021-11-10 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37267:


Assignee: (was: Apache Spark)

> OptimizeSkewInRebalancePartitions support optimize non-root node
> 
>
> Key: SPARK-37267
> URL: https://issues.apache.org/jira/browse/SPARK-37267
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> `OptimizeSkewInRebalancePartitions` is currently applied only if 
> `RebalancePartitions` is the root node, but sometimes we want a local sort 
> after `RebalancePartitions`, which can improve the compression ratio.
> Since SPARK-36184, it is easy to validate whether the rule introduces an 
> extra shuffle, and the output partitioning is ensured by 
> `AQEShuffleReadExec.outputPartitioning`.
> So it is safe to make `OptimizeSkewInRebalancePartitions` support optimizing 
> a non-root node.






[jira] [Commented] (SPARK-37267) OptimizeSkewInRebalancePartitions support optimize non-root node

2021-11-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441578#comment-17441578
 ] 

Apache Spark commented on SPARK-37267:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34542

> OptimizeSkewInRebalancePartitions support optimize non-root node
> 
>
> Key: SPARK-37267
> URL: https://issues.apache.org/jira/browse/SPARK-37267
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Major
>
> `OptimizeSkewInRebalancePartitions` is currently applied only if 
> `RebalancePartitions` is the root node, but sometimes we want a local sort 
> after `RebalancePartitions`, which can improve the compression ratio.
> Since SPARK-36184, it is easy to validate whether the rule introduces an 
> extra shuffle, and the output partitioning is ensured by 
> `AQEShuffleReadExec.outputPartitioning`.
> So it is safe to make `OptimizeSkewInRebalancePartitions` support optimizing 
> a non-root node.






[jira] [Created] (SPARK-37267) OptimizeSkewInRebalancePartitions support optimize non-root node

2021-11-10 Thread XiDuo You (Jira)
XiDuo You created SPARK-37267:
-

 Summary: OptimizeSkewInRebalancePartitions support optimize 
non-root node
 Key: SPARK-37267
 URL: https://issues.apache.org/jira/browse/SPARK-37267
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


`OptimizeSkewInRebalancePartitions` is currently applied only if 
`RebalancePartitions` is the root node, but sometimes we want a local sort 
after `RebalancePartitions`, which can improve the compression ratio.

Since SPARK-36184, it is easy to validate whether the rule introduces an extra 
shuffle, and the output partitioning is ensured by 
`AQEShuffleReadExec.outputPartitioning`.

So it is safe to make `OptimizeSkewInRebalancePartitions` support optimizing 
a non-root node.






[jira] [Updated] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering

2021-11-10 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37266:
---
Description: 
The current implementation of a persistent view creates a Hive table with the 
view text.
The view text is just a query string, so an attacker may tamper with it through 
various means.
For example:
{code:java}
select * from tab1
{code}
 could be tampered with to become
 
{code:java}
drop table tab1
{code}


  was:
The current implementation of persist view creates a Hive table with the view text.
The view text is just a query string, so an attacker may tamper with it through 
various means.
For example:
{code:java}
select * from tab1
{code}
 could be tampered with to become
 
{code:java}
drop table tab1
{code}



> Optimize the analysis for view text of persistent view and fix security 
> vulnerabilities caused by sql tampering 
> 
>
> Key: SPARK-37266
> URL: https://issues.apache.org/jira/browse/SPARK-37266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of a persistent view creates a Hive table with the 
> view text.
> The view text is just a query string, so an attacker may tamper with it 
> through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
>  could be tampered with to become
>  
> {code:java}
> drop table tab1
> {code}






[jira] [Updated] (SPARK-37266) Optimize the analysis for view text of persistent view and fix security vulnerabilities caused by sql tampering

2021-11-10 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-37266:
---
Summary: Optimize the analysis for view text of persistent view and fix 
security vulnerabilities caused by sql tampering   (was: Optimize the analysis 
for view text of persist view and fix security vulnerabilities caused by sql 
tampering )

> Optimize the analysis for view text of persistent view and fix security 
> vulnerabilities caused by sql tampering 
> 
>
> Key: SPARK-37266
> URL: https://issues.apache.org/jira/browse/SPARK-37266
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current implementation of persist view creates a Hive table with the view 
> text.
> The view text is just a query string, so an attacker may tamper with it 
> through various means.
> For example:
> {code:java}
> select * from tab1
> {code}
>  could be tampered with to become
>  
> {code:java}
> drop table tab1
> {code}






[jira] [Created] (SPARK-37266) Optimize the analysis for view text of persist view and fix security vulnerabilities caused by sql tampering

2021-11-10 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-37266:
--

 Summary: Optimize the analysis for view text of persist view and 
fix security vulnerabilities caused by sql tampering 
 Key: SPARK-37266
 URL: https://issues.apache.org/jira/browse/SPARK-37266
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: jiaan.geng


The current implementation of persist view creates a Hive table with the view text.
The view text is just a query string, so an attacker may tamper with it through 
various means.
For example:
{code:java}
select * from tab1
{code}
 could be tampered with to become
 
{code:java}
drop table tab1
{code}






