[jira] [Commented] (SPARK-38189) Add priority scheduling doc for Spark on K8S

2022-03-05 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501867#comment-17501867
 ] 

Yikun Jiang commented on SPARK-38189:
-------------------------------------

[~dongjoon] Thanks for the information. I re-created a JIRA, SPARK-38423, for 
"Support priority scheduling with volcano implementations."
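For doc context: priority scheduling on K8S goes through a custom scheduler such as Volcano. Below is a rough sketch of the wiring, assuming Spark 3.3+ with the Volcano integration; the PodGroup template path, and any PriorityClass referenced from it, are hypothetical illustrations, and in practice these configs are usually passed via spark-submit --conf.

{code:java}
import org.apache.spark.sql.SparkSession;

// Sketch only: the keys follow the Spark 3.3 Volcano integration. The
// PodGroup template file is where a priorityClassName would be declared;
// both the path and that class name are hypothetical.
public class VolcanoPrioritySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("volcano-priority-sketch")
        .config("spark.kubernetes.scheduler.name", "volcano")
        .config("spark.kubernetes.driver.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
        .config("spark.kubernetes.executor.pod.featureSteps",
            "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
        .config("spark.kubernetes.scheduler.volcano.podGroupTemplateFile",
            "/path/to/podgroup-template.yaml")
        .getOrCreate();
    spark.stop();
  }
}
{code}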

> Add priority scheduling doc for Spark on K8S
> --------------------------------------------
>
> Key: SPARK-38189
> URL: https://issues.apache.org/jira/browse/SPARK-38189
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
>







[jira] [Commented] (SPARK-38423) Support priority scheduling with volcano implementations

2022-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501866#comment-17501866
 ] 

Apache Spark commented on SPARK-38423:
--------------------------------------

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35639

> Support priority scheduling with volcano implementations
> --------------------------------------------------------
>
> Key: SPARK-38423
> URL: https://issues.apache.org/jira/browse/SPARK-38423
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Assigned] (SPARK-38423) Support priority scheduling with volcano implementations

2022-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38423:


Assignee: Apache Spark

> Support priority scheduling with volcano implementations
> --------------------------------------------------------
>
> Key: SPARK-38423
> URL: https://issues.apache.org/jira/browse/SPARK-38423
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-38423) Support priority scheduling with volcano implementations

2022-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501865#comment-17501865
 ] 

Apache Spark commented on SPARK-38423:
--------------------------------------

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35639

> Support priority scheduling with volcano implementations
> --------------------------------------------------------
>
> Key: SPARK-38423
> URL: https://issues.apache.org/jira/browse/SPARK-38423
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Assigned] (SPARK-38423) Support priority scheduling with volcano implementations

2022-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38423:


Assignee: (was: Apache Spark)

> Support priority scheduling with volcano implementations
> --------------------------------------------------------
>
> Key: SPARK-38423
> URL: https://issues.apache.org/jira/browse/SPARK-38423
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>







[jira] [Created] (SPARK-38423) Support priority scheduling with volcano implementations

2022-03-05 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-38423:
--------------------------------

 Summary: Support priority scheduling with volcano implementations
 Key: SPARK-38423
 URL: https://issues.apache.org/jira/browse/SPARK-38423
 Project: Spark
  Issue Type: Sub-task
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Yikun Jiang









[jira] [Resolved] (SPARK-38135) Introduce `spark.kubernetes.job` scheduling-related configurations

2022-03-05 Thread Yikun Jiang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yikun Jiang resolved SPARK-38135.
---------------------------------
Resolution: Invalid

> Introduce `spark.kubernetes.job` scheduling-related configurations
> ------------------------------------------------------------------
>
> Key: SPARK-38135
> URL: https://issues.apache.org/jira/browse/SPARK-38135
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>
> spark.kubernetes.job.minCPU: the minimum CPU resources required to run the job
> spark.kubernetes.job.minMemory: the minimum memory resources required to run the job
> spark.kubernetes.job.minMember: the minimum number of pods required to run the job
> spark.kubernetes.job.priorityClassName: the priority class of the running job
> spark.kubernetes.job.queue: the queue to which the running job belongs
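For illustration only: the keys above were a proposal and never shipped (the issue was resolved as Invalid). Had they been adopted, a job might have set them like this hypothetical sketch, where every key and value is illustrative:

{code:java}
import org.apache.spark.SparkConf;

// Hypothetical: no spark.kubernetes.job.* keys exist in Spark; this only
// restates the proposal quoted above in code form.
public class ProposedJobSchedulingConfSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .set("spark.kubernetes.job.minCPU", "4")               // minimum CPUs
        .set("spark.kubernetes.job.minMemory", "8g")           // minimum memory
        .set("spark.kubernetes.job.minMember", "3")            // minimum pods
        .set("spark.kubernetes.job.priorityClassName", "high-priority")
        .set("spark.kubernetes.job.queue", "default");
    System.out.println(conf.toDebugString());
  }
}
{code}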






[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when Spark SQL is running

2022-03-05 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we have to collect & update them manually using 
{code:java}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#ff}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update 
the corresponding table statistics using SQL metrics. In subsequent queries, 
the Spark SQL optimizer can then use these statistics.

It's a common case that we run daily batches using Spark SQL, so the same SQL 
can run every day while the SQL and its corresponding table data change slowly. 
That means we can use statistics updated yesterday to optimize today's SQL, and 
of course we can also adjust important configs such as 
spark.sql.shuffle.partitions.

So we'd better add a mechanism to store every stage's statistics somewhere and 
use them in new SQL queries, not just collect statistics after a stage finishes.

Of course, we'd better {color:#ff}add a version number to the statistics{color} 
in case they become stale.

 

https://docs.google.com/document/d/1L48Dovynboi_ARu-OqQNJCOQqeVUTutLu8fo-w_ZPPA/edit#

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we have to collect & update them manually using 
{code:java}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#ff}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update 
the corresponding table statistics using SQL metrics. In subsequent queries, 
the Spark SQL optimizer can then use these statistics.

It's a common case that we run daily batches using Spark SQL, so the same SQL 
can run every day while the SQL and its corresponding table data change slowly. 
That means we can use statistics updated yesterday to optimize today's SQL, and 
of course we can also adjust important configs such as 
spark.sql.shuffle.partitions.

So we'd better add a mechanism to store every stage's statistics somewhere and 
use them in new SQL queries, not just collect statistics after a stage finishes.

Of course, we'd better {color:#ff}add a version number to the statistics{color} 
in case they become stale.


> [proposal] collect & update statistics automatically when Spark SQL is running
> -------------------------------------------------------------------------------
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark 
> SQL optimizer; however, we have to collect & update them manually using 
> {code:java}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#ff}collect & update 
> statistics automatically{color} when a Spark stage runs and finishes?
> For example, when an insert overwrite table statement finishes, we can update 
> the corresponding table statistics using SQL metrics. In subsequent queries, 
> the Spark SQL optimizer can then use these statistics.
> It's a common case that we run daily batches using Spark SQL, so the same SQL 
> can run every day while the SQL and its corresponding table data change 
> slowly. That means we can use statistics updated yesterday to optimize 
> today's SQL, and of course we can also adjust important configs such as 
> spark.sql.shuffle.partitions.
> So we'd better add a mechanism to store every stage's statistics somewhere 
> and use them in new SQL queries, not just collect statistics after a stage 
> finishes.
> Of course, we'd better {color:#ff}add a version number to the 
> statistics{color} in case they become stale.
>  
> https://docs.google.com/document/d/1L48Dovynboi_ARu-OqQNJCOQqeVUTutLu8fo-w_ZPPA/edit#
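For context, the manual flow this proposal wants to automate looks roughly like the sketch below; the table name "salesdb.orders" is hypothetical, and the statements follow standard Spark SQL syntax:

{code:java}
import org.apache.spark.sql.SparkSession;

// Sketch of the manual statistics flow the proposal would automate;
// assumes the (hypothetical) table salesdb.orders already exists.
public class AnalyzeTableSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("stats-sketch").getOrCreate();
    // Collect table-level and column-level statistics by hand.
    spark.sql("ANALYZE TABLE salesdb.orders COMPUTE STATISTICS FOR ALL COLUMNS");
    // Later queries can then be optimized with these statistics; the
    // collected row count and size show up under "Statistics" here.
    spark.sql("DESCRIBE TABLE EXTENDED salesdb.orders").show(100, false);
    spark.stop();
  }
}
{code}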






[jira] [Commented] (SPARK-38258) [proposal] collect & update statistics automatically when Spark SQL is running

2022-03-05 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501864#comment-17501864
 ] 

gabrywu commented on SPARK-38258:
---------------------------------

[~yumwang] what do you think of it?

> [proposal] collect & update statistics automatically when Spark SQL is running
> -------------------------------------------------------------------------------
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark 
> SQL optimizer; however, we have to collect & update them manually using 
> {code:java}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#ff}collect & update 
> statistics automatically{color} when a Spark stage runs and finishes?
> For example, when an insert overwrite table statement finishes, we can update 
> the corresponding table statistics using SQL metrics. In subsequent queries, 
> the Spark SQL optimizer can then use these statistics.
> It's a common case that we run daily batches using Spark SQL, so the same SQL 
> can run every day while the SQL and its corresponding table data change 
> slowly. That means we can use statistics updated yesterday to optimize 
> today's SQL, and of course we can also adjust important configs such as 
> spark.sql.shuffle.partitions.
> So we'd better add a mechanism to store every stage's statistics somewhere 
> and use them in new SQL queries, not just collect statistics after a stage 
> finishes.
> Of course, we'd better {color:#ff}add a version number to the 
> statistics{color} in case they become stale.






[jira] [Updated] (SPARK-38258) [proposal] collect & update statistics automatically when Spark SQL is running

2022-03-05 Thread gabrywu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

gabrywu updated SPARK-38258:

Description: 
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we have to collect & update them manually using 
{code:java}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#ff}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update 
the corresponding table statistics using SQL metrics. In subsequent queries, 
the Spark SQL optimizer can then use these statistics.

It's a common case that we run daily batches using Spark SQL, so the same SQL 
can run every day while the SQL and its corresponding table data change slowly. 
That means we can use statistics updated yesterday to optimize today's SQL, and 
of course we can also adjust important configs such as 
spark.sql.shuffle.partitions.

So we'd better add a mechanism to store every stage's statistics somewhere and 
use them in new SQL queries, not just collect statistics after a stage finishes.

Of course, we'd better {color:#ff}add a version number to the statistics{color} 
in case they become stale.

  was:
As we all know, table & column statistics are very important to the Spark SQL 
optimizer; however, we have to collect & update them manually using 
{code:java}
analyze table tableName compute statistics{code}
It's a little inconvenient, so why can't we {color:#ff}collect & update 
statistics automatically{color} when a Spark stage runs and finishes?

For example, when an insert overwrite table statement finishes, we can update 
the corresponding table statistics using SQL metrics. In subsequent queries, 
the Spark SQL optimizer can then use these statistics.

It's a common case that we run daily batches using Spark SQL, so the same SQL 
can run every day while the SQL and its corresponding table data change slowly. 
That means we can use statistics updated yesterday to optimize current SQL 
queries.

So we'd better add a mechanism to store every stage's statistics somewhere and 
use them in new SQL queries, not just collect statistics after a stage finishes.

Of course, we'd better {color:#ff}add a version number to the statistics{color} 
in case they become stale.


> [proposal] collect & update statistics automatically when Spark SQL is running
> -------------------------------------------------------------------------------
>
> Key: SPARK-38258
> URL: https://issues.apache.org/jira/browse/SPARK-38258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core, SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: gabrywu
>Priority: Minor
>
> As we all know, table & column statistics are very important to the Spark 
> SQL optimizer; however, we have to collect & update them manually using 
> {code:java}
> analyze table tableName compute statistics{code}
> It's a little inconvenient, so why can't we {color:#ff}collect & update 
> statistics automatically{color} when a Spark stage runs and finishes?
> For example, when an insert overwrite table statement finishes, we can update 
> the corresponding table statistics using SQL metrics. In subsequent queries, 
> the Spark SQL optimizer can then use these statistics.
> It's a common case that we run daily batches using Spark SQL, so the same SQL 
> can run every day while the SQL and its corresponding table data change 
> slowly. That means we can use statistics updated yesterday to optimize 
> today's SQL, and of course we can also adjust important configs such as 
> spark.sql.shuffle.partitions.
> So we'd better add a mechanism to store every stage's statistics somewhere 
> and use them in new SQL queries, not just collect statistics after a stage 
> finishes.
> Of course, we'd better {color:#ff}add a version number to the 
> statistics{color} in case they become stale.






[jira] [Assigned] (SPARK-37426) Inline type hints for python/pyspark/mllib/regression.py

2022-03-05 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37426:
------------------------------------------

Assignee: Maciej Szymkiewicz

> Inline type hints for python/pyspark/mllib/regression.py
> --------------------------------------------------------
>
> Key: SPARK-37426
> URL: https://issues.apache.org/jira/browse/SPARK-37426
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/regression.pyi to 
> python/pyspark/mllib/regression.py






[jira] [Resolved] (SPARK-37426) Inline type hints for python/pyspark/mllib/regression.py

2022-03-05 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37426.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35585
[https://github.com/apache/spark/pull/35585]

> Inline type hints for python/pyspark/mllib/regression.py
> --------------------------------------------------------
>
> Key: SPARK-37426
> URL: https://issues.apache.org/jira/browse/SPARK-37426
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/mllib/regression.pyi to 
> python/pyspark/mllib/regression.py






[jira] [Resolved] (SPARK-37400) Inline type hints for python/pyspark/mllib/classification.py

2022-03-05 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37400.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35585
[https://github.com/apache/spark/pull/35585]

> Inline type hints for python/pyspark/mllib/classification.py
> ------------------------------------------------------------
>
> Key: SPARK-37400
> URL: https://issues.apache.org/jira/browse/SPARK-37400
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/mllib/classification.pyi to 
> python/pyspark/mllib/classification.py.






[jira] [Assigned] (SPARK-37400) Inline type hints for python/pyspark/mllib/classification.py

2022-03-05 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37400:
------------------------------------------

Assignee: Maciej Szymkiewicz

> Inline type hints for python/pyspark/mllib/classification.py
> ------------------------------------------------------------
>
> Key: SPARK-37400
> URL: https://issues.apache.org/jira/browse/SPARK-37400
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/classification.pyi to 
> python/pyspark/mllib/classification.py.






[jira] [Assigned] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py

2022-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37430:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/mllib/linalg/distributed.py
> ----------------------------------------------------------------
>
> Key: SPARK-37430
> URL: https://issues.apache.org/jira/browse/SPARK-37430
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/linalg/distributed.pyi to 
> python/pyspark/mllib/linalg/distributed.py






[jira] [Assigned] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py

2022-03-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-37430:


Assignee: Apache Spark

> Inline type hints for python/pyspark/mllib/linalg/distributed.py
> ----------------------------------------------------------------
>
> Key: SPARK-37430
> URL: https://issues.apache.org/jira/browse/SPARK-37430
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/linalg/distributed.pyi to 
> python/pyspark/mllib/linalg/distributed.py






[jira] [Commented] (SPARK-37430) Inline type hints for python/pyspark/mllib/linalg/distributed.py

2022-03-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17501837#comment-17501837
 ] 

Apache Spark commented on SPARK-37430:
--------------------------------------

User 'hi-zir' has created a pull request for this issue:
https://github.com/apache/spark/pull/35739

> Inline type hints for python/pyspark/mllib/linalg/distributed.py
> ----------------------------------------------------------------
>
> Key: SPARK-37430
> URL: https://issues.apache.org/jira/browse/SPARK-37430
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/mllib/linalg/distributed.pyi to 
> python/pyspark/mllib/linalg/distributed.py






[jira] [Created] (SPARK-38422) Encryption algorithms should be used with secure mode and padding scheme

2022-03-05 Thread Bjørn Jørgensen (Jira)
Bjørn Jørgensen created SPARK-38422:
------------------------------------

 Summary: Encryption algorithms should be used with secure mode and 
padding scheme
 Key: SPARK-38422
 URL: https://issues.apache.org/jira/browse/SPARK-38422
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Bjørn Jørgensen


I have scanned the Java files with SonarQube, and in 
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/ExpressionImplUtils.java
it flags:

{code:java}
try {
  if (mode.equalsIgnoreCase("ECB") &&
      (padding.equalsIgnoreCase("PKCS") || padding.equalsIgnoreCase("DEFAULT"))) {
    Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
{code}


The encryption operation mode and the padding scheme should be chosen 
appropriately to guarantee data confidentiality, integrity and authenticity.

For block cipher encryption algorithms (like AES):
The GCM (Galois/Counter Mode) mode, which internally uses zero/no padding, is 
recommended because it is designed to provide both data authenticity 
(integrity) and confidentiality. Other similar modes are CCM, CWC, EAX, IAPM 
and OCB.
The CBC (Cipher Block Chaining) mode by itself provides only data 
confidentiality; it is recommended to use it together with a Message 
Authentication Code or similar to achieve data authenticity (integrity) as 
well, and thus to prevent padding oracle attacks.
The ECB (Electronic Codebook) mode provides no serious message 
confidentiality: under a given key, any given plaintext block is always 
encrypted to the same ciphertext block. This mode should not be used.
For the RSA encryption algorithm, the recommended padding scheme is OAEP.
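A minimal sketch of the recommended alternative, AES/GCM with a random 96-bit IV and a 128-bit authentication tag (the inline key generation and sample plaintext are purely illustrative):

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Sketch: AES/GCM instead of AES/ECB. GCM needs no separate padding scheme
// and authenticates the data; the IV must be unique per encryption per key.
public class GcmSketch {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    byte[] iv = new byte[12];                    // 96-bit IV, the GCM default
    new SecureRandom().nextBytes(iv);
    Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
    byte[] ciphertext = cipher.doFinal("hello".getBytes(StandardCharsets.UTF_8));
    // The IV is not secret; store or transmit it alongside the ciphertext.
  }
}
{code}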



[OWASP Top 10 2021|https://owasp.org/Top10/A02_2021-Cryptographic_Failures/] Category A2 - Cryptographic Failures

[OWASP Top 10 2017|https://owasp.org/www-project-top-ten/2017/A6_2017-Security_Misconfiguration.html] Category A6 - Security Misconfiguration

[Mobile AppSec|https://mobile-security.gitbook.io/masvs/security-requirements/0x08-v3-cryptography_verification_requirements] Verification Standard - Cryptography Requirements

[OWASP Mobile Top 10 2016|https://owasp.org/www-project-mobile-top-10/2016-risks/m5-insufficient-cryptography] Category M5 - Insufficient Cryptography

[MITRE, CWE-327|https://cwe.mitre.org/data/definitions/327.html] - Use of a Broken or Risky Cryptographic Algorithm

[CERT, MSC61-J.|https://wiki.sei.cmu.edu/confluence/display/java/MSC61-J.+Do+not+use+insecure+or+weak+cryptographic+algorithms] - Do not use insecure or weak cryptographic algorithms

[SANS Top 25|https://www.sans.org/top25-software-errors/#cat3] - Porous Defenses






[jira] [Created] (SPARK-38421) Cipher Block Chaining IVs should be unpredictable

2022-03-05 Thread Bjørn Jørgensen (Jira)
Bjørn Jørgensen created SPARK-38421:
------------------------------------

 Summary: Cipher Block Chaining IVs should be unpredictable
 Key: SPARK-38421
 URL: https://issues.apache.org/jira/browse/SPARK-38421
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 3.3.0
Reporter: Bjørn Jørgensen


I have scanned the Java files with SonarQube, and in 
https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java
it flags:

{code:java}
@VisibleForTesting
CryptoOutputStream createOutputStream(WritableByteChannel ch) throws IOException {
  return new CryptoOutputStream(cipher, conf, ch, key, new IvParameterSpec(outIv));
}

@VisibleForTesting
CryptoInputStream createInputStream(ReadableByteChannel ch) throws IOException {
  return new CryptoInputStream(cipher, conf, ch, key, new IvParameterSpec(inIv));
}
{code}

When encrypting data with the Cipher Block Chaining (CBC) mode, an 
Initialization Vector (IV) is used to randomize the encryption, i.e. under a 
given key the same plaintext doesn't always produce the same ciphertext. The 
IV doesn't need to be secret, but it should be unpredictable to avoid 
chosen-plaintext attacks.

To generate Initialization Vectors, NIST recommends using a cryptographically 
secure random number generator.
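A minimal sketch of that recommendation, drawing each CBC IV from a CSPRNG (the inline key generation and sample plaintext are only there to keep the example self-contained):

{code:java}
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

// Sketch: generate a fresh, unpredictable IV per encryption instead of
// reusing a static one.
public class RandomIvSketch {
  public static void main(String[] args) throws Exception {
    SecretKey key = KeyGenerator.getInstance("AES").generateKey();
    byte[] iv = new byte[16];                    // one AES block for CBC
    new SecureRandom().nextBytes(iv);            // CSPRNG, fresh per message
    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    byte[] ciphertext = cipher.doFinal("hello".getBytes(StandardCharsets.UTF_8));
    // Prepend the IV to the ciphertext so the receiver can decrypt.
  }
}
{code}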


[OWASP Top 10 2021|https://owasp.org/Top10/A02_2021-Cryptographic_Failures/] Category A2 - Cryptographic Failures

[OWASP Top 10 2017|https://owasp.org/www-project-top-ten/2017/A6_2017-Security_Misconfiguration.html] Category A6 - Security Misconfiguration

[MITRE, CWE-329|https://cwe.mitre.org/data/definitions/329.html] - Not Using an Unpredictable IV with CBC Mode

[MITRE, CWE-330|https://cwe.mitre.org/data/definitions/330.html] - Use of Insufficiently Random Values

[NIST, SP-800-38A|https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38a.pdf] - Recommendation for Block Cipher Modes of Operation

Derived from FindSecBugs rule [STATIC_IV|https://find-sec-bugs.github.io/bugs.htm#STATIC_IV]









[jira] [Resolved] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap

2022-03-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-38393.
----------------------------------
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35713
[https://github.com/apache/spark/pull/35713]

> Clean up deprecated usage of GenSeq/GenMap
> ------------------------------------------
>
> Key: SPARK-38393
> URL: https://issues.apache.org/jira/browse/SPARK-38393
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> GenSeq/GenMap have been marked @deprecated since Scala 2.13.0, and the Gen* 
> collection types have since been removed.
>  






[jira] [Assigned] (SPARK-38393) Clean up deprecated usage of GenSeq/GenMap

2022-03-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-38393:


Assignee: Yang Jie

> Clean up deprecated usage of GenSeq/GenMap
> ------------------------------------------
>
> Key: SPARK-38393
> URL: https://issues.apache.org/jira/browse/SPARK-38393
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> GenSeq/GenMap have been marked @deprecated since Scala 2.13.0, and the Gen* 
> collection types have since been removed.
>  






[jira] [Created] (SPARK-38420) Upgrade bcprov-jdk15on from 1.60 to 1.67

2022-03-05 Thread Bjørn Jørgensen (Jira)
Bjørn Jørgensen created SPARK-38420:
------------------------------------

 Summary: Upgrade bcprov-jdk15on from 1.60 to 1.67
 Key: SPARK-38420
 URL: https://issues.apache.org/jira/browse/SPARK-38420
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.3.0
Reporter: Bjørn Jørgensen


Upgrade bcprov-jdk15on from 1.60 to 1.67 to address 
[CVE-2020-15522|https://nvd.nist.gov/vuln/detail/CVE-2020-15522].

See the [release notes|https://github.com/bcgit/bc-java/blob/master/docs/releasenotes.html].


