[jira] [Comment Edited] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-07-03 Thread qingwu.fu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151133#comment-17151133
 ] 

qingwu.fu edited comment on SPARK-30602 at 7/3/20, 10:56 PM:
-

Hi [~mshen], could I have access to see the source code? Thanks.

Or when will this work be open sourced?


was (Author: qingwu.fu):
hi [~mshen], could I have the access auth to see the source code? Thanks.

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_2020_magnet_shuffle.pdf
>
>
> In a large deployment of Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable the 
> Spark external shuffle service and store the intermediate shuffle files on 
> HDDs. Because the number of blocks generated for a particular shuffle grows 
> quadratically with the size of the shuffled data (the numbers of mappers and 
> reducers grow linearly with the size of the shuffled data, but the number of 
> blocks is # mappers * # reducers), one general trend we have observed is that 
> the more data a Spark application processes, the smaller the block size 
> becomes. In a few production clusters we have seen, the average shuffle block 
> size is only tens of KBs. Because of the inefficiency of performing random 
> reads on HDDs for small amounts of data, the overall efficiency of the Spark 
> external shuffle services serving the shuffle blocks degrades as an 
> increasing number of Spark applications process an increasing amount of data. 
> In addition, because the Spark external shuffle service is a shared service 
> in a multi-tenant cluster, the inefficiency of one Spark application could 
> propagate to other applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of the mappers, and blocks are 
> pre-merged and pushed towards the reducers. In our prototype implementation, 
> we have seen significant efficiency improvements when performing large 
> shuffles. We take a Spark-native approach to achieve this, i.e., extending 
> Spark's existing Netty-based shuffle protocol and the behaviors of Spark 
> mappers, reducers and drivers. This way, we can bring the benefits of more 
> efficient shuffle in Spark without incurring the dependency on, or overhead 
> of, either a specialized storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html
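To make the block-count arithmetic above concrete, here is a small back-of-the-envelope sketch; the numbers are purely illustrative and are not taken from this ticket.

{code:lang=python}
# Illustrative only: average shuffle block size for a hypothetical job.
# With M mappers and R reducers, a shuffle produces roughly M * R blocks.
def avg_block_size_bytes(shuffle_bytes, num_mappers, num_reducers):
    # the total shuffled data is spread over num_mappers * num_reducers blocks
    return shuffle_bytes / (num_mappers * num_reducers)

# 1 TB shuffled across 5,000 mappers and 5,000 reducers:
# 1e12 / (5000 * 5000) = 40,000 bytes, i.e. tens of KBs per block
print(avg_block_size_bytes(1e12, 5000, 5000))
{code}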



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30602) SPIP: Support push-based shuffle to improve shuffle efficiency

2020-07-03 Thread qingwu.fu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151133#comment-17151133
 ] 

qingwu.fu commented on SPARK-30602:
---

Hi [~mshen], could I have access to see the source code? Thanks.

> SPIP: Support push-based shuffle to improve shuffle efficiency
> --
>
> Key: SPARK-30602
> URL: https://issues.apache.org/jira/browse/SPARK-30602
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Min Shen
>Priority: Major
> Attachments: Screen Shot 2020-06-23 at 11.31.22 AM.jpg, 
> vldb_2020_magnet_shuffle.pdf
>
>
> In a large deployment of Spark compute infrastructure, Spark shuffle is 
> becoming a potential scaling bottleneck and a source of inefficiency in the 
> cluster. When running Spark on YARN at large scale, people usually enable the 
> Spark external shuffle service and store the intermediate shuffle files on 
> HDDs. Because the number of blocks generated for a particular shuffle grows 
> quadratically with the size of the shuffled data (the numbers of mappers and 
> reducers grow linearly with the size of the shuffled data, but the number of 
> blocks is # mappers * # reducers), one general trend we have observed is that 
> the more data a Spark application processes, the smaller the block size 
> becomes. In a few production clusters we have seen, the average shuffle block 
> size is only tens of KBs. Because of the inefficiency of performing random 
> reads on HDDs for small amounts of data, the overall efficiency of the Spark 
> external shuffle services serving the shuffle blocks degrades as an 
> increasing number of Spark applications process an increasing amount of data. 
> In addition, because the Spark external shuffle service is a shared service 
> in a multi-tenant cluster, the inefficiency of one Spark application could 
> propagate to other applications as well.
> In this ticket, we propose a solution to improve Spark shuffle efficiency in 
> the above-mentioned environments with push-based shuffle. With push-based 
> shuffle, shuffle is performed at the end of the mappers, and blocks are 
> pre-merged and pushed towards the reducers. In our prototype implementation, 
> we have seen significant efficiency improvements when performing large 
> shuffles. We take a Spark-native approach to achieve this, i.e., extending 
> Spark's existing Netty-based shuffle protocol and the behaviors of Spark 
> mappers, reducers and drivers. This way, we can bring the benefits of more 
> efficient shuffle in Spark without incurring the dependency on, or overhead 
> of, either a specialized storage layer or external infrastructure pieces.
>  
> Link to dev mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-push-based-shuffle-in-Spark-td28732.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32168) DSv2 SQL overwrite incorrectly uses static plan with hidden partitions

2020-07-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32168:


Assignee: Apache Spark

> DSv2 SQL overwrite incorrectly uses static plan with hidden partitions
> --
>
> Key: SPARK-32168
> URL: https://issues.apache.org/jira/browse/SPARK-32168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> The v2 analyzer rule {{ResolveInsertInto}} tries to detect when a static 
> overwrite and a dynamic overwrite would produce the same result and will 
> choose to use static overwrite in that case. It will only use a dynamic 
> overwrite if there is a partition column without a static value and the SQL 
> mode is set to dynamic.
> {code:lang=scala}
> val dynamicPartitionOverwrite = partCols.size > staticPartitions.size &&
>   conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
> {code}
> The problem is that {{partCols}} contains only the names of partitions that 
> appear in the column list (identity partitions) and does not include hidden 
> partitions, like {{days(ts)}}. As a result, this check does not detect hidden 
> partitions and therefore does not use dynamic overwrite. Static overwrite is 
> used instead; when a table has only hidden partitions, the static filter 
> drops all table data.
> This is a correctness bug because Spark will overwrite more data than just 
> the set of partitions being written to in dynamic mode. The impact is limited 
> because this rule is only used for SQL queries (not plans from 
> DataFrameWriters) and only affects tables with hidden partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32168) DSv2 SQL overwrite incorrectly uses static plan with hidden partitions

2020-07-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151118#comment-17151118
 ] 

Apache Spark commented on SPARK-32168:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/28993

> DSv2 SQL overwrite incorrectly uses static plan with hidden partitions
> --
>
> Key: SPARK-32168
> URL: https://issues.apache.org/jira/browse/SPARK-32168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Blocker
>  Labels: correctness
>
> The v2 analyzer rule {{ResolveInsertInto}} tries to detect when a static 
> overwrite and a dynamic overwrite would produce the same result and will 
> choose to use static overwrite in that case. It will only use a dynamic 
> overwrite if there is a partition column without a static value and the SQL 
> mode is set to dynamic.
> {code:lang=scala}
> val dynamicPartitionOverwrite = partCols.size > staticPartitions.size &&
>   conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
> {code}
> The problem is that {{partCols}} contains only the names of partitions that 
> appear in the column list (identity partitions) and does not include hidden 
> partitions, like {{days(ts)}}. As a result, this check does not detect hidden 
> partitions and therefore does not use dynamic overwrite. Static overwrite is 
> used instead; when a table has only hidden partitions, the static filter 
> drops all table data.
> This is a correctness bug because Spark will overwrite more data than just 
> the set of partitions being written to in dynamic mode. The impact is limited 
> because this rule is only used for SQL queries (not plans from 
> DataFrameWriters) and only affects tables with hidden partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32168) DSv2 SQL overwrite incorrectly uses static plan with hidden partitions

2020-07-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32168:


Assignee: (was: Apache Spark)

> DSv2 SQL overwrite incorrectly uses static plan with hidden partitions
> --
>
> Key: SPARK-32168
> URL: https://issues.apache.org/jira/browse/SPARK-32168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Priority: Blocker
>  Labels: correctness
>
> The v2 analyzer rule {{ResolveInsertInto}} tries to detect when a static 
> overwrite and a dynamic overwrite would produce the same result and will 
> choose to use static overwrite in that case. It will only use a dynamic 
> overwrite if there is a partition column without a static value and the SQL 
> mode is set to dynamic.
> {code:lang=scala}
> val dynamicPartitionOverwrite = partCols.size > staticPartitions.size &&
>   conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
> {code}
> The problem is that {{partCols}} contains only the names of partitions that 
> appear in the column list (identity partitions) and does not include hidden 
> partitions, like {{days(ts)}}. As a result, this check does not detect hidden 
> partitions and therefore does not use dynamic overwrite. Static overwrite is 
> used instead; when a table has only hidden partitions, the static filter 
> drops all table data.
> This is a correctness bug because Spark will overwrite more data than just 
> the set of partitions being written to in dynamic mode. The impact is limited 
> because this rule is only used for SQL queries (not plans from 
> DataFrameWriters) and only affects tables with hidden partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32168) DSv2 SQL overwrite incorrectly uses static plan with hidden partitions

2020-07-03 Thread Ryan Blue (Jira)
Ryan Blue created SPARK-32168:
-

 Summary: DSv2 SQL overwrite incorrectly uses static plan with 
hidden partitions
 Key: SPARK-32168
 URL: https://issues.apache.org/jira/browse/SPARK-32168
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ryan Blue


The v2 analyzer rule {{ResolveInsertInto}} tries to detect when a static 
overwrite and a dynamic overwrite would produce the same result and will choose 
to use static overwrite in that case. It will only use a dynamic overwrite if 
there is a partition column without a static value and the SQL mode is set to 
dynamic.

{code:lang=scala}
val dynamicPartitionOverwrite = partCols.size > staticPartitions.size &&
  conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC
{code}

The problem is that {{partCols}} contains only the names of partitions that 
appear in the column list (identity partitions) and does not include hidden 
partitions, like {{days(ts)}}. As a result, this check does not detect hidden 
partitions and therefore does not use dynamic overwrite. Static overwrite is 
used instead; when a table has only hidden partitions, the static filter 
drops all table data.

This is a correctness bug because Spark will overwrite more data than just the 
set of partitions being written to in dynamic mode. The impact is limited 
because this rule is only used for SQL queries (not plans from 
DataFrameWriters) and only affects tables with hidden partitions.
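The scenario can be illustrated with a hypothetical sketch (the catalog, table and column names below are made up and assume a DSv2 catalog whose source supports hidden partition transforms):

{code:lang=python}
# Sketch only: illustrates the affected SQL path, not a test from this ticket.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-32168-sketch").getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Table partitioned only by a hidden transform, so partCols is empty.
spark.sql(
    "CREATE TABLE cat.db.events (id BIGINT, ts TIMESTAMP) "
    "PARTITIONED BY (days(ts))")

# Expected: dynamic overwrite of only the day partitions present in the input.
# With the bug: the rule plans a static overwrite and drops all table data.
spark.sql("INSERT OVERWRITE TABLE cat.db.events SELECT id, ts FROM new_events")
{code}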



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150834#comment-17150834
 ] 

Apache Spark commented on SPARK-32167:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28992

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Critical
>  Labels: correctness
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32167:


Assignee: Wenchen Fan  (was: Apache Spark)

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Critical
>  Labels: correctness
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32167:


Assignee: Apache Spark  (was: Wenchen Fan)

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Critical
>  Labels: correctness
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-03 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-32167:
---

 Summary: nullability of GetArrayStructFields is incorrect
 Key: SPARK-32167
 URL: https://issues.apache.org/jira/browse/SPARK-32167
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0, 2.4.6
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32167) nullability of GetArrayStructFields is incorrect

2020-07-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-32167:

Labels: correctness  (was: )

> nullability of GetArrayStructFields is incorrect
> 
>
> Key: SPARK-32167
> URL: https://issues.apache.org/jira/browse/SPARK-32167
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Critical
>  Labels: correctness
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark

2020-07-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150831#comment-17150831
 ] 

Fabian Höring commented on SPARK-25433:
---

[~hyukjin.kwon]
I actually noticed that it could be interesting to combine this somehow with 
spark udf because they require pandas and pyarrow to be shipped to the cluster. 
So it doesn't just work by using the pypi pyspark package.

That's why this example is straight from the spark documentation:
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md#submit-the-spark-application-to-the-cluster

From https://spark.apache.org/docs/2.4.5/sql-pyspark-pandas-with-arrow.html

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The problem why it doesn't work out of the box is that there can be only one 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25433) Add support for PEX in PySpark

2020-07-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150831#comment-17150831
 ] 

Fabian Höring edited comment on SPARK-25433 at 7/3/20, 8:14 AM:


[~hyukjin.kwon]
I actually noticed that it could be interesting to combine this somehow with 
pandas udf because they require pandas and pyarrow to be shipped to the 
cluster. So it doesn't just work by using the pypi pyspark package.

That's why this example is straight from the spark documentation:
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md#submit-the-spark-application-to-the-cluster

From https://spark.apache.org/docs/2.4.5/sql-pyspark-pandas-with-arrow.html


was (Author: fhoering):
[~hyukjin.kwon]
I actually noticed that it could be interesting to combine this somehow with 
spark udf because they require pandas and pyarrow to be shipped to the cluster. 
So it doesn't just work by using the pypi pyspark package.

That's why this example is straight from the spark documentation:
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md#submit-the-spark-application-to-the-cluster

From https://spark.apache.org/docs/2.4.5/sql-pyspark-pandas-with-arrow.html

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The problem why it doesn't work out of the box is that there can be only one 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25433) Add support for PEX in PySpark

2020-07-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150827#comment-17150827
 ] 

Fabian Höring edited comment on SPARK-25433 at 7/3/20, 8:07 AM:


Yes, you can put me in the tickets. I can also contribute if you give me some 
guidance on what detail is expected and where to add this.

All depends on how much detail you want to include in the documentation. 

Trimmed down to one sentence it could be enough just to say: pex is supported 
by tweaking the PYSPARK_PYTHON env variable & show some sample code.

{code}
$ pex numpy -o myarchive.pex
$ export PYSPARK_PYTHON=./myarchive.pex
{code}
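A minimal PySpark-side sketch of the same idea (the file and app names are illustrative, and it assumes the .pex archive has been shipped so that ./myarchive.pex exists in each executor's working directory, for example via spark-submit --files myarchive.pex):

{code:lang=python}
# Sketch only: point PySpark at the pex interpreter before the session starts.
import os
from pyspark.sql import SparkSession

# Must be set before the SparkContext/SparkSession is created; the same
# relative path has to resolve on the executors (the shipped myarchive.pex).
os.environ["PYSPARK_PYTHON"] = "./myarchive.pex"

spark = SparkSession.builder.appName("pex-demo").getOrCreate()
{code}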

On my side I incremented on this idea and wrapped all this into my own tool 
which is reused by our distributed TensorFlow and PySpark jobs. I also did a 
simple end to end docker standalone spark example with S3 storage (via minio) 
for integration tests. 
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md

It could also make sense to go beyond documentation and upstream some parts of 
this code:
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/spark/spark_config_builder.py

It is currently hacked, as it uses the private _option property. I admit I was 
lazy about getting into the Spark code again, but an easy fix would be just 
exposing the private options attribute to get more flexibility on 
Spark.SessionBuilder.

One could also directly expose a method add_pex_support in SparkSession.Builder, 
but personally I think it would clutter the code too much. All this should stay 
application specific and indeed makes more sense to be included in the doc.






was (Author: fhoering):
Yes, you can put me in the tickets. I can also contribute if you give me some 
guidance on what detail is expected and where to add this.

All depends on how much detail you want to include in the documentation. 

Trimmed down to one sentence it could be enough just to say: pex is supported 
by tweaking the PYSPARK_PYTHON env variable & show some sample code.

{code}
$ pex numpy -o myarchive.pex
$ export PYSPARK_PYTHON=./myarchive.pex
{code}

On my side I incremented on this idea and wrapped all this into my own tool 
which is reused by our distributed TensorFlow jobs. I also did a simple end to 
end docker standalone spark example with S3 storage (via minio) for integration 
tests. 
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md

It could also make sense to go beyond documentation and upstream some parts of 
this code:
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/spark/spark_config_builder.py

It is currently hacked, as it uses the private _option property. I admit I was 
lazy about getting into the Spark code again, but an easy fix would be just 
exposing the private options attribute to get more flexibility on 
Spark.SessionBuilder.

One could also directly expose a method add_pex_support in SparkSession.Builder, 
but personally I think it would clutter the code too much. All this should stay 
application specific and indeed makes more sense to be included in the doc.





> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another 

[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark

2020-07-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150827#comment-17150827
 ] 

Fabian Höring commented on SPARK-25433:
---

Yes, you can put me in the tickets. I can also contribute if you give me some 
guidance on what detail is expected and where to add this.

All depends on how much detail you want to include in the documentation. 

Trimmed down to one sentence it could be enough just to say: pex is supported 
by tweaking the PYSPARK_PYTHON env variable & show some sample code.

{code}
$ pex numpy -o myarchive.pex
$ export PYSPARK_PYTHON=./myarchive.pex
{code}

On my side I incremented on this idea and wrapped all this into my own tool 
which is reused by our distributed TensorFlow jobs. I also did a simple end to 
end docker standalone spark example with S3 storage (via minio) for integration 
tests. 
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md

It could also make sense to go beyond documentation and upstream some parts of 
this code:
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/spark/spark_config_builder.py

It is currently hacked, as it uses the private _option property. I admit I was 
lazy about getting into the Spark code again, but an easy fix would be just 
exposing the private options attribute to get more flexibility on 
Spark.SessionBuilder.

One could also directly expose a method add_pex_support in SparkSession.Builder, 
but personally I think it would clutter the code too much. All this should stay 
application specific and indeed makes more sense to be included in the doc.





> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The problem why it doesn't work out of the box is that there can be only one 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26533) Support query auto cancel on thriftserver

2020-07-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150763#comment-17150763
 ] 

Apache Spark commented on SPARK-26533:
--

User 'leoluan2009' has created a pull request for this issue:
https://github.com/apache/spark/pull/28991

> Support query auto cancel on thriftserver
> -
>
> Key: SPARK-26533
> URL: https://issues.apache.org/jira/browse/SPARK-26533
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: zhoukang
>Priority: Major
>
> Support automatically cancelling queries that run too long on the Thrift 
> server. In some cases, we use the Thrift server as a long-running 
> application. Sometimes we want no query to run longer than a given amount of 
> time. In these cases, we can enable auto-cancel for time-consuming queries, 
> which lets us release resources for other queries to run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26533) Support query auto cancel on thriftserver

2020-07-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26533:


Assignee: Apache Spark

> Support query auto cancel on thriftserver
> -
>
> Key: SPARK-26533
> URL: https://issues.apache.org/jira/browse/SPARK-26533
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: zhoukang
>Assignee: Apache Spark
>Priority: Major
>
> Support automatically cancelling queries that run too long on the Thrift 
> server. In some cases, we use the Thrift server as a long-running 
> application. Sometimes we want no query to run longer than a given amount of 
> time. In these cases, we can enable auto-cancel for time-consuming queries, 
> which lets us release resources for other queries to run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26533) Support query auto cancel on thriftserver

2020-07-03 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26533:


Assignee: (was: Apache Spark)

> Support query auto cancel on thriftserver
> -
>
> Key: SPARK-26533
> URL: https://issues.apache.org/jira/browse/SPARK-26533
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: zhoukang
>Priority: Major
>
> Support automatically cancelling queries that run too long on the Thrift 
> server. In some cases, we use the Thrift server as a long-running 
> application. Sometimes we want no query to run longer than a given amount of 
> time. In these cases, we can enable auto-cancel for time-consuming queries, 
> which lets us release resources for other queries to run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26533) Support query auto cancel on thriftserver

2020-07-03 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150764#comment-17150764
 ] 

Apache Spark commented on SPARK-26533:
--

User 'leoluan2009' has created a pull request for this issue:
https://github.com/apache/spark/pull/28991

> Support query auto cancel on thriftserver
> -
>
> Key: SPARK-26533
> URL: https://issues.apache.org/jira/browse/SPARK-26533
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: zhoukang
>Priority: Major
>
> Support automatically cancelling queries that run too long on the Thrift 
> server. In some cases, we use the Thrift server as a long-running 
> application. Sometimes we want no query to run longer than a given amount of 
> time. In these cases, we can enable auto-cancel for time-consuming queries, 
> which lets us release resources for other queries to run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25433) Add support for PEX in PySpark

2020-07-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150831#comment-17150831
 ] 

Fabian Höring edited comment on SPARK-25433 at 7/3/20, 9:04 AM:


[~hyukjin.kwon]
I actually noticed that it could be interesting to combine this somehow with 
pandas udf because they require pandas and pyarrow to be shipped to the 
cluster. So it doesn't just work by using the pypi pyspark package.

That's why this example is straight from the spark documentation:
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md#submit-the-spark-application-to-the-cluster

From https://spark.apache.org/docs/2.4.4/sql-pyspark-pandas-with-arrow.html#grouped-aggregate


was (Author: fhoering):
[~hyukjin.kwon]
I actually noticed that it could be interesting to combine this somehow with 
pandas udf because they require pandas and pyarrow to be shipped to the 
cluster. So it doesn't just work by using the pypi pyspark package.

That's why this example is straight from the spark documentation:
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md#submit-the-spark-application-to-the-cluster

From https://spark.apache.org/docs/2.4.5/sql-pyspark-pandas-with-arrow.html

> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to another machine. In particular there is the relocatable option 
> (see 
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
>  
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)]
>  which makes it very complicated for the user to ship the virtual env and be 
> sure it works.
> And here is where pex comes in. It is a nice way to create a single 
> executable zip file with all dependencies included. You have the pex command 
> line tool to build your package and when it is built you are sure it works. 
> This is in my opinion the most elegant way to ship python code (better than 
> virtual env and conda)
> The problem why it doesn't work out of the box is that there can be only one 
> single entry point. So just shipping the pex files and setting PYSPARK_PYTHON 
> to the pex files doesn't work. You can nevertheless tune the env variable 
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
>  and runtime to provide different entry points.
> PR: [https://github.com/apache/spark/pull/22422/files]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25433) Add support for PEX in PySpark

2020-07-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150827#comment-17150827
 ] 

Fabian Höring edited comment on SPARK-25433 at 7/3/20, 8:53 AM:


Yes, you can put me in the tickets. I can also contribute if you give me some 
guidance on what detail is expected and where to add this.

All depends on how much detail you want to include in the documentation. 

Trimmed down to one sentence it could be enough just to say: pex is supported 
by tweaking the PYSPARK_PYTHON env variable & show some sample code.

{code}
$ pex numpy -o myarchive.pex
$ export PYSPARK_PYTHON=./myarchive.pex
{code}

On my side I incremented on this idea and wrapped all this into my own tool 
which is reused by our distributed TensorFlow and PySpark jobs. I also did a 
simple end to end docker standalone spark example with S3 storage (via minio) 
for integration tests. 
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md

It could also make sense to go beyond documentation and upstream some parts of 
this code:
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/spark/spark_config_builder.py

It is currently hacked, as it uses the private _option variable. I admit I was 
lazy about getting into the Spark code again, but an easy fix would be just 
exposing the private _option variable to get more flexibility on 
Spark.SessionBuilder.

One could also directly expose a method add_pex_support in SparkSession.Builder, 
but personally I think it would clutter the code too much. All this should stay 
application specific and indeed makes more sense to be included in the doc.






was (Author: fhoering):
Yes, you can put me in the tickets. I can also contribute if you give me some 
guidance on what detail is expected and where to add this.

All depends on how much detail you want to include in the documentation. 

Trimmed down to one sentence it could be enough just to say: pex is supported 
by tweaking the PYSPARK_PYTHON env variable & show some sample code.

{code}
$ pex numpy -o myarchive.pex
$ export PYSPARK_PYTHON=./myarchive.pex
{code}

On my side I incremented on this idea and wrapped all this into my own tool 
which is reused by our distributed TensorFlow and PySpark jobs. I also did a 
simple end to end docker standalone spark example with S3 storage (via minio) 
for integration tests. 
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md

It could also make sense to go beyond documentation and upstream some parts of 
this code:
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/spark/spark_config_builder.py

It is currently hacked, as it uses the private _option variable. I admit I was 
lazy about getting into the Spark code again, but an easy fix would be just 
exposing the private _option variable to get more flexibility on 
Spark.SessionBuilder.

One could also directly expose a method add_pex_support in SparkSession.Builder, 
but personally I think it would clutter the code too much. All this should stay 
application specific and indeed makes more sense to be included in the doc.





> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to 

[jira] [Comment Edited] (SPARK-25433) Add support for PEX in PySpark

2020-07-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17150827#comment-17150827
 ] 

Fabian Höring edited comment on SPARK-25433 at 7/3/20, 8:52 AM:


Yes, you can put me in the tickets. I can also contribute if you give me some 
guidance on what detail is expected and where to add this.

All depends on how much detail you want to include in the documentation. 

Trimmed down to one sentence it could be enough just to say: pex is supported 
by tweaking the PYSPARK_PYTHON env variable & show some sample code.

{code}
$ pex numpy -o myarchive.pex
$ export PYSPARK_PYTHON=./myarchive.pex
{code}

On my side I incremented on this idea and wrapped all this into my own tool 
which is reused by our distributed TensorFlow and PySpark jobs. I also did a 
simple end to end docker standalone spark example with S3 storage (via minio) 
for integration tests. 
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md

It could also make sense to go beyond documentation and upstream some parts of 
this code:
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/spark/spark_config_builder.py

It is currently hacked, as it uses the private _option variable. I admit I was 
lazy about getting into the Spark code again, but an easy fix would be just 
exposing the private _option variable to get more flexibility on 
Spark.SessionBuilder.

One could also directly expose a method add_pex_support in SparkSession.Builder, 
but personally I think it would clutter the code too much. All this should stay 
application specific and indeed makes more sense to be included in the doc.






was (Author: fhoering):
Yes, you can put me in the tickets. I can also contribute if you give me some 
guidance on what detail is expected and where to add this.

All depends on how much detail you want to include in the documentation. 

Trimmed down to one sentence it could be enough just to say: pex is supported 
by tweaking the PYSPARK_PYTHON env variable & show some sample code.

{code}
$ pex numpy -o myarchive.pex
$ export PYSPARK_PYTHON=./myarchive.pex
{code}

On my side I incremented on this idea and wrapped all this into my own tool 
which is reused by our distributed TensorFlow and PySpark jobs. I also did a 
simple end to end docker standalone spark example with S3 storage (via minio) 
for integration tests. 
https://github.com/criteo/cluster-pack/blob/master/examples/spark-with-S3/README.md

It could also make sense to go beyond documentation and upstream some parts of 
this code:
https://github.com/criteo/cluster-pack/blob/master/cluster_pack/spark/spark_config_builder.py

It is currently hacked, as it uses the private _option property. I admit I was 
lazy about getting into the Spark code again, but an easy fix would be just 
exposing the private options attribute to get more flexibility on 
Spark.SessionBuilder.

One could also directly expose a method add_pex_support in SparkSession.Builder, 
but personally I think it would clutter the code too much. All this should stay 
application specific and indeed makes more sense to be included in the doc.





> Add support for PEX in PySpark
> --
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.2.2
>Reporter: Fabian Höring
>Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the spark 
> executors using [PEX|https://github.com/pantsbuild/pex] 
> This currently works fine with 
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
>  (disadvantages are that you have a separate conda package repo and ship the 
> python interpreter all the time)
> Basically the workflow is
>  * to zip the local conda environment ([conda 
> pack|https://github.com/conda/conda-pack] also works)
>  * ship it to each executor as an archive
>  * modify PYSPARK_PYTHON to the local conda environment
> I think it can work the same way with virtual env. There is the SPARK-13587 
> ticket to provide nice entry points to spark-submit and SparkContext but 
> zipping your local virtual env and then just changing PYSPARK_PYTHON env 
> variable should already work.
> I also have seen this 
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
>  But recreating the virtual env each time doesn't seem to be a very scalable 
> solution. If you have hundreds of executors it will retrieve the packages on 
> each executor and recreate your virtual environment each time. Same problem 
> with this proposal SPARK-16367 from what I understood.
> Another problem with virtual env is that your local environment is not easily 
> shippable to 

[jira] [Reopened] (SPARK-31666) Cannot map hostPath volumes to container

2020-07-03 Thread Stephen Hopper (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Hopper reopened SPARK-31666:


Hi [~dongjoon],

This is still a bug. I should clarify the issue a bit more.

In Spark 2.4, the `LocalDirsFeatureStep` iterates through the list of paths in 
`spark.local.dir`. For each one, it creates a Kubernetes volume of mount type 
`emptyDir` with the name `spark-local-dir-${index}`.

In Spark 3.0, the `LocalDirsFeatureStep` checks for any volume mounts with the 
prefix `spark-local-dir-`. If none exist, it iterates through the list of paths 
in `spark.local.dir` and creates a Kubernetes volume of mount type `emptyDir` 
with the name `spark-local-dir-${index}`.

 

The issue is that I need my Spark job to use paths from my host machine that 
are on a mount point that isn't part of the directory which Kubernetes uses to 
allocate space for `emptyDir` volumes. Therefore, I mount these paths as type 
`hostPath` and ask Spark to use them as local directory space.
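A minimal sketch of that configuration (the spark.kubernetes.*.volumes.* property names come from the Kubernetes volume documentation linked in this issue; the volume name, mount path and host path are illustrative): the hostPath volume is pre-declared under the spark-local-dir- prefix and spark.local.dir points at its mount path, which is exactly the volume that Spark 3.0's LocalDirsFeatureStep reuses instead of adding its own emptyDir.

{code:lang=python}
# Sketch only (illustrative names and paths): executor-side hostPath scratch
# space; each entry would typically be passed to spark-submit as --conf.
local_dir_conf = {
    # use the mounted path for shuffle/spill scratch space
    "spark.local.dir": "/data/spark-scratch",
    # pre-declare the volume with the spark-local-dir- prefix so that
    # Spark 3.0 reuses it instead of adding an emptyDir volume
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path":
        "/data/spark-scratch",
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path":
        "/mnt/fast-disk",
}
{code}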

 

The error regarding the path already being mounted is happening because the 
path has already been mounted of my own accord; Spark need not attempt to add 
another volume mapping for something I've already done. Hence, this is a bug 
which can be resolved by simply backporting SPARK-28042. That issue is 
entangled a bit with changes from SPARK-25262. However, the changes for 
supporting tmpfs in SPARK-25262 are not required to fix this. While it would be 
easiest to just backport both fixes, I respect your desire to only backport 
fixes and avoid backporting features, so I will open a PR that just includes 
SPARK-28042. How does this sound to you?

 

On a separate note, I realize that Spark 2.4 has been out for over 18 months 
and the policy states that it will only be supported for 18 months. However, 
given the gap between the release of Spark 2.4 and the official release of 
Spark 3.0 exceeded 18 months and given that the fix for the issue I'm currently 
experiencing was merged into Spark 3.0 a full year before Spark 3.0 was 
released and wasn't made available to folks still using Spark 2.4 and 
experiencing the issue, I feel the policy should be revised to "minor release 
will be supported for 18 months or 6 months from the date of the release of 
their successor, whichever is later". I feel giving folks 6 months to migrate 
from one Spark release to the next is fair, especially now considering how 
mature Spark is as a project. What are your thoughts on this?

> Cannot map hostPath volumes to container
> 
>
> Key: SPARK-31666
> URL: https://issues.apache.org/jira/browse/SPARK-31666
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5
>Reporter: Stephen Hopper
>Priority: Major
>
> I'm trying to mount additional hostPath directories as seen in a couple of 
> places:
> [https://aws.amazon.com/blogs/containers/optimizing-spark-performance-on-kubernetes/]
> [https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md#using-volume-for-scratch-space]
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html#using-kubernetes-volumes]
>  
> However, whenever I try to submit my job, I run into this error:
> {code:java}
> Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1 │
>  io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://kubernetes.default.svc/api/v1/namespaces/my-spark-ns/pods. 
> Message: Pod "spark-pi-1588970477877-exec-1" is invalid: 
> spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
> unique. Received status: Status(apiVersion=v1, code=422, 
> details=StatusDetails(causes=[StatusCause(field=spec.containers[0].volumeMounts[1].mountPath,
>  message=Invalid value: "/tmp1": must be unique, reason=FieldValueInvalid, 
> additionalProperties={})], group=null, kind=Pod, 
> name=spark-pi-1588970477877-exec-1, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=Pod 
> "spark-pi-1588970477877-exec-1" is invalid: 
> spec.containers[0].volumeMounts[1].mountPath: Invalid value: "/tmp1": must be 
> unique, metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=Invalid, status=Failure, additionalProperties={}).{code}
>  
> This is my spark-submit command (note: I've used my own build of spark for 
> kubernetes as well as a few other images that I've seen floating around (such 
> as this one seedjeffwan/spark:v2.4.5) and they all have this same issue):
> {code:java}
> bin/spark-submit \
>  --master k8s://https://my-k8s-server:443 \
>  --deploy-mode cluster \
>  --name spark-pi \
>  --class org.apache.spark.examples.SparkPi \
>  --conf