[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2018-09-19 Thread holdenk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621389#comment-16621389
 ] 

holdenk commented on SPARK-17602:
-

Did we end up going anywhere with this?

> PySpark - Performance Optimization Large Size of Broadcast Variable
> ---
>
> Key: SPARK-17602
> URL: https://issues.apache.org/jira/browse/SPARK-17602
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.2, 2.0.0
> Environment: Linux
>Reporter: Xiao Ming Bao
>Priority: Major
> Attachments: PySpark – Performance Optimization for Large Size of 
> Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: currently, on the executor side, the broadcast variable is written 
> to disk as a file, and each Python worker process reads the broadcast data 
> from local disk and deserializes it into a Python object before executing a 
> task. When the broadcast variable is large, this read/deserialization takes 
> a lot of time. And when the Python worker is NOT reused and the number of 
> tasks is large, performance is very poor, since the Python worker needs to 
> read/deserialize the data for every task.
> Brief of the solution:
>  transfer the broadcast variable to the daemon Python process via a file (or 
> socket/mmap) and deserialize it into an object inside the daemon. After a 
> worker Python process is forked by the daemon, the worker automatically has 
> the deserialized object and can use it directly, thanks to Linux's 
> copy-on-write memory sharing.
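A minimal sketch of the proposed flow (illustrative only: the cache name, 
helper functions, and file path below are hypothetical, not actual PySpark 
internals). The daemon deserializes the broadcast value once; a forked 
worker then inherits the in-memory object through Linux's copy-on-write 
page sharing instead of re-reading and unpickling it per task:

{code}
import os
import pickle

BROADCAST_CACHE = {}  # lives in the daemon, inherited by forked workers

def daemon_load_broadcast(bid, path):
    """Read and deserialize a broadcast file once, in the daemon."""
    with open(path, "rb") as f:
        BROADCAST_CACHE[bid] = pickle.load(f)

def run_task(bid):
    """Runs in a forked worker; the object is already deserialized."""
    value = BROADCAST_CACHE[bid]  # no per-task disk read or unpickling
    print(len(value))

if __name__ == "__main__":
    # Simulate the executor writing the broadcast to local disk.
    with open("/tmp/broadcast_42", "wb") as f:
        pickle.dump(list(range(1000)), f)
    daemon_load_broadcast(42, "/tmp/broadcast_42")
    if os.fork() == 0:       # child: the worker process
        run_task(42)         # pages shared copy-on-write with the daemon
        os._exit(0)
    os.wait()
{code}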






[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2017-09-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162457#comment-16162457
 ] 

holdenk commented on SPARK-17602:
-

[~liujunf] how about you go ahead and make a pull request and put [WIP] in the 
title so we can all take a look at it? I've got some more bandwidth available 
to do reviews, and if we need to, we can discuss it some more at Spark Summit.




[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2017-01-20 Thread Junfeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831964#comment-15831964
 ] 

Junfeng commented on SPARK-17602:
-

[~davies] the trouble is really that the Python worker reuse mode does not 
work for many cases. For example, static variables cause trouble across 
tasks, so many of our users do not enable the shared mode, which leads to 
this issue.
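A Spark-free, hypothetical illustration of that hazard: module-level 
("static") state persists from one task to the next whenever the same 
worker process is reused.

{code}
# Module-level ("static") state lives as long as the worker process.
_cache = {}

def process_task(task_id, value):
    _cache[task_id] = value   # state from earlier tasks leaks in
    return len(_cache)        # result depends on what ran here before

print(process_task(1, "a"))   # 1
print(process_task(2, "b"))   # 2 -- a fresh worker would print 1
{code}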




[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2017-01-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830644#comment-15830644
 ] 

Davies Liu commented on SPARK-17602:


The Python workers are reused by default; could you re-run the benchmark 
while reusing the workers?
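For reference, worker reuse is controlled by the spark.python.worker.reuse 
configuration (true by default); a minimal sketch of pinning it explicitly 
so the two benchmark runs are easy to compare:

{code}
from pyspark import SparkConf, SparkContext

# spark.python.worker.reuse defaults to true; set it explicitly here
# (use "false" for the non-reused comparison run).
conf = SparkConf().set("spark.python.worker.reuse", "true")
sc = SparkContext(conf=conf)
{code}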




[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2017-01-19 Thread Junfeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830528#comment-15830528
 ] 

Junfeng commented on SPARK-17602:
-

Thanks [~holdenk]!

[~davies], could you let me know your comments on the design? We can have a 
phone call to review the code changes.




[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2017-01-18 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828709#comment-15828709
 ] 

holdenk commented on SPARK-17602:
-

Ah yes, sorry, I've been pretty busy. I just had an interesting chat over the 
weekend at a conference with someone who was running into some challenges that 
could be improved by this, so let's take a look. If [~davies] has some 
bandwidth to look at the design doc, that would be a good starting point; 
otherwise, making a PR would maybe be a good next step, and then [~davies] or 
I could take a look after Spark Summit (I've got some stuff I need to get in 
order before then).




[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2017-01-18 Thread Junfeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828674#comment-15828674
 ] 

Junfeng commented on SPARK-17602:
-

[~holdenk] could you send me instructions on how to move this forward? It has 
been open for a long time.




[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2016-10-27 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611611#comment-15611611
 ] 

holdenk commented on SPARK-17602:
-

This certainly looks interesting. Do you maybe have some code you could make a 
draft PR with? It could be tagged as WIP. Once we've got a draft PR together, 
we could maybe reach out to the dev list to get some feedback.




[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable

2016-09-20 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508065#comment-15508065
 ] 

Miao Wang commented on SPARK-17602:
---

Does this change also benefit/impact Windows OS? 
