[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621389#comment-16621389 ]

holdenk commented on SPARK-17602:
---------------------------------

Did we end up going anywhere with this?

> PySpark - Performance Optimization Large Size of Broadcast Variable
> -------------------------------------------------------------------
>
>                 Key: SPARK-17602
>                 URL: https://issues.apache.org/jira/browse/SPARK-17602
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.6.2, 2.0.0
>         Environment: Linux
>            Reporter: Xiao Ming Bao
>            Priority: Major
>         Attachments: PySpark – Performance Optimization for Large Size of Broadcast variable.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> Problem: currently, at the executor side, each broadcast variable is written to local disk as a file, and each Python worker process reads the broadcast from local disk and deserializes it into a Python object before executing a task. When the broadcast variable is large, this read/deserialization takes a long time. And when the Python worker is NOT reused and the number of tasks is large, performance is very poor, since the Python worker must read and deserialize the broadcast for every task.
>
> Brief of the solution: transfer the broadcast variable to the daemon Python process via a file (or socket/mmap) and deserialize it into an object in the daemon process. After a worker Python process is forked by the daemon, the worker automatically has the deserialized object and can use it directly, because of the copy-on-write memory semantics of Linux.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
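The fork-plus-copy-on-write idea in the description can be sketched in plain Python. This is a hedged illustration, not Spark's actual daemon/worker code: the names `BROADCAST` and `run_task` are hypothetical stand-ins. The parent ("daemon") deserializes once; each forked child ("worker") sees the already-deserialized object without re-reading it from disk.

```python
import pickle
import multiprocessing as mp

BROADCAST = None  # populated in the daemon process before forking

def run_task(task_id, out_queue):
    # A forked worker inherits the parent's deserialized object via
    # copy-on-write; no per-task read/deserialize is needed.
    out_queue.put((task_id, len(BROADCAST)))

def main():
    global BROADCAST
    # Daemon side: deserialize the broadcast payload exactly once.
    # (pickle.dumps here stands in for the broadcast file on disk.)
    serialized = pickle.dumps(list(range(10000)))
    BROADCAST = pickle.loads(serialized)

    ctx = mp.get_context("fork")  # COW sharing relies on fork (Linux)
    q = ctx.Queue()
    workers = [ctx.Process(target=run_task, args=(i, q)) for i in range(3)]
    for w in workers:
        w.start()
    results = sorted(q.get() for _ in workers)
    for w in workers:
        w.join()
    return results

if __name__ == "__main__":
    print(main())
```

As long as the workers only read `BROADCAST`, the kernel never copies the pages, so memory and startup cost stay roughly constant regardless of how many workers fork.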
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162457#comment-16162457 ]

holdenk commented on SPARK-17602:
---------------------------------

[~liujunf] how about you go ahead and make a pull request with [WIP] in the title so we can all take a look at it? I've got some more bandwidth available to do reviews, and if we need to we can discuss it some more @ Spark Summit.
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15831964#comment-15831964 ]

Junfeng commented on SPARK-17602:
---------------------------------

[~davies] the trouble is that the Python worker reuse mode does not work for many cases. For example, static variables will cause trouble across tasks. Many of our users do not enable worker reuse, and that leads to this issue.
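The "static variable" hazard Junfeng mentions can be shown with a minimal sketch. This is illustrative only; `counter` and `run_task` are hypothetical names, but the mechanism is the real one: a reused worker process keeps its module-level state alive between tasks, so one task's writes leak into the next.

```python
# Module-level ("static") state in a long-lived Python worker process.
counter = 0

def run_task(task_id):
    # Each task assumes it starts from fresh state, but in a reused
    # worker the global survives from the previous task.
    global counter
    counter += 1
    return counter

# Simulate one reused worker executing three tasks back to back.
results = [run_task(i) for i in range(3)]
# A fresh worker per task would see counter == 1 every time;
# a reused worker observes 1, 2, 3 as state accumulates.
```

This is why some users disable reuse for correctness, which in turn makes the per-task broadcast deserialization cost in this ticket much more painful.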
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830644#comment-15830644 ]

Davies Liu commented on SPARK-17602:
------------------------------------

The Python workers are reused by default; could you re-run the benchmark while reusing the workers?
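For reference, worker reuse is controlled by the `spark.python.worker.reuse` configuration, which defaults to true; a benchmark re-run along the lines Davies suggests might look like the following (the script name is a placeholder):

```shell
# Re-run with worker reuse explicitly enabled (the default):
spark-submit --conf spark.python.worker.reuse=true my_benchmark.py

# And for comparison, with reuse disabled, which forces each task
# to pay the broadcast read/deserialize cost:
spark-submit --conf spark.python.worker.reuse=false my_benchmark.py
```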
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830528#comment-15830528 ]

Junfeng commented on SPARK-17602:
---------------------------------

Thanks [~holdenk] [~davies]. Could you let me know your comments on the design? We can have a phone call to review the code changes.
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828709#comment-15828709 ]

holdenk commented on SPARK-17602:
---------------------------------

Ah yes, sorry, I've been pretty busy. I just had an interesting chat over the weekend at a conference with someone who was running into some challenges that could be improved by this, so let's take a look. If [~davies] has some bandwidth to look at the design doc, that would be a good starting point; otherwise, making a PR would maybe be a good next step, and then [~davies] or I could take a look after Spark Summit (I've got some stuff I need to get in order before then).
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828674#comment-15828674 ]

Junfeng commented on SPARK-17602:
---------------------------------

[~holdenk] could you send me instructions on how to move this forward? It has been open for a long time.
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15611611#comment-15611611 ]

holdenk commented on SPARK-17602:
---------------------------------

This certainly looks interesting; do you maybe have some code you could make a draft PR with? It could be tagged as WIP. Once we've got a draft PR together, we could maybe reach out to the dev list to get some feedback.
[jira] [Commented] (SPARK-17602) PySpark - Performance Optimization Large Size of Broadcast Variable
[ https://issues.apache.org/jira/browse/SPARK-17602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508065#comment-15508065 ]

Miao Wang commented on SPARK-17602:
-----------------------------------

Does this change also benefit/impact Windows OS?