[GitHub] spark pull request #17694: [SPARK-12717][PYSPARK] Resolving race condition w...

2017-08-05 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17694


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17694: [SPARK-12717][PYSPARK] Resolving race condition w...

2017-04-19 Thread vundela
GitHub user vundela opened a pull request:

https://github.com/apache/spark/pull/17694

[SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcas…

…ts when using multiple threads

## What changes were proposed in this pull request?

In pyspark when multiple threads are used, broadcast variables are picked 
with wrong PythonRDD wrap functions which leads to the following 
exception(Because of the race condition between the threads on java side with 
py4j).

16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call 
last):
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
 line 98, in main
command = pickleSer._read_with_length(infile)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 164, in _read_with_length
return self.loads(obj)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 422, in loads
return pickle.loads(obj)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
 line 39, in _from_id
raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '6' not loaded!",), , (6L,))

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

This change will fix the race condition by making sure that broadcast 
variables are picked with same pythonRDD function.

## How was this patch tested?
1) Reproduced the issue mentioned in SPARK-12717, following the 
instructions specified in jira
2) Make sure that issue is fixed with the changes.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vundela/spark SPARK-12717

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17694.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17694


commit 4670678ba4fa7f2a7f66bc9309716fcefb05c54d
Author: Srinivasa Reddy Vundela 
Date:   2017-04-20T01:37:12Z

[SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts 
when using multiple threads




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org