This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new 3337477  [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing
3337477 is described below

commit 3337477b759433f56d2a43be596196479f2b00de
Author: Hyukjin Kwon <gurwls...@apache.org>
AuthorDate: Wed Jan 16 23:25:57 2019 +0800

    [SPARK-25992][PYTHON] Document SparkContext cannot be shared for multiprocessing
    
    This PR proposes to explicitly document that a SparkContext cannot be shared for multiprocessing, and that multi-processing execution is not guaranteed in PySpark.
    
    I have seen cases from time to time where users attempt to use multiple processes via the `multiprocessing` module. For instance, see the example in the JIRA (https://issues.apache.org/jira/browse/SPARK-25992).
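
    As a hypothetical sketch (not the exact reproduction from the JIRA; the `count_partition` helper and the input values are made up for illustration), the unsupported pattern usually looks like this:

    ```python
    # Hypothetical sketch of the unsupported pattern; names are illustrative.
    # Each forked worker inherits a copy of the driver's SparkContext, but the
    # underlying Py4J gateway and accumulator server state are not shared, so
    # runs like this can fail with errors such as the KeyError shown below.
    from multiprocessing import Pool

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def count_partition(start):
        # Uses the driver's SparkContext from a forked child process -- unsupported.
        return sc.parallelize(range(start, start + 100)).count()

    if __name__ == "__main__":
        with Pool(4) as pool:
            print(pool.map(count_partition, [0, 100, 200, 300]))  # unreliable
    ```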
    
    Py4J itself does not support Python's multiprocessing out of the box (for instance, sharing the same JavaGateway across processes).
    
    In general, such a pattern can cause errors with somewhat arbitrary symptoms that are difficult to diagnose. For instance, see the error message in the JIRA:
    
    ```
    Traceback (most recent call last):
    File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, 
in _handle_request_noblock
        self.process_request(request, client_address)
    File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, 
in process_request
        self.finish_request(request, client_address)
    File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, 
in finish_request
        self.RequestHandlerClass(request, client_address, self)
    File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, 
in __init__
        self.handle()
    File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 
238, in handle
        _accumulatorRegistry[aid] += update
    KeyError: 0
    ```
    
    The root cause is that the global `_accumulatorRegistry` is not shared across processes.
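
    To make the process isolation concrete, here is a minimal plain-Python sketch (no Spark involved; the `_registry` dict simply stands in for a module-level registry):

    ```python
    # Minimal sketch: module-level globals are per-process, not shared.
    from multiprocessing import Process

    _registry = {}

    def worker():
        # This mutates the child's copy only; the parent's dict stays empty.
        _registry[0] = 1

    if __name__ == "__main__":
        p = Process(target=worker)
        p.start()
        p.join()
        print(_registry)  # {} -> updates made in another process never arrive here
    ```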
    
    Using threads instead of processes is quite easy in Python. Compare `threading` vs `multiprocessing` in Python - they can usually be a direct replacement for each other. For instance, Python also supports a thread pool (`multiprocessing.pool.ThreadPool`), which can be a direct replacement for the process-based pool (`multiprocessing.Pool`).
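
    A minimal sketch of the thread-based alternative, assuming an already running SparkContext and a made-up `count_range` job:

    ```python
    # Minimal sketch: ThreadPool has the same map/apply API as multiprocessing.Pool,
    # but the jobs run in threads of the driver process and share one SparkContext.
    from multiprocessing.pool import ThreadPool

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    def count_range(start):
        return sc.parallelize(range(start, start + 100)).count()

    with ThreadPool(4) as pool:
        print(pool.map(count_range, [0, 100, 200, 300]))  # concurrent jobs, one SparkContext
    ```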
    
    Manually tested, and manually built the doc.
    
    Closes #23564 from HyukjinKwon/SPARK-25992.
    
    Authored-by: Hyukjin Kwon <gurwls...@apache.org>
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
    (cherry picked from commit 670bc55f8d357a5cd894e290cc2834e952a7cfe0)
    Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/pyspark/context.py | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 6d99e98..aff3635 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -63,6 +63,10 @@ class SparkContext(object):
     Main entry point for Spark functionality. A SparkContext represents the
     connection to a Spark cluster, and can be used to create L{RDD} and
     broadcast variables on that cluster.
+
+    .. note:: A :class:`SparkContext` instance is not supported to be shared across multiple
+        processes out of the box, and PySpark does not guarantee multi-processing execution.
+        Use threads instead for concurrent processing purposes.
     """
 
     _gateway = None


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
