[jira] [Updated] (SPARK-21045) Spark executor is blocked instead of throwing an exception because an exception occurs when the Python worker sends the exception stack trace to the Java gateway

2017-06-10 Thread Joshuawangzj (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-21045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joshuawangzj updated SPARK-21045:
---------------------------------
Description: 
My PySpark program always hangs in our production YARN cluster. I ran jstack on the blocked executor and found:

{code}
"Executor task launch worker for task 0" #60 daemon prio=5 os_prio=31 
tid=0x7fb2f44e3000 nid=0xa003 runnable [0x000123b4a000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
- locked <0x0007acab1c98> (a java.io.BufferedInputStream)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:190)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

The executor is blocked in a socket read. I checked the logs of the blocked executor and found this error:

{code}
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 178, in main
    write_with_length(traceback.format_exc().encode("utf-8"), outfile)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 618: ordinal not in range(128)
{code}

Finally I found the problem:

{code:title=worker.py|borderStyle=solid}
# Around line 178 in Spark 2.1.1
except Exception:
    try:
        write_int(SpecialLengths.PYTHON_EXCEPTION_THROWN, outfile)
        write_with_length(traceback.format_exc().encode("utf-8"), outfile)
    except IOError:
        # JVM close the socket
        pass
    except Exception:
        # Write the error to stderr if it happened while serializing
        print("PySpark worker failed with exception:", file=sys.stderr)
        print(traceback.format_exc(), file=sys.stderr)
{code}

When write_with_length(traceback.format_exc().encode("utf-8"), outfile) itself raises an exception such as the UnicodeDecodeError above, the Python worker never sends the traceback. But the JVM side has already received PYTHON_EXCEPTION_THROWN, so PythonRDD next tries to read the length of the traceback and blocks on the socket (see the defensive sketch after the Scala snippet below).

{code:title=PythonRDD.scala|borderStyle=solid}
// Around line 190 in Spark 2.1.1
case SpecialLengths.PYTHON_EXCEPTION_THROWN =>
  // Signals that an exception has been thrown in python
  val exLength = stream.readInt()  // It is possible to be blocked here
{code}
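
One way to avoid this kind of deadlock is to build the traceback payload before signalling the exception to the JVM, so an encoding failure can never leave the JVM waiting for a length that is never written. The following is only a minimal standalone sketch, not the actual Spark patch; the helper name report_exception and the literal -2 (standing in for SpecialLengths.PYTHON_EXCEPTION_THROWN in pyspark.serializers) are assumptions for illustration.

{code:title=report_exception_sketch.py|borderStyle=solid}
from __future__ import print_function

import struct
import sys
import traceback

# Assumption for this sketch: -2 mirrors SpecialLengths.PYTHON_EXCEPTION_THROWN.
PYTHON_EXCEPTION_THROWN = -2


def report_exception(outfile):
    """Send the current exception's traceback to the JVM, or fall back to stderr."""
    try:
        tb = traceback.format_exc()
        if isinstance(tb, bytes):
            # Python 2: format_exc() already returns a byte string; encoding it again
            # would implicitly decode with ASCII and can raise UnicodeDecodeError.
            payload = tb
        else:
            payload = tb.encode("utf-8", "replace")
        # Only signal PYTHON_EXCEPTION_THROWN once the payload is known to be
        # writable, then always follow it with the length and the bytes.
        outfile.write(struct.pack("!i", PYTHON_EXCEPTION_THROWN))
        outfile.write(struct.pack("!i", len(payload)))
        outfile.write(payload)
        outfile.flush()
    except IOError:
        pass  # JVM closed the socket
    except Exception:
        print("PySpark worker failed with exception:", file=sys.stderr)
        print(traceback.format_exc(), file=sys.stderr)
{code}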

{color:red}
We can trigger the bug with this simple program:
{color}
{code:title=test.py|borderStyle=solid}
# Run with Python 2: '中' is a byte string there, so x.encode("utf8") fails in the worker.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').getOrCreate()
rdd = spark.sparkContext.parallelize(['中']).map(lambda x: x.encode("utf8"))
rdd.collect()  # blocks forever instead of surfacing the worker's exception
{code}
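
As far as I can tell, the UnicodeDecodeError itself is a Python 2 quirk: traceback.format_exc() returns a byte string there, and calling .encode("utf-8") on a byte string first decodes it with the ASCII codec, so it fails as soon as the traceback contains a non-ASCII byte (here the UTF-8 bytes of '中'). A standalone demonstration (Python 2 only):

{code:title=encode_pitfall.py|borderStyle=solid}
# Python 2 only: .encode() on a byte string implicitly decodes it with ASCII first.
tb = "ValueError: \xe4\xb8\xad"  # a byte string containing the UTF-8 bytes of '中'
try:
    tb.encode("utf-8")  # raises UnicodeDecodeError, not UnicodeEncodeError
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xe4 ...
{code}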
