Hi Jeremy,

If you look at the stdout and stderr files on that worker, do you see any earlier errors? I wonder if one of the Python workers crashed earlier.

It would also be good to run “top” and see if more memory is used during the computation. I guess the cached RDD itself fits in less than 50% of the RAM, as you said?

Matei
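A minimal sketch of Matei's suggestion, for anyone who can't easily watch “top” by hand: sample the resident memory of the Python worker and executor JVM processes on a node while the job runs. This assumes the third-party psutil package is installed on the worker; the script and its names are illustrative, not part of Spark.

    import time
    import psutil  # assumed to be installed on the worker; not part of Spark

    def sample_worker_memory(interval_s=5, samples=12):
        # Print resident memory of python/java processes every few seconds,
        # roughly what you would watch in `top` during the computation.
        for _ in range(samples):
            for proc in psutil.process_iter(["name", "memory_info"]):
                name = proc.info.get("name") or ""
                mem = proc.info.get("memory_info")
                if mem and name.startswith(("python", "java")):
                    print("%-8s pid=%-7d rss=%6.0f MB" % (name, proc.pid, mem.rss / 1e6))
            time.sleep(interval_s)

    if __name__ == "__main__":
        sample_worker_memory()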
On Jan 12, 2014, at 8:45 PM, Jeremy Freeman <[email protected]> wrote:

> I'm reliably getting a bug in PySpark where jobs with many iterative
> calculations on cached data stall out.
>
> Data is a folder of ~40 text files, each with 2 million rows and 360 entries
> per row; total size is ~250 GB.
>
> I'm testing with the KMeans analyses included as examples (though I see the
> same error on my own iterative algorithms). The Scala version completes 50+
> iterations fine. In PySpark, it successfully completes 9 iterations and then
> stalls. On the driver, I'll get this:
>
> java.net.NoRouteToHostException: Cannot assign requested address
>     at java.net.PlainSocketImpl.socketConnect(Native Method)
>     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>     at java.net.Socket.connect(Socket.java:579)
>     at java.net.Socket.connect(Socket.java:528)
>     at java.net.Socket.<init>(Socket.java:425)
>     at java.net.Socket.<init>(Socket.java:208)
>     at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:328)
>     at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:311)
>     at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:70)
>     at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:253)
>     at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:251)
>     at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>     at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
>     at scala.collection.Iterator$class.foreach(Iterator.scala:772)
>     at scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
>     at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
>     at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
>     at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
>     at org.apache.spark.Accumulators$.add(Accumulators.scala:251)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:598)
>     at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:376)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
>     at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)
>
> But the job will continue after delivering this error, appearing to finish
> the remaining tasks, until it displays:
>
> INFO ClusterScheduler: Remove TaskSet 18.0 from pool
>
> And then it just stalls.
>
> The web UI shows a subset of tasks completed; the number of the current task
> is the number displayed on the driver around the time the error message
> appeared. I don't see any errors in the stdout or stderr on the worker
> executing that task, just on the driver. Memory usage on all workers and the
> driver is well below 50%.
>
> Other observations:
> - It's data-size dependent. If I load ~50 GB, it finishes 20 iterations
> before stalling. If I load ~10 GB, it finishes 35.
> - It's not due to the multiple files; I see the same error on a single large
> file.
> - I always get the error with 30 or 60 nodes, but I don't see it when using
> 20.
> - For a given cluster/data size, it stalls at the same point on every run.
>
> I was going to test all this on EC2, in case it's something specific to our
> setup (a private HPC cluster running Spark in standalone mode, with 16 cores
> and 100 GB used per node). But it'd be great if anyone had ideas in the
> meantime.
>
> Thanks!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Stalling-during-large-iterative-PySpark-jobs-tp492.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
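For context on the workload in the quoted report above, here is a rough sketch of the kind of iterative PySpark job being described: parse text rows into vectors, cache them, and repeatedly update k-means centers over the cached RDD. It is not the bundled example script or Jeremy's code; the input path, cluster count, and parsing are placeholders, and it assumes an existing SparkContext `sc` (e.g. from the pyspark shell) plus NumPy on the workers.

    import numpy as np

    def closest_point(p, centers):
        # Index of the center nearest to point p (squared Euclidean distance).
        return min(range(len(centers)),
                   key=lambda i: float(np.sum((p - centers[i]) ** 2)))

    # Placeholder path; each line is one row of ~360 whitespace-separated numbers.
    points = (sc.textFile("hdfs:///path/to/data")
                .map(lambda line: np.array([float(x) for x in line.split()]))
                .cache())

    k = 10  # placeholder cluster count
    centers = points.takeSample(False, k, seed=42)

    for it in range(50):  # the Scala version runs 50+ iterations; PySpark stalls around 9
        # Assign each cached point to its nearest center, then recompute the centers.
        assigned = points.map(lambda p: (closest_point(p, centers), (p, 1)))
        sums = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        centers = [s / n for (_, (s, n)) in sorted(sums.collect())]
        print("finished iteration %d" % it)

Each iteration only collects a small per-cluster sum back to the driver; the heavy work runs as a map/reduceByKey over the cached RDD.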
