Thanks for the thoughts, Matei! I poked at this some more. I ran top on each of the workers during the job (I'm testing with the example KMeans) and confirmed that the run dies while memory usage (of the java process) is still only around 30%. I do notice it climbing, from around 20% after the first iteration to 30% by the time it dies, but it definitely stays under 50%. Also, memory usage is around 30% when running KMeans in Scala, and I never get the error there.
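If it helps, here's a minimal sketch of the kind of per-worker polling I mean (not an exact script; the pgrep pattern and the one-java-process-per-worker assumption are simplifications, and it's Linux-specific):

import subprocess, time

def total_mem_kb():
    # Read total RAM from /proc/meminfo (Linux-specific).
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                return int(line.split()[1])

def java_rss_kb():
    # Resident set size of the first java process found; assumes the
    # worker JVM is the only (or first) java process on the box.
    pid = subprocess.check_output(["pgrep", "-f", "java"]).split()[0]
    return int(subprocess.check_output(["ps", "-o", "rss=", "-p", pid]))

total = total_mem_kb()
while True:
    print("java RSS: %.1f%% of RAM" % (100.0 * java_rss_kb() / total))
    time.sleep(5)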
I can't find anything suspect in any of the worker logs (I'm looking at stdout and stderr in spark.local.dir). The only error is the one reported to the driver. Still haven't tried reproducing on EC2; I'll let you know if I can...

-- Jeremy
