Hey Jeremy, what happens if you pass batchSize=10 as an argument to your SparkContext? PySpark serializes that many objects together at a time, and with rows this long the default batchSize of 1024 might be too much.
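If it helps, this is what I mean (a minimal sketch; the master URL, app name, and file path are placeholders for whatever your setup uses):

from pyspark import SparkContext

# batchSize controls how many Python objects are serialized together per batch;
# 10 is much smaller than the default of 1024
sc = SparkContext("spark://your-master:7077", "repro", batchSize=10)

data = sc.textFile("path/to/myfile")
data.count()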
Matei

On Mar 23, 2014, at 10:11 AM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:

> Hi all,
>
> Hitting a mysterious error loading large text files, specific to PySpark 0.9.0.
>
> In PySpark 0.8.1, this works:
>
> data = sc.textFile("path/to/myfile")
> data.count()
>
> But in 0.9.0, it stalls. There are indications of completion up to:
>
> 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X (progress: 15/537)
> 14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)
>
> And then this repeats indefinitely:
>
> 14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5, runningTasks: 144
> 14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5, runningTasks: 144
>
> It always stalls at the same place. There's nothing in stderr on the workers, but in stdout there are several of these messages:
>
> INFO PythonRDD: stdin writer to Python finished early
>
> So perhaps the real error is being suppressed, as in
> https://spark-project.atlassian.net/browse/SPARK-1025
>
> The data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k characters per row. Running on a private cluster with 10 nodes, 100GB / 16 cores each, Python v 2.7.6.
>
> I doubt the data is corrupted, as it works fine in Scala in 0.8.1 and 0.9.0, and in PySpark in 0.8.1. Happy to post the file, but it should repro for anything with these dimensions. It *might* be specific to long strings: I don't see it with fewer characters (10k) per row, but I also don't see it with many fewer rows but the same number of characters per row.
>
> Happy to try and provide more info / help debug!
>
> -- Jeremy
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/error-loading-large-files-in-PySpark-0-9-0-tp3049.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
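For anyone else trying to reproduce this without the original file, here's a rough sketch that writes a file of roughly the dimensions described above (~300k rows of space-separated numbers, ~50k characters per row); the column count, value format, and path are arbitrary, and you'll want to shrink n_rows for a quick local test:

import random

n_rows = 300000
n_cols = 7000  # ~7000 values at ~7 characters each is roughly 50k characters per row

with open("path/to/myfile", "w") as f:
    for _ in range(n_rows):
        f.write(" ".join("%.4f" % random.random() for _ in range(n_cols)))
        f.write("\n")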