It seems possible that you are running out of memory while unrolling a single partition of the RDD. That alone can cause an executor to OOM, especially if the cache is already close to full and the executor has little free memory left. How large are your executors? At the time of failure, is the cache already nearly full?
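If it helps with that second question, here's a minimal, untested sketch you could run from the spark-shell of the affected application. If I remember correctly, getExecutorMemoryStatus reports, per executor, the maximum and the remaining memory available for caching:

    // Rough, untested sketch: print how much storage (cache) memory each
    // executor still has free. Values are in bytes:
    // (max memory for caching, remaining memory for caching).
    sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
      println(f"$executor%-30s max=${maxMem / 1024 / 1024}%6d MB  free=${remaining / 1024 / 1024}%6d MB")
    }

If the free numbers are near zero right before the failure, that would support the theory that the cache is crowding out the memory needed to unroll the partition.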
I also believe the Snappy compression codec in Hadoop is not splittable. That means each of your JSON files is read in its entirety as a single Spark partition. If you have files larger than the standard block size (128MB), that will make the problem worse. Incidentally, it also means minPartitions won't help you at all here. This is fixed in the master branch and will be in Spark 1.1. As a debugging step (if it's doable), it's worth running this job on the master branch and seeing if it succeeds.

https://github.com/apache/spark/pull/1165

A (potential) workaround would be to first persist your data to disk, then repartition it, then cache it. I'm not 100% sure whether that will work, though:

    import org.apache.spark.storage.StorageLevel

    val a = sc.textFile("s3n://some-path/*.json")
      .persist(StorageLevel.DISK_ONLY)
      .repartition(numPartitions)  // substitute a larger number of partitions
      .cache()

- Patrick

On Fri, Aug 1, 2014 at 10:17 AM, Nicholas Chammas
<nicholas.cham...@gmail.com> wrote:

> On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Isn't this your worker running out of its memory for computations,
>> rather than for caching RDDs?
>
> I'm not sure how to interpret the stack trace, but let's say that's true.
> I'm even seeing this with a simple a = sc.textFile().cache() and then
> a.count(). Spark shouldn't need that much memory for this kind of work,
> no?
>
>> then the answer is that you should tell
>> it to use less memory for caching.
>
> I can try that. That's done by changing spark.storage.memoryFraction,
> right?
>
> This still seems strange, though. The default fraction of the JVM left
> for non-cache activity (1 - 0.6 = 40%
> <http://spark.apache.org/docs/latest/configuration.html#execution-behavior>)
> should be plenty for just counting elements. I'm using m1.xlarge nodes
> that have 15GB of memory apiece.
>
> Nick
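In case it's useful, here is a minimal, untested sketch of the "use less memory for caching" route discussed above, plus a quick check of how many partitions those S3 files actually produce. The app name and the 0.4 value are only placeholders, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Untested sketch: give less of the heap to the storage (cache) layer
    // and more to computation. The default for this setting is 0.6.
    val conf = new SparkConf()
      .setAppName("cache-debug")                   // placeholder name
      .set("spark.storage.memoryFraction", "0.4")  // placeholder value
    val sc = new SparkContext(conf)

    // With a non-splittable codec you get one partition per file, so
    // minPartitions has no effect; check what you actually got:
    val a = sc.textFile("s3n://some-path/*.json").cache()
    println(s"partitions: ${a.partitions.length}")
    println(s"count: ${a.count()}")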