Hi Aaron,

> You're precisely correct about your (N*M)/(# machines) shuffle blocks
> per machine. I believe the 5.5 million data structures instead of 9.8
> comes from the fact that the shuffle was only around 50% of the way
> through before it blew up.
Cool, that sounds right. I like it when numbers match up, as it means my
mental model might not be horribly wrong. :-)

> However, Spark in general takes the stance that if a partition doesn't fit in
> memory, things may blow up.

Ah, okay. I knew that was the case before, just wasn't sure if it had been
loosened. Currently, we have to do some gyrations for this... like if a report
wants to load N files from S3 into an RDD, we total up their size, divide by
our desired partition size (64mb, which is Hadoop's default block size IIRC),
and then coalesce to that. That's how we got 7,000 partitions for a month of
data (500,000mb / 64mb = ~7k partitions).

(Without the coalesce, we have lots of tiny log files, so our number of
partitions shot way, way up, which, yeah, was blowing up.)

And if we were to set spark.default.parallelism to, say, number of machines
* 5, so 25 in this case, that would drop down to just 25 partitions, so, in
the naive case where we have all 500gb of data still in the RDD, that'd be
20gb per partition.

Granted, we could set spark.default.parallelism higher, but it seems hard to
find the right value for a global config variable given that each cogroup
will have a different amount of data/existing partitions. That's why we've
avoided it so far, and I guess have just gotten lucky that we've used big
enough cluster sizes not to notice the N*M blow up. (We had also run into
Spark slowing down when the default parallelism was too high--lots of really
tiny tasks IIRC.)

Well, darn. I was going to be really excited if Spark could stream RDDs... I
had assumed shuffling was the biggest/only thing that assumed in-memory
partitions.

I guess we could bump up our target partition size from 64mb to 500mb or
1gb... at the time, we were getting a lot of partition wonkiness (OOMEs/etc.)
that it seemed other people weren't getting with Spark, and I attributed this
to most people reading data pre-partitioned from HDFS, while all of our data
always comes in via S3 (in tiny gzip files). So I thought matching HDFS
partitions as closely as possible would be the safest thing to do.

Thanks for the input--I'll mull over what we should do, and for now try a
higher target partition size. Any other insights are appreciated.

- Stephen
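
P.S. For concreteness, the sizing logic I'm describing is roughly the
following. This is a simplified sketch--the object/method names are made up
for illustration, and the S3 listing that produces the (path, size) pairs is
elided:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object PartitionSizing {
      // 64mb target, matching the HDFS default block size mentioned above.
      val TargetPartitionBytes: Long = 64L * 1024 * 1024

      // `files` is a list of (s3Path, sizeInBytes) pairs; building it
      // (an S3 listing) is elided here.
      def loadLogs(sc: SparkContext, files: Seq[(String, Long)]): RDD[String] = {
        val totalBytes = files.map(_._2).sum
        // e.g. ~500gb of logs at 64mb per partition => a few thousand partitions
        val numPartitions = math.max(1, (totalBytes / TargetPartitionBytes).toInt)
        sc.textFile(files.map(_._1).mkString(","))  // tiny gzip files => many input partitions
          .coalesce(numPartitions)                  // collapse down to the size-based count
      }
    }

(coalesce with the default shuffle=false just merges existing partitions
rather than doing a full shuffle, which is why we size things up front
instead of repartitioning afterwards.)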
