Hi Aaron,

> You're precisely correct about your (N*M)/(# machines) shuffle blocks
> per machine. I believe the 5.5 million data structures instead of 9.8
> comes from the fact that the shuffle was only around 50% of the way
> through before it blew up.
Cool, that sounds right. I like it when numbers match up, as it means my
mental model might not be horribly wrong. :-)

> However, Spark in general takes the stance that if a partition doesn't fit in
> memory, things may blow up.

Ah, okay. I knew that was the case before, just wasn't sure if it had been
loosened. Currently, we have to do some gyrations for this... like if a report
wants to load N files from S3 into an RDD, we total up their size, divide by
our desired partition size (64mb, which is Hadoop's default block size IIRC),
and then coalesce to that. That's how we got 7,000 partitions for a month of
data (500,000mb / 64mb = ~7k partitions).

(Without the coalesce, we have lots of tiny log files, so our number of
partitions shot way, way up, which, yeah, was blowing up.)

And if we were to set spark.default.parallelism to, say, number of machines
* 5, so 25 in this case, that would drop down to just 25 partitions, so, in
the naive case where we have all 500gb of data still in the RDD, that'd be
20gb per partition.

Granted, we could set spark.default.parallelism higher, but it seems hard to
find the right value for a global config variable given that each cogroup
will have a different amount of data/existing partitions. That's why we've
avoided it so far, and I guess have just gotten lucky that we've used big
enough cluster sizes not to notice the N*M blow up. (We had also run into
Spark slowing down when the default parallelism was too high--lots of really
tiny tasks IIRC.)

Well, darn. I was going to be really excited if Spark could stream RDDs... I
had assumed shuffling was the biggest/only thing that assumed in-memory
partitions.

I guess we could bump up our target partition size from 64mb to 500mb or
1gb... at the time, we were getting a lot of partition wonkiness (OOMEs/etc.)
that it seemed other people weren't getting with Spark, and I attributed this
to most people reading data pre-partitioned from HDFS, while all of our data
always comes in via S3 (in tiny gzip files). So I thought matching HDFS
partitions as closely as possible would be the safest thing to do.

Thanks for the input--I'll mull over what we should do, and for now try a
higher target partition size. Any other insights are appreciated.

- Stephen
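
P.S. For concreteness, the sizing logic I'm describing is roughly the
following. This is a simplified sketch--the object/method names are made up
for illustration, and the S3 listing that produces the (path, size) pairs is
elided:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object PartitionSizing {
      // 64mb target, matching the HDFS default block size mentioned above.
      val TargetPartitionBytes: Long = 64L * 1024 * 1024

      // `files` is a list of (s3Path, sizeInBytes) pairs; building it
      // (an S3 listing) is elided here.
      def loadLogs(sc: SparkContext, files: Seq[(String, Long)]): RDD[String] = {
        val totalBytes = files.map(_._2).sum
        // e.g. ~500gb of logs at 64mb per partition => a few thousand partitions
        val numPartitions = math.max(1, (totalBytes / TargetPartitionBytes).toInt)
        sc.textFile(files.map(_._1).mkString(","))  // tiny gzip files => many input partitions
          .coalesce(numPartitions)                  // collapse down to the size-based count
      }
    }

(coalesce with the default shuffle=false just merges existing partitions
rather than doing a full shuffle, which is why we size things up front
instead of repartitioning afterwards.)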
