There is a distinction to be made between the number of incoming partitions
and the number of reducers. Let's say that the number of incoming partitions
is more or less irrelevant, especially since we don't directly control the
number of partitions in the input data set (as you pointed out). What matters
more when shuffling data is the number of *reducers*, or output partitions
for the shuffle.

To be 100% clear on what I mean, a shuffle completely redistributes input
partitions to output partitions, so you might start with 67k incoming
partitions, but if you have 10k reducers, you'll end up with 10k partitions
after the reduce. Each output partition must fit in memory, since some of
Spark's reducing operations require that, but unlike the input partitioning,
the number of reducers is fully under your control, so you can tune it until
each output partition does fit.
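For a concrete example of setting the reducer count, something like the
following sketch is all it takes (the S3 path, the key extraction, and the
choice of 1,000 reducers are made up for illustration, and I'm assuming the
usual SparkContext implicits are in scope):

    // sc is an existing SparkContext; the path and key extraction are placeholders.
    val records = sc.textFile("s3n://some-bucket/logs/*")   // ~67k input partitions
      .map(line => (line.split("\t")(0), 1L))

    // The second argument to reduceByKey is the number of reducers
    // (output partitions) for the shuffle, independent of the 67k inputs.
    val counts = records.reduceByKey(_ + _, 1000)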

In your particular case, let's say you expect around 1.1TB to make it to
the reducers, and additionally that you have 8GB of RAM per executor core.
We need to ensure each output partition fits in memory, so the lower bound
on the number of reducers is 1.1TB/8GB ≈ 138. Any fewer than that and we
would expect the reducers to OOM. Additionally, since we expect some,
possibly very significant, Java overhead, we may want 2-4 times as many
reducers to ensure they actually fit in memory.
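Written out as a quick back-of-the-envelope snippet (the figures are just the
assumptions above, and 3x is an arbitrary midpoint of the 2-4x range):

    val shuffleBytes   = 1.1e12    // ~1.1TB expected at the reducers
    val memPerTask     = 8e9       // ~8GB of RAM per executor core
    val overheadFactor = 3         // headroom for Java overhead

    val minReducers = math.ceil(shuffleBytes / memPerTask).toInt   // 138
    val suggested   = minReducers * overheadFactor                  // ~400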

While we now have a lower bound on the number of reducers, it also turns
out that we want to minimize the number of reducers, because each reducer
has significant metadata associated with it. In particular, the total
amount of metadata per machine for a shuffle is approximately 8*M*R/N bytes
for M mappers, R reducers, and N machines. So if we used 67k *reducers* as
well as incoming partitions (and 5 machines), we'd expect the metadata to
take 8*67k*67k/5 ≈ 7.2 GB per machine. If, however, we used only 1,000
reducers (which is ~7x the lower bound), we'd expect the metadata to take
only ~100 MB per machine.
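If it helps to play with these numbers, here is the same estimate as a tiny
snippet (the function name is mine, and the formula is only an approximation):

    // ~8 * M * R / N bytes of shuffle metadata per machine (approximate).
    def metadataBytesPerMachine(mappers: Long, reducers: Long, machines: Long): Long =
      8L * mappers * reducers / machines

    metadataBytesPerMachine(67000, 67000, 5)   // ~7.2e9 bytes, i.e. ~7.2 GB
    metadataBytesPerMachine(67000,  1000, 5)   // ~1.1e8 bytes, i.e. ~100 MB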

This balancing act for the number of reducers is not fun. Ideally, Spark
wouldn't be so sensitive to your choices for the number of partitions and
reducers, but fixing that is future work.

So a key question for you is, how many reducers did you use in this task?
I'll also be very interested to see any heap dumps, as it's quite possible
a bug is causing more memory to be used somewhere than expected, so I'd
like to corroborate what you're seeing with what we expect.


On Thu, Nov 21, 2013 at 1:00 PM, Stephen Haberman <[email protected]> wrote:

> Hi Patrick/Aaron,
>
> Sorry to revive this thread, but we're seeing some OOME errors again
> (running with master a few commits after Aaron's optimizations). I
> can tweak our job, but I just wanted to ask some clarifications.
>
> > Another way to fix it is to modify your job to create fewer
> > partitions.
>
> So, I get the impression that most people are not having OOMEs like we
> are...why is that? Do we really have significantly more partitions than
> most people use?
>
> When Spark loads data directly from Hadoop/HDFS (which we don't do),
> AFAIK the default partition size is 64mb, which surely results in a
> large number of partitions (>10k?) for many data sets?
>
> When this job loads from S3, 1 month of data was originally 67,000
> files (1 file per partition), but then we have a routine that coalesces
> it down to ~64mb partitions, which for this job meant 18,000 partitions.
>
> 18k partitions * 64mb/partition = ~1.1tb, which matches how much data
> is in S3.
>
> How many partitions would usually be a good idea for 1tb of data?
>
> I was generally under the impression, what with the Sparrow/etc. slides
> I had come across, that smaller partition sizes were better
> scheduled/retried/etc. anyway.
>
> So, yeah, I can manually partition this job down to a lower number, or
> adjust our auto-coalescing to shoot for larger partitions, but I was
> just hoping to get some feedback about what the Spark team considers an
> acceptable number of partitions, both in general and for a ~1tb size
> RDD.
>
> I'll send a separate email with the specifics of the OOME/heap dump.
>
> Thanks!
>
> - Stephen
>
>
