Thank you very much to both Dmitriy and Andrew. Unfortunately I'm stuck in a bit of a bind: specifying additional reducers is a problem because my workload is very reduce-heavy, so I was running into memory problems exactly as described in this thread:
https://forums.aws.amazon.com/thread.jspa?threadID=49024

I ended up having to bump my EMR slave instances up to m2.xlarge to handle the memory pressure. Since this is running on EMR, I can of course opt to throw more machines at the whole thing. Correct me if I'm wrong, but that should solve both problems at once, although it doesn't get my output files down to the ideal size I was hoping for. The task was running on 4x m2.xlarge instances (after failing on 4x m1.large and 8x c1.medium); for the next run I'll try doubling up to 8x m1.large, which will hopefully provide enough reduce slots to keep the file sizes down while avoiding the memory pressure problem.

-Zach

On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <[email protected]> wrote:
> Zach,
>
> I work on the Elastic MapReduce team. We are planning to launch
> support for multipart upload into Amazon S3 in early January. This
> will enable you to write files into Amazon S3 from your reducer that
> are up to 5 TB in size.
>
> In the meantime, Dmitriy's advice should work. Increase the number of
> reducers and each reducer will process and write less data. This will
> work unless you have a very uneven data distribution.
>
> Regards,
> Andrew
>
> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:
> > Does anyone know of any existing StoreFunc to specify a maximum
> > output file size? Or would I need to write a custom StoreFunc to do
> > this?
> >
> > I am running into a problem on Amazon's EMR where the files the
> > reducers are writing are too large to be uploaded to S3 (5 GB limit
> > per file) and I need to figure out a way to get the output file
> > sizes down into a reasonable range.
> >
> > The other way would be to fire up more machines, which would provide
> > more reducers, meaning the data is split into more files, yielding
> > smaller files. But I want the resulting files to be split at some
> > reasonable size (50-100 MB) so they are friendly for pulling down,
> > inspecting, and testing with.
> >
> > Any ideas?
> > -Zach
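
For reference, the approach Dmitriy and Andrew recommend (more reducers, so each reducer writes a smaller file) can be expressed directly in a Pig script. The sketch below is illustrative only: the S3 paths, relation names, and the specific numbers (64 and 128 reducers, 2 GB of task heap) are hypothetical placeholders rather than values from this thread; `default_parallel` needs Pig 0.8 or later, and `mapred.child.java.opts` is the standard Hadoop property for per-task heap, relevant to the memory pressure described above.

-- Sketch: raise reducer parallelism so each reducer writes a smaller
-- part file, and give each task more heap. All names and values are
-- hypothetical examples.
SET default_parallel 64;                 -- default for all blocking operators (Pig 0.8+)
SET mapred.child.java.opts '-Xmx2048m'; -- larger per-task heap for a reduce-heavy job

raw = LOAD 's3://my-bucket/input/' USING PigStorage('\t')
      AS (key:chararray, value:chararray);

-- PARALLEL on a single operator overrides default_parallel:
grouped = GROUP raw BY key PARALLEL 128;
counts  = FOREACH grouped GENERATE group, COUNT(raw) AS n;

-- With 128 reducers the output is split across 128 part files, keeping
-- each one well under the 5 GB single-PUT limit on S3.
STORE counts INTO 's3://my-bucket/output/' USING PigStorage('\t');

Note that, as Andrew says above, this only helps when the keys are reasonably evenly distributed; a heavily skewed key can still leave one reducer writing an oversized file.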
