Thank you very much to both Dmitriy and Andrew. Unfortunately I'm stuck in a bit of a bind: specifying additional reducers is a problem because my workload is very reduce-heavy, so I was running into memory problems exactly as described in this thread:
https://forums.aws.amazon.com/thread.jspa?threadID=49024

I ended up having to bump my EMR slave instances up to m2.xlarge to handle the memory pressure. Since this is running on EMR, I can of course opt to throw more machines at the whole thing. Correct me if I'm wrong, but that should solve both problems at once, although it doesn't get my output files down to the ideal size I was hoping for. The task was running on 4x m2.xlarge instances (after failing on 4x m1.large and 8x c1.medium); for the next run I'll try doubling up to 8x m1.large, which will hopefully provide enough reduce slots to keep the file sizes down while avoiding the memory pressure problem.

-Zach

On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <[email protected]> wrote:
> Zach,
>
> I work on the Elastic MapReduce team. We are planning to launch
> support for multipart upload into Amazon S3 in early January. This
> will enable you to write files into Amazon S3 from your reducer that
> are up to 5 TB in size.
>
> In the meantime, Dmitriy's advice should work. Increase the number of
> reducers and each reducer will process and write less data. This will
> work unless you have a very uneven data distribution.
>
> Regards,
> Andrew
>
> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:
> > Does anyone know of any existing StoreFunc to specify a maximum
> > output file size? Or would I need to write a custom StoreFunc to do
> > this?
> >
> > I am running into a problem on Amazon's EMR where the files the
> > reducers are writing are too large to be uploaded to S3 (5 GB limit
> > per file) and I need to figure out a way to get the output file
> > sizes down into a reasonable range.
> >
> > The other way would be to fire up more machines, which would provide
> > more reducers, meaning the data is split into more files, yielding
> > smaller files. But I want the resulting files to be split at some
> > reasonable size (50-100 MB) so they are friendly for pulling down,
> > inspecting, and testing with.
> >
> > Any ideas?
> > -Zach
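
For reference, the approach Dmitriy and Andrew recommend (more reducers, so each reducer writes a smaller file) can be expressed directly in a Pig script. The sketch below is illustrative only: the S3 paths, relation names, and the specific numbers (64 and 128 reducers, 2 GB of task heap) are hypothetical placeholders rather than values from this thread; `default_parallel` needs Pig 0.8 or later, and `mapred.child.java.opts` is the standard Hadoop property for per-task heap, relevant to the memory pressure described above.

-- Sketch: raise reducer parallelism so each reducer writes a smaller
-- part file, and give each task more heap. All names and values are
-- hypothetical examples.
SET default_parallel 64;                 -- default for all blocking operators (Pig 0.8+)
SET mapred.child.java.opts '-Xmx2048m'; -- larger per-task heap for a reduce-heavy job

raw = LOAD 's3://my-bucket/input/' USING PigStorage('\t')
      AS (key:chararray, value:chararray);

-- PARALLEL on a single operator overrides default_parallel:
grouped = GROUP raw BY key PARALLEL 128;
counts  = FOREACH grouped GENERATE group, COUNT(raw) AS n;

-- With 128 reducers the output is split across 128 part files, keeping
-- each one well under the 5 GB single-PUT limit on S3.
STORE counts INTO 's3://my-bucket/output/' USING PigStorage('\t');

Note that, as Andrew says above, this only helps when the keys are reasonably evenly distributed; a heavily skewed key can still leave one reducer writing an oversized file.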
