Zach, as a follow-up, you can now use multipart upload to create files larger than 5 GB using EMR. You have to specifically enable it, however. The documentation about the feature and how to enable it can be found here:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?UsingEMR_Config.html#Config_Multipart

Regards,
Andrew

On Tue, Dec 21, 2010 at 5:10 PM, Andrew Hitchcock <[email protected]> wrote:
> Zach,
>
> I work on the Elastic MapReduce team. We are planning to launch
> support for multipart upload into Amazon S3 in early January. This
> will enable you to write files into Amazon S3 from your reducers that
> are up to 5 TB in size.
>
> In the meantime, Dmitriy's advice should work. Increase the number of
> reducers and each reducer will process and write less data. This will
> work unless you have a very uneven data distribution.
>
> Regards,
> Andrew
>
> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:
>> Does anyone know of an existing StoreFunc that lets you specify a maximum
>> output file size? Or would I need to write a custom StoreFunc to do this?
>>
>> I am running into a problem on Amazon's EMR where the files the reducers
>> are writing are too large to be uploaded to S3 (5 GB limit per file) and I
>> need to figure out a way to get the output file sizes down into a
>> reasonable range.
>>
>> The other way would be to fire up more machines, which would provide more
>> reducers, meaning the data is split into more files, yielding smaller
>> files. But I want the resulting files to be split at some reasonable file
>> size (50-100 MB) so they are friendly for pulling down, inspecting, and
>> testing with.
>>
>> Any ideas?
>> -Zach
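A minimal sketch of what enabling the feature could look like from job code. The property name fs.s3n.multipart.uploads.enabled is an assumption here, not something this thread confirms; the page linked above is the authoritative reference, and on EMR the setting would normally be applied cluster-wide through a bootstrap action rather than per job.

// Sketch only: the property name below is an assumption; verify it against
// the EMR documentation linked above before relying on it.
import org.apache.hadoop.conf.Configuration;

public class EnableMultipart {
    public static Configuration withMultipartEnabled() {
        Configuration conf = new Configuration();
        // Ask the S3 native filesystem to use multipart upload so that
        // output files can exceed the single-upload 5 GB limit.
        conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);
        return conf;
    }
}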
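On Dmitriy's suggestion of raising the reducer count, a minimal plain-MapReduce sketch is below; in a Pig script the equivalent would be the PARALLEL clause on the final operator. The count of 64 is an arbitrary illustration: pick a number so that total output size divided by reducer count lands in the target file-size range (for example, 6.4 GB of output across 64 reducers gives roughly 100 MB per file).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MoreReducers {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "more-reducers");
        // Each reducer writes one part file, so with N reducers each file
        // holds roughly 1/N of the output, assuming keys partition evenly.
        job.setNumReduceTasks(64);
    }
}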
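On the original StoreFunc question, here is a rough sketch of the rollover logic a custom StoreFunc's OutputFormat could return: a RecordWriter that starts a new part file once a byte threshold is crossed. The class names and the 100 MB cap are illustrative assumptions, and a real implementation would still need to cooperate with the output committer so the extra files end up in the final output location.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class SizeCappedRecordWriter extends RecordWriter<NullWritable, Text> {
    private static final long MAX_BYTES = 100L * 1024 * 1024; // ~100 MB cap

    private final FileSystem fs;
    private final Path baseDir;
    private final String baseName;
    private FSDataOutputStream out;
    private long bytesWritten = 0;
    private int fileIndex = 0;

    public SizeCappedRecordWriter(FileSystem fs, Path baseDir, String baseName)
            throws IOException {
        this.fs = fs;
        this.baseDir = baseDir;
        this.baseName = baseName;
        openNextFile();
    }

    // Close the current part file and open the next one in the sequence.
    private void openNextFile() throws IOException {
        if (out != null) {
            out.close();
        }
        Path part = new Path(baseDir, baseName + "-" + fileIndex++);
        out = fs.create(part, false);
        bytesWritten = 0;
    }

    @Override
    public void write(NullWritable key, Text value) throws IOException {
        out.write(value.getBytes(), 0, value.getLength());
        out.write('\n');
        bytesWritten += value.getLength() + 1;
        // Roll over once the current file reaches the size cap.
        if (bytesWritten >= MAX_BYTES) {
            openNextFile();
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        if (out != null) {
            out.close();
        }
    }
}

Rolling files inside the writer keeps the reducer count independent of the target file size, so the 50-100 MB goal can be met without adding machines.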
