Zach, as a follow-up, you can now use multipart upload to create files
larger than 5 GB using EMR. You have to enable it explicitly, however.
The documentation about the feature and how to enable it can be found
here:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?UsingEMR_Config.html#Config_Multipart

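For reference, here is an untested sketch of what enabling it looks
like, using the configure-hadoop bootstrap action with the
elastic-mapreduce command line client. The property name is from
memory, so double-check it against the page above:

  elastic-mapreduce --create --alive \
    --bootstrap-action \
      s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "-c,fs.s3n.multipart.uploads.enabled=true"

(If memory serves, there is also an fs.s3n.multipart.uploads.split.size
property, in bytes, that controls how large each uploaded part is; the
docs above have the details.)
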
Regards,
Andrew

On Tue, Dec 21, 2010 at 5:10 PM, Andrew Hitchcock <[email protected]> wrote:
> Zach,
>
> I work on the Elastic MapReduce team. We are planning to launch
> support for multipart upload into Amazon S3 in early January. This
> will enable you to write files into Amazon S3 from your reducer that
> are up to 5 TB in size.
>
> In the meantime, Dmitriy's advice should work. Increase the number of
> reducers and each reducer will process and write less data. This will
> work unless you have a very uneven data distribution.
>
> Regards,
> Andrew
>
> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:
>>  Does anyone know of any existing StoreFunc to specify a maximum output file 
>> size? Or would I need to write a custom StoreFunc to do this?
>>
>>
>> I am running into a problem on Amazon's EMR where the files the reducers are 
>> writing are too large to be uploaded to S3 (5GB limit per file) and I need 
>> to figure out a way to get the output file sizes down into a reasonable 
>> range.
>>
>>
>> The other way would be to fire up more machines, which would provide more 
>> reducers, meaning the data is split into more files, yielding smaller 
>> files. But I want the resulting files to be split at some reasonable size 
>> (50-100 MB) so they are friendly to pull down, inspect, and test with.
>>
>>
>> Any ideas?
>> -Zach
>>
>>
>>
>
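
For anyone hitting this thread later: the reducer count Dmitriy and
Andrew suggest raising can be set in Pig either globally or per
operator. A rough sketch follows; the bucket paths and the figure of
200 reducers are placeholders, so pick a count that divides your
output into the file sizes you want (default_parallel needs Pig 0.8;
on older versions use PARALLEL on each reduce-side operator):

  -- one reducer writes one part file, so more reducers = smaller files
  SET default_parallel 200;

  raw  = LOAD 's3://my-bucket/input' USING PigStorage('\t');
  grpd = GROUP raw BY $0 PARALLEL 200;  -- per-operator override
  STORE grpd INTO 's3://my-bucket/output';

Since each reducer writes one output file, the math is just total
output size divided by reducer count: for example, 20 GB of output
across 200 reducers comes to roughly 100 MB per file, inside the
50-100 MB range Zach asked for.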
