I don't know of anything that gives you this out of the box; you'd have
to write your own OutputFormat plus a StoreFunc that uses it.
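A rough sketch of the core logic such a custom RecordWriter would need -- written here as plain Java with no Hadoop dependencies so the idea stands on its own, and with all class and file names hypothetical: count the bytes written and roll over to a new part file once a cap is passed.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the size-capping logic a custom RecordWriter
// (returned by your OutputFormat, wrapped by your StoreFunc) would carry.
public class RollingWriter implements AutoCloseable {
    private final Path dir;          // output directory
    private final String prefix;     // part-file name prefix
    private final long maxBytes;     // roll to a new file past this size
    private long written = 0;        // bytes written to the current file
    private int part = 0;            // next part-file index
    private Writer out;

    public RollingWriter(Path dir, String prefix, long maxBytes) throws IOException {
        this.dir = dir;
        this.prefix = prefix;
        this.maxBytes = maxBytes;
        openNext();
    }

    // Close the current part file (if any) and start the next one.
    private void openNext() throws IOException {
        if (out != null) out.close();
        Path p = dir.resolve(String.format("%s-%05d", prefix, part++));
        out = Files.newBufferedWriter(p, StandardCharsets.UTF_8);
        written = 0;
    }

    // Append one record; roll first if it would push the file over the cap.
    public void write(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        if (written > 0 && written + bytes.length > maxBytes) openNext();
        out.write(record);
        out.write("\n");
        written += bytes.length;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }

    // Tiny demo: ten 9-byte records with a 25-byte cap land in five files.
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("roll");
        try (RollingWriter w = new RollingWriter(tmp, "part", 25)) {
            for (int i = 0; i < 10; i++) w.write("record-" + i);
        }
        try (java.util.stream.Stream<Path> files = Files.list(tmp)) {
            System.out.println("part files: " + files.count());
        }
    }
}
```

In a real StoreFunc you'd do the same bookkeeping against the task's work directory and use S3-friendly part sizes for the cap.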

As far as firing up more machines -- you don't really need to; you can just
increase the parallelism of your job. If you ask for more reducers than you
have reduce slots in the cluster, they will be scheduled in waves instead
of all at the same time, but they'll all come through.
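Concretely, in Pig you can raise the reducer count with the PARALLEL clause on a reduce-side operator, or set a script-wide default; the relation and field names below are just placeholders:

```
-- script-wide default number of reducers (Pig 0.8+)
SET default_parallel 40;

-- or per operator, on any reduce-side operation:
grouped = GROUP records BY key PARALLEL 40;
```

More reducers means more, smaller output files for roughly the same total data.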

D

On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:

>  Does anyone know of any existing StoreFunc to specify a maximum output
> file size? Or would I need to write a custom StoreFunc to do this?
>
>
> I am running into a problem on Amazon's EMR where the files the reducers
> are writing are too large to be uploaded to S3 (5GB limit per file) and I
> need to figure out a way to get the output file sizes down into a reasonable
> range.
>
>
> The other way would be to fire up more machines, which would provide more
> reducers, splitting the data into more, smaller files. But I want the
> resulting files split at some reasonable size (50 - 100MB) so they are
> friendly for pulling down, inspecting, and testing with.
>
>
> Any ideas?
> -Zach
>
>
>
