I don't know of anything that would give you this out of the box; you'd have to write your own OutputFormat plus a StoreFunc that wraps it.
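Here's a rough, untested sketch of the shape that could take, assuming Hadoop's newer mapreduce API: an OutputFormat whose RecordWriter rolls over to a fresh part file once it has written roughly 64MB. Your StoreFunc would return this from getOutputFormat() and serialize each Tuple to a Text in putNext(). The class name, the size cap, and the NullWritable/Text types are all placeholders of mine, not anything that ships with Pig:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SizeCappedTextOutputFormat
            extends FileOutputFormat<NullWritable, Text> {

        private static final long MAX_BYTES = 64L * 1024 * 1024; // ~64MB cap

        @Override
        public RecordWriter<NullWritable, Text> getRecordWriter(
                final TaskAttemptContext ctx)
                throws IOException, InterruptedException {
            return new RecordWriter<NullWritable, Text>() {
                private FSDataOutputStream out;
                private long written = 0;
                private int part = 0;

                // Close the current chunk (if any) and open the next one.
                private void roll() throws IOException {
                    if (out != null) out.close();
                    // getDefaultWorkFile gives the usual task work path;
                    // the extension suffix makes each chunk its own file.
                    Path p = getDefaultWorkFile(ctx, "-" + (part++));
                    out = p.getFileSystem(ctx.getConfiguration()).create(p, false);
                    written = 0;
                }

                @Override
                public void write(NullWritable k, Text v) throws IOException {
                    if (out == null || written >= MAX_BYTES) roll();
                    out.write(v.getBytes(), 0, v.getLength());
                    out.write('\n');
                    written += v.getLength() + 1;
                }

                @Override
                public void close(TaskAttemptContext c) throws IOException {
                    if (out != null) out.close();
                }
            };
        }
    }

One caveat: anything downstream that assumes exactly one output file per reducer will now see several.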
As far as firing up more machines goes: you don't really need to; you can just increase the parallelism of your job. If you ask for more reducers than there are reduce slots in the cluster, they get scheduled in waves instead of all at once, but they all come through (see the snippet at the bottom of this mail).

D

On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:

> Does anyone know of any existing StoreFunc to specify a maximum output
> file size? Or would I need to write a custom StoreFunc to do this?
>
> I am running into a problem on Amazon's EMR where the files the reducers
> are writing are too large to be uploaded to S3 (5GB limit per file) and I
> need to figure out a way to get the output file sizes down into a
> reasonable range.
>
> The other way would be to fire up more machines, which would provide more
> reducers, meaning the data is split into more files, yielding smaller
> files. But I want the resulting files to be split on some reasonable file
> size (50-100MB) so they are friendly for pulling down, inspecting, and
> testing with.
>
> Any ideas?
> -Zach
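P.S. For the parallelism route, a quick sketch (relation and field names here are made up; pick a reducer count that puts total output size divided by reducer count into your 50-100MB range):

    SET default_parallel 100;                  -- script-wide default (Pig 0.8+)
    grouped = GROUP data BY key PARALLEL 100;  -- or set it per operator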
