What you could do is increase the number of reducers your job runs and at the same time decrease the number of reducers that each machine runs concurrently. The settings for that are:
mapred.reduce.tasks (increase this one)
mapred.tasktracker.reduce.tasks.maximum (decrease this one)

Andrew

On Tue, Dec 21, 2010 at 6:27 PM, Zach Bailey <[email protected]> wrote:
> Thank you very much both Dmitriy and Andrew.
>
> Unfortunately I'm stuck in a bit of a bind. Specifying additional reducers
> is a problem because the workload I have is very reduce-heavy. So
> unfortunately I was running into memory problems exactly as described in
> this thread:
>
> https://forums.aws.amazon.com/thread.jspa?threadID=49024
>
> I ended up having to bump my EMR slave instances up to m2.xlarge instances
> to handle the memory pressure.
>
> Since this is running on EMR I can of course opt to throw more machines at
> the whole thing. Correct me if I'm wrong, but that will hopefully solve both
> problems at the same time, although it doesn't get my output files down to
> the ideal size I was hoping for. The task was running on 4x m2.xlarge
> instances (after failing on 4x m1.large and 8x c1.medium); I think the next
> run I'll try doubling up to 8x m1.large and hopefully that will be enough
> reduce slots to keep the file size down and avoid the memory pressure
> problem.
>
> -Zach
>
> On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <[email protected]> wrote:
>
>> Zach,
>>
>> I work on the Elastic MapReduce team. We are planning to launch
>> support for multipart upload into Amazon S3 in early January. This
>> will enable you to write files into Amazon S3 from your reducer that
>> are up to 5 TB in size.
>>
>> In the meantime, Dmitriy's advice should work. Increase the number of
>> reducers and each reducer will process and write less data. This will
>> work unless you have a very uneven data distribution.
>>
>> Regards,
>> Andrew
>>
>> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]>
>> wrote:
>> > Does anyone know of any existing StoreFunc to specify a maximum output
>> > file size? Or would I need to write a custom StoreFunc to do this?
>> >
>> > I am running into a problem on Amazon's EMR where the files the reducers
>> > are writing are too large to be uploaded to S3 (5 GB limit per file) and
>> > I need to figure out a way to get the output file sizes down into a
>> > reasonable range.
>> >
>> > The other way would be to fire up more machines, which would provide
>> > more reducers, meaning the data is split into more files, yielding
>> > smaller files. But I want the resulting files to be split on some
>> > reasonable file size (50-100 MB) so they are friendly for pulling down,
>> > inspecting, and testing with.
>> >
>> > Any ideas?
>> > -Zach
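A minimal sketch of how the advice above might look for a Pig job on EMR. The pipeline, relation names, S3 paths, and the reducer count of 100 are illustrative, not from the thread; more reducers is what Pig's PARALLEL clause controls, while mapred.tasktracker.reduce.tasks.maximum is read by each TaskTracker when the daemon starts, so it has to be changed in mapred-site.xml on the slave nodes (on EMR, typically via a bootstrap action) rather than from the script:

-- Illustrative only: paths, relation names, and PARALLEL 100 are made up.
raw     = LOAD 's3://my-bucket/input/' USING PigStorage('\t');
grouped = GROUP raw BY $0 PARALLEL 100;  -- more reducers => more, smaller part files
counts  = FOREACH grouped GENERATE group, COUNT(raw);
STORE counts INTO 's3://my-bucket/output/' USING PigStorage('\t');

-- mapred.tasktracker.reduce.tasks.maximum is a per-TaskTracker setting read at
-- daemon startup; lowering it (e.g. to 1 reduce slot per node) has to happen in
-- mapred-site.xml on the slaves (on EMR, via a bootstrap action), not in this script.

Choosing the PARALLEL value so that the total reduce output divided by the number of reducers lands in the 50-100 MB range gives part files of roughly the size Zach is after, with the caveat Andrew mentions: a heavily skewed key distribution can still leave a few reducers with oversized files.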
