Ah, good point. This is what the Pig PARALLEL keyword does, IIRC. However, setting the mapred.tasktracker.reduce.tasks.maximum variable will require a bootstrap script parameter, right?
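
For the per-job side, a minimal sketch of what I mean (the relation names and S3 paths are made up):

    -- Sketch only: relation names and S3 paths are illustrative.
    events  = LOAD 's3://my-bucket/input/events' AS (user_id:chararray, url:chararray);

    -- PARALLEL sets the reducer count for this one reduce-side operator only.
    grouped = GROUP events BY user_id PARALLEL 40;
    counts  = FOREACH grouped GENERATE group, COUNT(events) AS n;

    STORE counts INTO 's3://my-bucket/output/counts';

mapred.tasktracker.reduce.tasks.maximum, on the other hand, is a per-node TaskTracker property rather than a job property, so it can't be changed from the script; it has to be in place when the daemons start, which on EMR would mean pushing it out at cluster launch.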
-Zach

On Tue, Dec 21, 2010 at 9:56 PM, Andrew Hitchcock <[email protected]> wrote:

> What you could do is increase the number of reducers your job runs and
> at the same time decrease the number of reducers that each machine
> runs concurrently. The settings for that are:
>
> mapred.reduce.tasks (increase this one)
> mapred.tasktracker.reduce.tasks.maximum (decrease this one)
>
> Andrew
>
> On Tue, Dec 21, 2010 at 6:27 PM, Zach Bailey <[email protected]> wrote:
> > Thank you very much both Dmitriy and Andrew.
> >
> > Unfortunately I'm stuck in a bit of a bind. Specifying additional reducers
> > is a problem because the workload I have is very reduce heavy. So
> > unfortunately I was running into memory problems exactly as described in
> > this thread:
> >
> > https://forums.aws.amazon.com/thread.jspa?threadID=49024
> >
> > I ended up having to bump my EMR slave instances up to m2.xlarge instances
> > to handle the memory pressure.
> >
> > Since this is running on EMR I can of course opt to throw more machines at
> > the whole thing. Correct me if I'm wrong but that will hopefully solve both
> > problems at the same time, although it doesn't get my output files down to
> > the ideal size I was hoping for. The task was running on 4x m2.xlarge
> > instances (after failing on 4x m1.large and 8x c1.medium), I think the next
> > run I'll try doubling up to 8x m1.large and hopefully that will be enough
> > reduce slots to keep the file size down and avoid the memory pressure
> > problem.
> >
> > -Zach
> >
> > On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <[email protected]> wrote:
> >
> >> Zach,
> >>
> >> I work on the Elastic MapReduce team. We are planning to launch
> >> support for multipart upload into Amazon S3 in early January. This
> >> will enable you to write files into Amazon S3 from your reducer that
> >> are up to 5 TB in size.
> >>
> >> In the meantime, Dmitriy's advice should work. Increase the number of
> >> reducers and each reducer will process and write less data. This will
> >> work unless you have a very uneven data distribution.
> >>
> >> Regards,
> >> Andrew
> >>
> >> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:
> >> > Does anyone know of any existing StoreFunc to specify a maximum output
> >> > file size? Or would I need to write a custom StoreFunc to do this?
> >> >
> >> > I am running into a problem on Amazon's EMR where the files the reducers
> >> > are writing are too large to be uploaded to S3 (5GB limit per file) and I
> >> > need to figure out a way to get the output file sizes down into a
> >> > reasonable range.
> >> >
> >> > The other way would be to fire up more machines, which would provide more
> >> > reducers, meaning the data is split into more files, yielding smaller
> >> > files. But I want the resulting files to be split on some reasonable file
> >> > size (50 - 100MB) so they are friendly for pulling down, inspecting, and
> >> > testing with.
> >> >
> >> > Any ideas?
> >> > -Zach
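
For reference, a back-of-the-envelope version of the advice above (all numbers, relation names, and paths are illustrative): if a job writes roughly 40 GB out of the reduce phase and the target is around 100 MB per part file, that works out to on the order of 400 reducers, which Pig 0.8 can apply across the whole script:

    -- Illustrative only: ~40 GB of reduce output / ~100 MB per part file => ~400 reducers.
    set default_parallel 400;

    events  = LOAD 's3://my-bucket/input/events' AS (user_id:chararray, url:chararray);
    grouped = GROUP events BY user_id;   -- inherits default_parallel, so ~400 reducers
    counts  = FOREACH grouped GENERATE group, COUNT(events) AS n;
    STORE counts INTO 's3://my-bucket/output/counts';

Each reducer writes a single part file, so more reducers means smaller files, provided the keys are reasonably evenly distributed; a single hot key still ends up in a single (possibly oversized) file.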
