Yes, IIRC the mapred.tasktracker.* settings are TaskTracker-specific, not job-specific, and require the TT to be started with the desired values.
-D

On Tue, Dec 21, 2010 at 7:03 PM, Zach Bailey <[email protected]> wrote:
> Ah, good point. This is what the pig PARALLEL keyword does IIRC...
>
> However, for the mapred.tasktracker.reduce.tasks.maximum variable that
> will require a bootstrap script parameter, right?
>
> -Zach
>
> On Tue, Dec 21, 2010 at 9:56 PM, Andrew Hitchcock <[email protected]> wrote:
> > What you could do is increase the number of reducers your job runs and
> > at the same time decrease the number of reducers that each machine
> > runs concurrently. The settings for that are:
> >
> > mapred.reduce.tasks (increase this one)
> > mapred.tasktracker.reduce.tasks.maximum (decrease this one)
> >
> > Andrew
> >
> > On Tue, Dec 21, 2010 at 6:27 PM, Zach Bailey <[email protected]> wrote:
> > > Thank you very much both Dmitriy and Andrew.
> > >
> > > Unfortunately I'm stuck in a bit of a bind. Specifying additional
> > > reducers is a problem because the workload I have is very reduce-heavy,
> > > so I was running into memory problems exactly as described in this
> > > thread:
> > >
> > > https://forums.aws.amazon.com/thread.jspa?threadID=49024
> > >
> > > I ended up having to bump my EMR slave instances up to m2.xlarge
> > > instances to handle the memory pressure.
> > >
> > > Since this is running on EMR I can of course opt to throw more
> > > machines at the whole thing. Correct me if I'm wrong, but that will
> > > hopefully solve both problems at the same time, although it doesn't
> > > get my output files down to the ideal size I was hoping for. The task
> > > was running on 4x m2.xlarge instances (after failing on 4x m1.large
> > > and 8x c1.medium). I think the next run I'll try doubling up to 8x
> > > m1.large, and hopefully that will be enough reduce slots to keep the
> > > file size down and avoid the memory pressure problem.
> > >
> > > -Zach
> > >
> > > On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <[email protected]> wrote:
> > > > Zach,
> > > >
> > > > I work on the Elastic MapReduce team. We are planning to launch
> > > > support for multipart upload into Amazon S3 in early January. This
> > > > will enable you to write files into Amazon S3 from your reducer
> > > > that are up to 5 TB in size.
> > > >
> > > > In the meantime, Dmitriy's advice should work. Increase the number
> > > > of reducers and each reducer will process and write less data. This
> > > > will work unless you have a very uneven data distribution.
> > > >
> > > > Regards,
> > > > Andrew
> > > >
> > > > On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:
> > > > > Does anyone know of any existing StoreFunc to specify a maximum
> > > > > output file size? Or would I need to write a custom StoreFunc to
> > > > > do this?
> > > > >
> > > > > I am running into a problem on Amazon's EMR where the files the
> > > > > reducers are writing are too large to be uploaded to S3 (5GB
> > > > > limit per file) and I need to figure out a way to get the output
> > > > > file sizes down into a reasonable range.
> > > > >
> > > > > The other way would be to fire up more machines, which would
> > > > > provide more reducers, meaning the data is split into more files,
> > > > > yielding smaller files. But I want the resulting files to be
> > > > > split on some reasonable file size (50-100MB) so they are
> > > > > friendly for pulling down, inspecting, and testing with.
> > > > >
> > > > > Any ideas?
> > > > > -Zach
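(Editor's note: the "increase the number of reducers" advice in the thread is just arithmetic: each reducer writes roughly total_output / num_reducers bytes, assuming an even key distribution. A minimal back-of-the-envelope sketch, with hypothetical output sizes, might look like this:)

```python
import math

def reducers_for_target(total_output_bytes, target_file_bytes):
    """Number of reducers needed so each part file stays at or
    under the target size, assuming data is spread evenly
    across reducers (the uneven-distribution caveat Andrew
    mentions still applies)."""
    return math.ceil(total_output_bytes / target_file_bytes)

GB = 1024 ** 3
MB = 1024 ** 2

# e.g. ~200 GB of total reducer output, aiming for ~100 MB part files
print(reducers_for_target(200 * GB, 100 * MB))  # -> 2048

# staying just under the 5 GB S3 single-PUT limit needs far fewer
print(reducers_for_target(200 * GB, 5 * GB))    # -> 40
```

With a heavily skewed key distribution the largest part file can still blow past the target, which is why the even-distribution assumption matters.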

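(Editor's note: the thread's distinction between the two settings is that mapred.reduce.tasks is per-job, so it can be passed on the job's command line, while mapred.tasktracker.reduce.tasks.maximum must already be in place when the TaskTracker starts, which on EMR means a bootstrap action. A rough sketch of formatting both, where the jar and class names are purely hypothetical placeholders:)

```python
def job_args(jar, main_class, reduce_tasks):
    """Per-job arguments: mapred.reduce.tasks can be supplied as a
    generic -D option at submit time. The jar/class names here are
    placeholders, not anything from the thread."""
    return ["hadoop", "jar", jar, main_class,
            "-D", "mapred.reduce.tasks=%d" % reduce_tasks]

def tasktracker_props(max_reduce_slots):
    """TaskTracker-side property a bootstrap script would write into
    mapred-site.xml before the TT starts; setting it per-job has no
    effect, per Dmitriy's note above."""
    return {"mapred.tasktracker.reduce.tasks.maximum": str(max_reduce_slots)}

print(job_args("myjob.jar", "MyJob", 2048))
print(tasktracker_props(2))
```

The split reflects Andrew's suggestion: raise the job-level reducer count while lowering the per-machine concurrent reducer slots, so more (smaller) output files are produced without more reducers running at once per node.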