Yes, IIRC the mapred.tasktracker.* settings are TT-specific, not job-specific,
and require the TaskTracker to be started with the desired values.
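On EMR that means setting it at cluster launch with a bootstrap action rather than a per-job -D flag. Something like the following sketch, using the configure-hadoop bootstrap action (the exact flag letters can vary by AMI version, so check the EMR docs before relying on this):

```shell
# Launch the cluster with the TaskTracker's concurrent reduce-slot limit
# baked in. The -m flag writes the key/value pair into mapred-site.xml
# before the TaskTrackers start, which is the only point where this
# setting is read.
elastic-mapreduce --create --alive \
  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-m,mapred.tasktracker.reduce.tasks.maximum=1"
```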

-D

On Tue, Dec 21, 2010 at 7:03 PM, Zach Bailey <[email protected]> wrote:

> Ah, good point. This is what the pig PARALLEL keyword does IIRC...
>
> However, setting the mapred.tasktracker.reduce.tasks.maximum variable will
> require a bootstrap script parameter, right?
>
> -Zach
>
> On Tue, Dec 21, 2010 at 9:56 PM, Andrew Hitchcock <[email protected]> wrote:
>
> > What you could do is increase the number of reducers your job runs and
> > at the same time decrease the number of reducers that each machine
> > runs concurrently. The settings for that are:
> >
> > mapred.reduce.tasks (increase this one)
> > mapred.tasktracker.reduce.tasks.maximum (decrease this one)
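> > For a Pig job that usually looks something like this sketch (the script
> > name is made up, and note that Pig's own PARALLEL / default_parallel is
> > the more idiomatic way to raise the reducer count):
> >
> > ```shell
> > # Ask for more reducers so each one writes a smaller output file.
> > pig -Dmapred.reduce.tasks=64 myscript.pig
> >
> > # Or, inside the Pig script itself:
> > #   SET default_parallel 64;
> > # (or add PARALLEL 64 to individual GROUP/JOIN/ORDER operators)
> > ```
> >
> > The tasktracker maximum can't be passed this way, since it is only read
> > when the TaskTracker starts up.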
> >
> > Andrew
> >
> > On Tue, Dec 21, 2010 at 6:27 PM, Zach Bailey <[email protected]> wrote:
> > > Thank you very much both Dmitriy and Andrew.
> > >
> > > Unfortunately I'm stuck in a bit of a bind. Specifying additional
> > > reducers is a problem because the workload I have is very reduce-heavy.
> > > So unfortunately I was running into memory problems exactly as
> > > described in this thread:
> > >
> > > https://forums.aws.amazon.com/thread.jspa?threadID=49024
> > >
> > > I ended up having to bump my EMR slave instances up to m2.xlarge
> > > instances to handle the memory pressure.
> > >
> > > Since this is running on EMR I can of course opt to throw more
> > > machines at the whole thing. Correct me if I'm wrong, but that will
> > > hopefully solve both problems at the same time, although it doesn't
> > > get my output files down to the ideal size I was hoping for. The task
> > > was running on 4x m2.xlarge instances (after failing on 4x m1.large
> > > and 8x c1.medium), so on the next run I'll try doubling up to 8x
> > > m1.large; hopefully that will give me enough reduce slots to keep the
> > > file size down and avoid the memory-pressure problem.
> > >
> > > -Zach
> > >
> > > On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <[email protected]> wrote:
> > >
> > >> Zach,
> > >>
> > >> I work on the Elastic MapReduce team. We are planning to launch
> > >> support for multipart upload into Amazon S3 in early January. This
> > >> will enable you to write files into Amazon S3 from your reducer that
> > >> are up to 5 TB in size.
> > >>
> > >> In the meantime, Dmitriy's advice should work. Increase the number of
> > >> reducers and each reducer will process and write less data. This will
> > >> work unless you have a very uneven data distribution.
> > >>
> > >> Regards,
> > >> Andrew
> > >>
> > >> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:
> > >> > Does anyone know of any existing StoreFunc to specify a maximum
> > >> > output file size? Or would I need to write a custom StoreFunc to
> > >> > do this?
> > >> >
> > >> > I am running into a problem on Amazon's EMR where the files the
> > >> > reducers are writing are too large to be uploaded to S3 (5GB limit
> > >> > per file) and I need to figure out a way to get the output file
> > >> > sizes down into a reasonable range.
> > >> >
> > >> > The other way would be to fire up more machines, which would
> > >> > provide more reducers, meaning the data is split into more files,
> > >> > yielding smaller files. But I want the resulting files to be split
> > >> > on some reasonable file size (50-100MB) so they are friendly for
> > >> > pulling down, inspecting, and testing with.
> > >> >
> > >> > Any ideas?
> > >> > -Zach
> > >>
> > >
> >
>
