Ah, good point. This is what the Pig PARALLEL keyword does, IIRC. However, setting the mapred.tasktracker.reduce.tasks.maximum variable will require a bootstrap script parameter, right?
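
For the per-job side, a minimal sketch of what I mean (the relation names and S3 paths are made up):

    -- Sketch only: relation names and S3 paths are illustrative.
    events  = LOAD 's3://my-bucket/input/events' AS (user_id:chararray, url:chararray);

    -- PARALLEL sets the reducer count for this one reduce-side operator only.
    grouped = GROUP events BY user_id PARALLEL 40;
    counts  = FOREACH grouped GENERATE group, COUNT(events) AS n;

    STORE counts INTO 's3://my-bucket/output/counts';

mapred.tasktracker.reduce.tasks.maximum, on the other hand, is a per-node TaskTracker property rather than a job property, so it can't be changed from the script; it has to be in place when the daemons start, which on EMR would mean pushing it out at cluster launch.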
-Zach

On Tue, Dec 21, 2010 at 9:56 PM, Andrew Hitchcock <[email protected]> wrote:

> What you could do is increase the number of reducers your job runs and
> at the same time decrease the number of reducers that each machine
> runs concurrently. The settings for that are:
>
> mapred.reduce.tasks (increase this one)
> mapred.tasktracker.reduce.tasks.maximum (decrease this one)
>
> Andrew
>
> On Tue, Dec 21, 2010 at 6:27 PM, Zach Bailey <[email protected]> wrote:
> > Thank you very much both Dmitriy and Andrew.
> >
> > Unfortunately I'm stuck in a bit of a bind. Specifying additional reducers
> > is a problem because the workload I have is very reduce heavy. So
> > unfortunately I was running into memory problems exactly as described in
> > this thread:
> >
> > https://forums.aws.amazon.com/thread.jspa?threadID=49024
> >
> > I ended up having to bump my EMR slave instances up to m2.xlarge instances
> > to handle the memory pressure.
> >
> > Since this is running on EMR I can of course opt to throw more machines at
> > the whole thing. Correct me if I'm wrong but that will hopefully solve both
> > problems at the same time, although it doesn't get my output files down to
> > the ideal size I was hoping for. The task was running on 4x m2.xlarge
> > instances (after failing on 4x m1.large and 8x c1.medium), I think the next
> > run I'll try doubling up to 8x m1.large and hopefully that will be enough
> > reduce slots to keep the file size down and avoid the memory pressure
> > problem.
> >
> > -Zach
> >
> > On Tue, Dec 21, 2010 at 8:10 PM, Andrew Hitchcock <[email protected]> wrote:
> >
> >> Zach,
> >>
> >> I work on the Elastic MapReduce team. We are planning to launch
> >> support for multipart upload into Amazon S3 in early January. This
> >> will enable you to write files into Amazon S3 from your reducer that
> >> are up to 5 TB in size.
> >>
> >> In the meantime, Dmitriy's advice should work. Increase the number of
> >> reducers and each reducer will process and write less data. This will
> >> work unless you have a very uneven data distribution.
> >>
> >> Regards,
> >> Andrew
> >>
> >> On Tue, Dec 21, 2010 at 2:52 PM, Zach Bailey <[email protected]> wrote:
> >> > Does anyone know of any existing StoreFunc to specify a maximum output
> >> > file size? Or would I need to write a custom StoreFunc to do this?
> >> >
> >> > I am running into a problem on Amazon's EMR where the files the reducers
> >> > are writing are too large to be uploaded to S3 (5GB limit per file) and I
> >> > need to figure out a way to get the output file sizes down into a
> >> > reasonable range.
> >> >
> >> > The other way would be to fire up more machines, which would provide more
> >> > reducers, meaning the data is split into more files, yielding smaller
> >> > files. But I want the resulting files to be split on some reasonable file
> >> > size (50 - 100MB) so they are friendly for pulling down, inspecting, and
> >> > testing with.
> >> >
> >> > Any ideas?
> >> > -Zach
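
For reference, a back-of-the-envelope version of the advice above (all numbers, relation names, and paths are illustrative): if a job writes roughly 40 GB out of the reduce phase and the target is around 100 MB per part file, that works out to on the order of 400 reducers, which Pig 0.8 can apply across the whole script:

    -- Illustrative only: ~40 GB of reduce output / ~100 MB per part file => ~400 reducers.
    set default_parallel 400;

    events  = LOAD 's3://my-bucket/input/events' AS (user_id:chararray, url:chararray);
    grouped = GROUP events BY user_id;   -- inherits default_parallel, so ~400 reducers
    counts  = FOREACH grouped GENERATE group, COUNT(events) AS n;
    STORE counts INTO 's3://my-bucket/output/counts';

Each reducer writes a single part file, so more reducers means smaller files, provided the keys are reasonably evenly distributed; a single hot key still ends up in a single (possibly oversized) file.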
