Ah. That's unfortunate. Yeah, reading thousands of small files is suboptimal
(it's always suboptimal, but in this case, it's extra bad).

Pig committers -- currently JsonMetadata.findMetaFile looks for a metadata
file for each input file. What do you think about making it look at
directories instead?
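Something along these lines, maybe -- just a sketch, not the actual
JsonMetadata code (DirMetaLookup and findMetaFileForDir are made-up names),
but the idea is one existence check per directory, cached, instead of one
per file:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirMetaLookup {
    // directory -> schema file (or null), so each directory is checked
    // exactly once instead of once per data file under it
    private final Map<Path, Path> cache = new HashMap<Path, Path>();

    public Path findMetaFileForDir(Path dataFile, String metaName,
            Configuration conf) throws IOException {
        Path dir = dataFile.getParent();
        if (cache.containsKey(dir)) {
            return cache.get(dir); // cache hit: no namenode/S3 round trip
        }
        FileSystem fs = dir.getFileSystem(conf);
        Path candidate = new Path(dir, metaName); // e.g. ".pig_schema"
        Path found = fs.exists(candidate) ? candidate : null;
        cache.put(dir, found);
        return found;
    }
}

On S3 that would turn one metadata request per file into one per directory,
which should get most of Yang's 40 minutes back even without -noschema.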

Yang -- what's the ratio between # of directories and # of files in your
case?

D

On Sat, Dec 31, 2011 at 6:05 PM, Yang Ling <[email protected]> wrote:

> Thanks for the reply. I spent yesterday on this and found that the 40
> minutes is spent in JsonMetadata.findMetaFile. It seems this is new in
> trunk. In my setting, I have several thousand files/folders in my input,
> and findMetaFile reads them one by one, which takes a long time. I also
> see there is an option in PigStorage to disable it, "-noschema". Once I
> use "-noschema", I get my 40 minutes back. Can we do something so others
> do not get into this pitfall?
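> For reference, the workaround load statement looks something like this
> (the path is just a placeholder; tab is the default delimiter):
>
>   A = LOAD 's3://mybucket/input' USING PigStorage('\t', '-noschema');
>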
> At 2011-12-30 03:52:34, "Dmitriy Ryaboy" <[email protected]> wrote:
> >In the past, when I've observed this kind of insane behavior (no job
> >should take 40 minutes to submit), it's been due to the NameNode or the
> >JobTracker being extremely overloaded, responding slowly, and causing
> >timeouts+retries.
> >
> >2011/12/28 Thejas Nair <[email protected]>
> >
> >> I haven't seen/heard this issue.
> >> Do you mean to say that the extra time is actually a delay before the
> >> MR job is launched?
> >> Did you have free map/reduce slots when you ran the pig job from trunk?
> >>
> >> Thanks,
> >> Thejas
> >>
> >>
> >>
> >>
> >> On 12/23/11 9:01 PM, Yang Ling wrote:
> >>
> >>> I have a Pig job that typically finishes in 20 minutes. With Pig code
> >>> from trunk, it takes more than 1 hour to finish. My input and output
> >>> are on Amazon S3. One interesting thing is that it takes about 40
> >>> minutes to start the mapreduce job, but with the 0.9.1 release it takes
> >>> less than 1 minute. Any idea?
> >>>
> >>
> >>
>
>
