Ah. That's unfortunate. Yeah, reading thousands of small files is suboptimal (it's always suboptimal, but in this case, it's extra bad).
Pig committers -- currently JsonMetadata.findMetaFile looks for a metadata file for each file. What do you think about making it look at directories instead? Yang -- what's the ratio between # of directories and # of files in your case?

D

On Sat, Dec 31, 2011 at 6:05 PM, Yang Ling <[email protected]> wrote:
> Thanks for the reply. I spent yesterday finding out that my 40 minutes is
> spent in JsonMetadata.findMetaFile. It seems this is new in trunk. In my
> setting, I have several thousand files/folders in my input, and
> findMetaFile reads them one by one, which takes a long time. I also see
> there is an option in PigStorage to disable it using "-noschema". Once I
> use "-noschema", I get my 40 minutes back. Can we do something so others
> do not fall into this pitfall?
>
> At 2011-12-30 03:52:34, "Dmitriy Ryaboy" <[email protected]> wrote:
> >In the past, when I've observed this kind of insane behavior (no job
> >should take 40 minutes to submit), it's been due to the NameNode or the
> >JobTracker being extremely overloaded, responding slowly, and causing
> >timeouts and retries.
> >
> >2011/12/28 Thejas Nair <[email protected]>
> >
> >> I haven't seen/heard of this issue.
> >> Do you mean to say that the extra time is actually a delay before the
> >> MR job is launched?
> >> Did you have free map/reduce slots when you ran the Pig job from trunk?
> >>
> >> Thanks,
> >> Thejas
> >>
> >> On 12/23/11 9:01 PM, Yang Ling wrote:
> >>
> >>> I have a Pig job that typically finishes in 20 minutes. With Pig code
> >>> from trunk, it takes more than an hour to finish. My input and output
> >>> are on Amazon S3. One interesting thing is that it takes about 40
> >>> minutes to start the MapReduce job, while for the 0.9.1 release it
> >>> takes less than a minute. Any idea?
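A minimal sketch of the directory-level lookup proposed above, assuming a Hadoop FileSystem-backed input. The class name DirectoryScopedMetaLookup and the caching helper are hypothetical illustrations, not Pig's actual JsonMetadata API:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch of the directory-level lookup: check for the
 * schema file once per directory instead of once per input file,
 * caching the answer so thousands of files in the same directory
 * cost one namenode/S3 round trip instead of thousands.
 */
public class DirectoryScopedMetaLookup {

    // Assumed metadata file name; PigStorage's JSON schema file is ".pig_schema".
    private static final String SCHEMA_FILE = ".pig_schema";

    // One cached lookup result per parent directory (null = known absent).
    private final Map<Path, Path> cache = new HashMap<Path, Path>();

    /**
     * Returns the schema file for the directory containing 'input',
     * or null if none exists. Repeated calls for files in the same
     * directory hit the cache instead of the filesystem.
     */
    public Path findMetaFile(Path input, Configuration conf) throws IOException {
        Path dir = input.getParent();
        if (dir == null) {
            return null; // input is at the filesystem root; no containing directory
        }
        if (cache.containsKey(dir)) {
            return cache.get(dir);
        }
        FileSystem fs = dir.getFileSystem(conf);
        Path candidate = new Path(dir, SCHEMA_FILE);
        Path result = fs.exists(candidate) ? candidate : null;
        cache.put(dir, result);
        return result;
    }
}

With a cache like this, an input of N files spread over D directories costs D existence checks instead of N, which is exactly the ratio Dmitriy asks about. In the meantime, Yang's workaround stands: passing '-noschema' in the PigStorage options string (something like A = LOAD 's3://bucket/input' USING PigStorage(',', '-noschema');) skips the metadata lookup entirely.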
