Filed https://issues.apache.org/jira/browse/PIG-2453

On Sun, Jan 1, 2012 at 5:17 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Ah. That's unfortunate. Yeah, reading thousands of small files is
> suboptimal (it's always suboptimal, but in this case, it's extra bad).
>
> Pig committers -- currently JsonMetadata.findMetaFile looks for a
> metadata file for each file. What do you think about making it look at
> directories instead?
>
> Yang -- what's the ratio between # of directories and # of files in your
> case?
>
> D
>
>
> On Sat, Dec 31, 2011 at 6:05 PM, Yang Ling <[email protected]> wrote:
>
>> Thanks for the reply. I spent yesterday on this and found out that my
>> 40 minutes are spent in JsonMetadata.findMetaFile. It seems this is new
>> in trunk. In my setup, I have several thousand files/folders in my
>> input; findMetaFile reads them one by one, and it takes a long time. I
>> also see there is an option in PigStorage to disable it using
>> "-noschema". Once I use "-noschema", I get my 40 minutes back. Can we do
>> something so others do not fall into this pitfall?
>>
>> At 2011-12-30 03:52:34, "Dmitriy Ryaboy" <[email protected]> wrote:
>> >In the past, when I've observed this kind of insane behavior (no job
>> >should take 40 minutes to submit), it's been due to the NameNode or the
>> >JobTracker being extremely overloaded, responding slowly, causing
>> >timeouts+retries.
>> >
>> >2011/12/28 Thejas Nair <[email protected]>
>> >
>> >> I haven't seen/heard of this issue.
>> >> Do you mean to say that the extra time is actually a delay before the
>> >> MR job is launched?
>> >> Did you have free map/reduce slots when you ran the Pig job from trunk?
>> >>
>> >> Thanks,
>> >> Thejas
>> >>
>> >> On 12/23/11 9:01 PM, Yang Ling wrote:
>> >>
>> >>> I have a Pig job that typically finishes in 20 minutes. I tried Pig
>> >>> code from trunk, and it takes more than 1 hour to finish. My input
>> >>> and output are on Amazon S3. One interesting thing is that it takes
>> >>> about 40 minutes to start the mapreduce job, but for the 0.9.1
>> >>> release, it takes less than 1 minute. Any idea?
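
For illustration only, here is a rough sketch of why the per-directory lookup suggested in the thread could help. It is not Pig's code: the function names and the input paths are hypothetical, and it only counts how many metadata probes each strategy would issue, assuming one probe per file today versus one probe per distinct parent directory under the proposal.

```python
from pathlib import PurePosixPath


def per_file_lookups(paths):
    """Hypothetical current behaviour: one metadata probe per input file."""
    return len(paths)


def per_directory_lookups(paths):
    """Hypothetical proposed behaviour: one probe per distinct parent directory."""
    return len({str(PurePosixPath(p).parent) for p in paths})


# Made-up input layout: 3 directories holding 1000 part files each.
inputs = [
    f"logs/day{d:02d}/part-{i:05d}"
    for d in range(3)
    for i in range(1000)
]

print("per-file probes:     ", per_file_lookups(inputs))
print("per-directory probes:", per_directory_lookups(inputs))
```

How much this saves in practice depends on the ratio of directories to files, which is the number asked for above, and on the latency of each probe against the store (S3 in the reported case).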
