Yang, can you send the load statement you are using and a rough description of the directory structure you are loading? That'll help test the fix.
Thanks,
D

On Sun, Jan 1, 2012 at 6:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
> Filed https://issues.apache.org/jira/browse/PIG-2453
>
> On Sun, Jan 1, 2012 at 5:17 PM, Dmitriy Ryaboy <[email protected]> wrote:
>> Ah. That's unfortunate. Yeah, reading thousands of small files is
>> suboptimal (it's always suboptimal, but in this case, it's extra bad).
>>
>> Pig committers -- currently JsonMetadata.findMetaFile looks for a
>> metadata file for each file. What do you think about making it look at
>> directories instead?
>>
>> Yang -- what's the ratio between the # of directories and the # of files
>> in your case?
>>
>> D
>>
>> On Sat, Dec 31, 2011 at 6:05 PM, Yang Ling <[email protected]> wrote:
>>> Thanks for the reply. I spent yesterday on this and found that my 40
>>> minutes is spent in JsonMetadata.findMetaFile. It seems this is new in
>>> trunk. In my setting, I have several thousand files/folders in my input,
>>> and findMetaFile reads them one by one, which takes a long time. I also
>>> see there is an option in PigStorage to disable it using "-noschema".
>>> Once I use "-noschema", I get my 40 minutes back. Can we do something so
>>> others do not fall into this pitfall?
>>>
>>> At 2011-12-30 03:52:34, "Dmitriy Ryaboy" <[email protected]> wrote:
>>>> In the past, when I've observed this kind of insane behavior (no job
>>>> should take 40 minutes to submit), it's been due to the NameNode or the
>>>> JobTracker being extremely overloaded and responding slowly, causing
>>>> timeouts and retries.
>>>>
>>>> 2011/12/28 Thejas Nair <[email protected]>
>>>>> I haven't seen/heard of this issue.
>>>>> Do you mean to say that the extra time is actually a delay before the
>>>>> MR job is launched?
>>>>> Did you have free map/reduce slots when you ran the Pig job from trunk?
>>>>>
>>>>> Thanks,
>>>>> Thejas
>>>>>
>>>>> On 12/23/11 9:01 PM, Yang Ling wrote:
>>>>>> I have a Pig job that typically finishes in 20 minutes. I tried Pig
>>>>>> code from trunk, and it takes more than an hour to finish. My input
>>>>>> and output are on Amazon S3. One interesting thing is that it takes
>>>>>> about 40 minutes to start the MapReduce job, but with the 0.9.1
>>>>>> release, it takes less than 1 minute. Any idea?
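For anyone hitting the same slowdown, the "-noschema" workaround mentioned in the thread looks roughly like this (a sketch -- the S3 path and alias are placeholders, not Yang's actual load statement):

```pig
-- '-noschema' tells PigStorage not to look for .pig_schema side files,
-- which avoids the per-file JsonMetadata.findMetaFile lookups that were
-- eating ~40 minutes of submission time on inputs with thousands of files.
raw = LOAD 's3://my-bucket/my-input/*' USING PigStorage('\t', '-noschema');
```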
