Patch available. Please test whether it fixes the issue: https://issues.apache.org/jira/browse/PIG-2453
On Sun, Jan 1, 2012 at 7:39 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Yang, can you send the load statement you are using and a rough
> description of the directory structure you are loading? That'll help test
> the fix.
>
> Thanks,
> D
>
>
> On Sun, Jan 1, 2012 at 6:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Filed https://issues.apache.org/jira/browse/PIG-2453
>>
>>
>> On Sun, Jan 1, 2012 at 5:17 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>
>>> Ah. That's unfortunate. Yeah, reading thousands of small files is
>>> suboptimal (it's always suboptimal, but in this case, it's extra bad).
>>>
>>> Pig committers -- currently JsonMetadata.findMetaFile looks for a
>>> metadata file for each file. What do you think about making it look at
>>> directories instead?
>>>
>>> Yang -- what's the ratio between # of directories and # of files in your
>>> case?
>>>
>>> D
>>>
>>>
>>> On Sat, Dec 31, 2011 at 6:05 PM, Yang Ling <[email protected]> wrote:
>>>
>>>> Thanks for the reply. I spent yesterday on this and found that my 40
>>>> minutes are spent in JsonMetadata.findMetaFile. It seems this is new in
>>>> trunk. In my setup, I have several thousand files/folders in my input;
>>>> findMetaFile reads them one by one, and that takes a long time. I also
>>>> see there is an option in PigStorage to disable it, "-noschema". Once I
>>>> use "-noschema", I get my 40 minutes back. Can we do something so others
>>>> do not fall into this pitfall?
>>>>
>>>> At 2011-12-30 03:52:34, "Dmitriy Ryaboy" <[email protected]> wrote:
>>>> >In the past, when I've observed this kind of insane behavior (no job
>>>> >should take 40 minutes to submit), it's been due to the NameNode or the
>>>> >JobTracker being extremely overloaded, responding slowly, causing
>>>> >timeouts+retries.
>>>> >
>>>> >2011/12/28 Thejas Nair <[email protected]>
>>>> >
>>>> >> I haven't seen/heard of this issue.
>>>> >> Do you mean to say that the extra time is actually a delay before the
>>>> >> MR job is launched?
>>>> >> Did you have free map/reduce slots when you ran the Pig job from trunk?
>>>> >>
>>>> >> Thanks,
>>>> >> Thejas
>>>> >>
>>>> >>
>>>> >> On 12/23/11 9:01 PM, Yang Ling wrote:
>>>> >>
>>>> >>> I have a Pig job that typically finishes in 20 minutes. I tried the
>>>> >>> Pig code from trunk, and it takes more than 1 hour to finish. My
>>>> >>> input and output are on Amazon S3. One interesting thing is that it
>>>> >>> takes about 40 minutes to start the MapReduce job, while with the
>>>> >>> 0.9.1 release it takes less than 1 minute. Any idea?
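
To make the per-directory suggestion from the thread concrete, here is a minimal Python sketch. It is not Pig's code and does not touch any filesystem; it only counts how many metadata lookups each strategy would issue. The function names, the bucket path, and the layout of 3 directories with 1000 files each are all made up for illustration.

```python
import posixpath


def probes_per_file(paths):
    """One metadata lookup per input file (the behaviour described in the thread)."""
    return len(paths)


def probes_per_directory(paths):
    """One metadata lookup per distinct parent directory (the proposed change)."""
    return len({posixpath.dirname(p) for p in paths})


# Hypothetical input layout: 3 directories, 1000 small files in each.
paths = [
    f"s3n://bucket/input/day{d}/part-{i:05d}"
    for d in range(3)
    for i in range(1000)
]

print("per-file lookups:     ", probes_per_file(paths))
print("per-directory lookups:", probes_per_directory(paths))
```

With this layout the per-file strategy issues 3000 lookups and the per-directory strategy issues 3, which is why the directories-to-files ratio asked about above matters: if nearly every file sits in its own directory, the change saves little.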
