Yang, can you send the load statement you are using and a rough
description of the directory structure you are loading? That'll help test
the fix.
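
Even something rough is fine, along these lines (the paths here are
made up, just to show the shape of what I mean):

  -- hypothetical layout: s3://bucket/logs/2011/12/28/part-00000, part-00001, ...
  raw = LOAD 's3://bucket/logs/2011/*/*' USING PigStorage('\t');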

Thanks,
D

On Sun, Jan 1, 2012 at 6:02 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Filed https://issues.apache.org/jira/browse/PIG-2453
>
>
> On Sun, Jan 1, 2012 at 5:17 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Ah. That's unfortunate. Yeah, reading thousands of small files is
>> suboptimal (it's always suboptimal, but in this case it's extra bad).
>>
>> Pig committers -- currently JsonMetadata.findMetaFile looks for a
>> metadata file for each input file. What do you think about making it
>> look at directories instead?
>>
>> Yang -- what's the ratio between # of directories and # of files in your
>> case?
>>
>> D
>>
>>
>> On Sat, Dec 31, 2011 at 6:05 PM, Yang Ling <[email protected]> wrote:
>>
>>> Thanks for the reply. I spent yesterday digging and found that the
>>> extra 40 minutes are spent in JsonMetadata.findMetaFile. It seems this
>>> is new in trunk. In my setup, I have several thousand files/folders in
>>> my input, and findMetaFile reads them one by one, which takes a long
>>> time. I also see there is an option in PigStorage to disable it with
>>> "-noschema". Once I use "-noschema", I get my 40 minutes back. Can we
>>> do something so others do not fall into this pitfall?
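>>>
>>> For reference, the workaround looks roughly like this (the path is a
>>> made-up placeholder for my real input):
>>>
>>>   raw = LOAD 's3://bucket/input/*' USING PigStorage('\t', '-noschema');
>>>
>>> With that, the per-file metadata lookups are skipped entirely.
>>>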
>>> At 2011-12-30 03:52:34, "Dmitriy Ryaboy" <[email protected]> wrote:
>>> >In the past, when I've observed this kind of insane behavior (no job
>>> >should take 40 minutes to submit), it's been due to the NameNode or the
>>> >JobTracker being extremely overloaded and responding slowly, causing
>>> >timeouts + retries.
>>> >
>>> >2011/12/28 Thejas Nair <[email protected]>
>>> >
>>> >> I haven't seen/heard of this issue.
>>> >> Do you mean to say that the extra time is actually a delay before the
>>> >> MR job is launched?
>>> >> Did you have free map/reduce slots when you ran the Pig job from trunk?
>>> >>
>>> >> Thanks,
>>> >> Thejas
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On 12/23/11 9:01 PM, Yang Ling wrote:
>>> >>
>>> >>> I have a Pig job that typically finishes in 20 minutes. I tried the
>>> >>> Pig code from trunk, and it takes more than an hour to finish. My
>>> >>> input and output are on Amazon S3. One interesting thing is that it
>>> >>> takes about 40 minutes to start the MapReduce job, but with the
>>> >>> 0.9.1 release it takes less than 1 minute. Any idea?
>>> >>>
>>> >>
>>> >>
>>>
>>>
>>
>
