Thanks, I tried the patch and the job now launches with no delay for me.
As for the directory structure: I have only files, no directories. For my own
case I could just stick with "-noschema"; this is to help other S3 users.
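
For other S3 users who run into this, the workaround in a load statement
looks roughly like this (the bucket and path below are made up for
illustration):

  -- '-noschema' stops PigStorage from probing for a .pig_schema metadata
  -- file next to each input file, which is the slow part on S3
  raw = LOAD 's3n://my-bucket/logs/2011-12-23/part-*'
        USING PigStorage('\t', '-noschema');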
At 2012-01-02 12:15:19,"Dmitriy Ryaboy" <[email protected]> wrote:
>Patch available. Please test if that fixes the issue.
>https://issues.apache.org/jira/browse/PIG-2453
>
>On Sun, Jan 1, 2012 at 7:39 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
>> Yang, can you send the load statement you are using and a rough
>> description of the directory structure you are loading? That'll help test
>> the fix.
>>
>> Thanks,
>> D
>>
>>
>> On Sun, Jan 1, 2012 at 6:02 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>
>>> Filed https://issues.apache.org/jira/browse/PIG-2453
>>>
>>>
>>> On Sun, Jan 1, 2012 at 5:17 PM, Dmitriy Ryaboy <[email protected]> wrote:
>>>
>>>> Ah. That's unfortunate. Yeah, reading thousands of small files is
>>>> suboptimal (it's always suboptimal, but in this case it's extra bad).
>>>>
>>>> Pig committers -- currently JsonMetadata.findMetaFile looks for a
>>>> metadata file for each input file. What do you think about making it
>>>> look at directories instead?
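>>>>
>>>> Something along these lines, maybe (a rough, untested sketch of the
>>>> idea -- the class and method names are illustrative, not the actual
>>>> JsonMetadata internals):
>>>>
>>>> import java.io.IOException;
>>>> import java.util.HashMap;
>>>> import java.util.Map;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>>
>>>> class DirMetaCache {
>>>>     // Remember one probe result per parent directory instead of
>>>>     // issuing one exists() call per input file (the slow part on S3).
>>>>     private final Map<Path, Path> metaFileByDir = new HashMap<Path, Path>();
>>>>
>>>>     Path findMetaFile(Path dataFile, String metaName, FileSystem fs)
>>>>             throws IOException {
>>>>         Path dir = dataFile.getParent();
>>>>         if (!metaFileByDir.containsKey(dir)) {
>>>>             Path candidate = new Path(dir, metaName); // e.g. ".pig_schema"
>>>>             metaFileByDir.put(dir, fs.exists(candidate) ? candidate : null);
>>>>         }
>>>>         return metaFileByDir.get(dir);
>>>>     }
>>>> }
>>>>
>>>> That would collapse thousands of S3 round trips into one per directory.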
>>>>
>>>> Yang -- what's the ratio between # of directories and # of files in your
>>>> case?
>>>>
>>>> D
>>>>
>>>>
>>>> On Sat, Dec 31, 2011 at 6:05 PM, Yang Ling <[email protected]> wrote:
>>>>
>>>>> Thanks for the reply. I spent yesterday digging and found that my 40
>>>>> minutes is spent in JsonMetadata.findMetaFile. It seems this is new in
>>>>> trunk. In my setup I have several thousand files/folders in my input,
>>>>> and findMetaFile reads them one by one, which takes a long time. I also
>>>>> see there is an option in PigStorage to disable this, "-noschema". Once
>>>>> I use "-noschema", I get my 40 minutes back. Can we do something so
>>>>> others do not fall into this pitfall?
>>>>> At 2011-12-30 03:52:34,"Dmitriy Ryaboy" <[email protected]> wrote:
>>>>> >In the past, when I've observed this kind of insane behavior (no job
>>>>> >should take 40 minutes to submit), it's been due to the NameNode or the
>>>>> >JobTracker being extremely overloaded, responding slowly, causing
>>>>> >timeouts+retries.
>>>>> >
>>>>> >2011/12/28 Thejas Nair <[email protected]>
>>>>> >
>>>>> >> I haven't seen/heard of this issue.
>>>>> >> Do you mean to say that the extra time is actually a delay before the
>>>>> >> MR job is launched?
>>>>> >> Did you have free map/reduce slots when you ran the Pig job from
>>>>> >> trunk?
>>>>> >>
>>>>> >> Thanks,
>>>>> >> Thejas
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> On 12/23/11 9:01 PM, Yang Ling wrote:
>>>>> >>
>>>>> >>> I have a Pig job that typically finishes in 20 minutes. With the
>>>>> >>> Pig code from trunk, it takes more than 1 hour to finish. My input
>>>>> >>> and output are on Amazon S3. One interesting thing is that it takes
>>>>> >>> about 40 minutes to start the MapReduce job, whereas with the 0.9.1
>>>>> >>> release it takes less than 1 minute. Any ideas?
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>>
>>>>>
>>>>
>>>
>>