Johannes,
I strongly suspect it's the number of files you are trying to write at
the same time. lsof output might help determine this with a greater
degree of certainty, but it seems extremely likely (likely enough that I
guessed it...). What's the cardinality of the primary key? Can you
avoid writing such a large number of files?
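
If you want to check, something like this on one of the datanodes should
give a rough count (assuming the DataNode runs as its own JVM and pgrep
is available; the class name below is the stock Hadoop one):

  # count file descriptors held by the DataNode process
  lsof -p $(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode) | wc -l

If that number is anywhere near the file descriptor limit (ulimit -n for
the hadoop user), raising nofile in /etc/security/limits.conf and
dfs.datanode.max.xcievers in the datanode config tends to help with many
concurrent writers.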

D

On Tue, Dec 14, 2010 at 6:47 AM, Johannes Rußek
<[email protected]> wrote:
> Hello Dmitriy,
>
> thanks for the helpful questions. I'll gather all the relevant information
> when I kick off another run.
> What I can answer already:
>
> The nodes are running on 4 CPUs with a load of > 19 and about 40-50%
> iowait.
> It's 20 nodes, with one being the namenode.
> The storage is just a temporary HDFS created on the "local" disks when
> the cluster is started each month.
> Yes, in fact I'm using a storefunc that writes multiple files (one for each
> "primary" key I have in the output).
>
> I will send you the rest of the answers as soon as I have gathered the
> needed information.
> Thanks!
> Johannes
>
> On 12.12.2010 12:18, Dmitriy Ryaboy wrote:
>>
>> Johannes,
>> I wonder if something is putting enough pressure on the datanodes that
>> they are unable to ack all the write requests fast enough, causing many
>> tasks to give up due to what amounts to TCP throughput collapse.
>>
>> The logs certainly seem to indicate something unhealthy happening at
>> the DFS level. Bunch of questions below... I am stabbing in the dark
>> here, as I don't run clusters in EC2.
>>
>> Do you have any stats on the network traffic in your cluster while this is
>> happening?
>>
>> Same, but for disk/cpu utilization and similar metrics on the data nodes?
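>>
>> For both, a few samples from a couple of datanodes while the job is in
>> its last wave would do; e.g. something like (assuming sysstat is
>> installed, but any similar tool is fine):
>>
>>   sar -n DEV 5   # per-interface network throughput
>>   iostat -x 5    # per-disk utilization and wait times
>>   vmstat 5       # cpu, run queue, swap activity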
>>
>> I am curious why there's a loader being instantiated in the reducer.
>> Can you send along a relevant portion of the explain plan?
>>
>> How many map tasks and reduce tasks are you running?
>>
>> How big is the cluster?
>>
>> Is the storefunc you are using doing something like writing multiple
>> files?
>>
>> When running a cluster in EC2, what are you using for storage? S3, EBS...?
>>
>> D
>>
>> On Fri, Dec 10, 2010 at 2:53 AM, jr <[email protected]> wrote:
>>
>>> Hello Ashutosh,
>>>
>>> I'm running entirely on Amazon EC2, and while I get those errors, I seem
>>> to be able to access HDFS by using "hadoop fs" :/
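>>>
>>> e.g. plain listing and reads like
>>>
>>>   hadoop fs -ls /
>>>   hadoop fs -tail <some job output file>
>>>
>>> (the path is just an example) still go through fine.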
>>>
>>> regards,
>>> Johannes
>>>
>>> On Wednesday, 08.12.2010, at 09:11 -0800, Ashutosh Chauhan wrote:
>>>>
>>>> From the logs it looks like the issue is not with Pig but with your
>>>> HDFS. Either your HDFS is running out of space, or some (or all) nodes
>>>> in your cluster can't talk to each other (network issue?)
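>>>>
>>>> A quick way to rule both out (assuming you can run the hadoop client
>>>> on the namenode) is something like:
>>>>
>>>>   hadoop dfsadmin -report   # free space, dead/alive datanodes
>>>>   hadoop fsck /             # missing or under-replicated blocks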
>>>>
>>>> Ashutosh
>>>> On Wed, Dec 8, 2010 at 06:09, jr <[email protected]> wrote:
>>>>>
>>>>> Hi guys,
>>>>> I'm having some trouble finishing jobs that run smoothly on a smaller
>>>>> dataset but always fail at 99% if I try to run the job on the whole
>>>>> set.
>>>>> I can see a few killed map and a few killed reduce tasks, but quite a
>>>>> lot of failed reduce tasks that all show the same exception at the end.
>>>>> Here is what I have in the logs:
>>>>>
>>>
>
>
