Hello Dmitriy,

Thanks for the helpful questions. I'll gather all the relevant information when I kick off the next run.
What I can answer already:

The nodes are running on 4 CPUs with a load of > 19 and roughly 40-50% iowait.
It's 20 nodes, with one of them being the namenode.
The storage is just a temporary HDFS created on the "local" disks when the cluster is started each month. And yes, I am in fact using a storefunc that writes multiple files (one for each "primary" key I have in the output).
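For reference, the store is shaped roughly like the sketch below. It uses piggybank's stock MultiStorage as a stand-in (my actual storefunc is custom, and the jar path, relation, and field index here are made up):

    grunt> REGISTER /path/to/piggybank.jar;
    grunt> a = LOAD 'in' AS (key:chararray, rest:chararray);
    grunt> STORE a INTO 'out' USING
               org.apache.pig.piggybank.storage.MultiStorage('out', '0');

MultiStorage('out', '0') fans the output out into one file per distinct value of field 0, which is the same kind of per-key splitting my storefunc does.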

I will send you the rest of the answers as soon as I have gathered the needed information.
Thanks!
Johannes

On 12.12.2010 12:18, Dmitriy Ryaboy wrote:
Johannes,
I wonder if something is putting enough pressure on the datanodes that they
are unable to ack all the write requests fast enough, causing many tasks to
give up due to what amounts to TCP throughput collapse.
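One cheap thing to rule out on a 0.20-era cluster under this kind of write pressure is the datanode transceiver cap (dfs.datanode.max.xcievers in hdfs-site.xml; the low default is commonly raised to 4096). This is just a guess, but the datanodes complain loudly in their logs when the cap is hit (the log path varies by install):

    # look for "exceeds the limit of concurrent xcievers" complaints
    grep -i xceiver /path/to/hadoop/logs/hadoop-*-datanode-*.log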

The logs certainly seem to indicate something unhealthy happening at the DFS
level. A bunch of questions below... I am stabbing in the dark here, as I
don't run clusters in EC2.

Do you have any stats on the network traffic in your cluster while this is
happening?

Same, but for disk/cpu utilization and similar metrics on the data nodes?
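Even a few samples from the standard sysstat tools, taken on a datanode while the job is dying, would be useful (assuming they are installed on your AMI):

    iostat -x 5   # per-device utilization and await, every 5 seconds
    sar -n DEV 5  # per-NIC throughput, every 5 seconds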

I am curious why there's a loader being instantiated in the reducer. Can you
send along a relevant portion of the explain plan?
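If it's easier, you can dump the plan for the stored alias straight from grunt (the alias name below is made up); the interesting part is the MapReduce plan at the end of the output:

    grunt> explain final_output;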

How many map tasks and reduce tasks are you running?

How big is the cluster?

Is the storefunc you are using doing something like writing multiple files?

When running a cluster in EC2, what are you using for storage? S3, EBS...?

D

On Fri, Dec 10, 2010 at 2:53 AM, jr <[email protected]> wrote:

Hello Ashutosh,

I'm running entirely on Amazon EC2, and while I get those errors, I seem
to be able to access HDFS by using "hadoop fs" :/

Regards,
Johannes

On Wednesday, 08.12.2010, 09:11 -0800, Ashutosh Chauhan wrote:
From the logs it looks like the issue is not with Pig but with your HDFS.
Either your HDFS is running out of space, or some (or all) nodes in
your cluster can't talk to each other (a network issue?).
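Both are quick to check from any node with the stock Hadoop CLI:

    hadoop dfsadmin -report   # capacity and remaining space per datanode
    hadoop fsck /             # block-level health of the filesystem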

Ashutosh
On Wed, Dec 8, 2010 at 06:09, jr <[email protected]> wrote:
Hi guys,
I'm having some trouble finishing jobs that run smoothly on a smaller
dataset but always fail at 99% if I try to run them on the whole
set.
I can see a few killed map and a few killed reduce tasks, but quite a lot of
failed reduce tasks that all show the same exception at the end.
Here is what I have in the logs:


