Johannes, I strongly suspect it's the number of files you are trying to write at the same time. lsof output would help confirm this with more certainty, but it seems extremely likely (likely enough that it was my first guess). What's the cardinality of the primary key? Can you avoid writing such a large number of files?
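For what it's worth, a rough sketch of that check on a worker node. It assumes the 0.20-era task JVM main class `org.apache.hadoop.mapred.Child` and a `hadoop` service user; adjust both for your setup:

```shell
# Count open file descriptors per task JVM via /proc (Linux only).
for pid in $(pgrep -f 'org.apache.hadoop.mapred.Child'); do
    printf '%s %s\n' "$pid" "$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)"
done

# Or sum everything held by the Hadoop user with lsof:
lsof -u hadoop 2>/dev/null | wc -l
```

If the per-JVM counts approach the ulimit (often 1024 by default), that would explain writes failing only on the full dataset.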
D

On Tue, Dec 14, 2010 at 6:47 AM, Johannes Rußek <[email protected]> wrote:
> Hello Dmitriy,
>
> Thanks for the helpful questions. I'll gather all the relevant information
> when I kick off another run. What I can answer already:
>
> The nodes are running on 4 CPUs with a load of > 19 and about 40-50%
> iowait.
> It's 20 nodes, with one being the namenode.
> The storage is a temporary HDFS created on the "local" disks when the
> cluster is started each month.
> Yes, in fact I'm using a storefunc that writes multiple files (one for
> each "primary" key I have in the output).
>
> I will send you the rest of the answers as soon as I have gathered the
> needed information.
> Thanks!
> Johannes
>
> On 12.12.2010 12:18, Dmitriy Ryaboy wrote:
>>
>> Johannes,
>> I wonder if something is putting enough pressure on the datanodes that
>> they are unable to ack all the write requests fast enough, causing many
>> tasks to give up due to what amounts to TCP throughput collapse.
>>
>> The logs certainly seem to indicate something unhealthy happening at the
>> DFS level. A bunch of questions below... I am stabbing in the dark here,
>> as I don't run clusters in EC2.
>>
>> Do you have any stats on the network traffic in your cluster while this
>> is happening?
>>
>> Same, but for disk/CPU utilization and similar metrics on the data nodes?
>>
>> I am curious why there's a loader being instantiated in the reducer. Can
>> you send along a relevant portion of the explain plan?
>>
>> How many map tasks and reduce tasks are you running?
>>
>> How big is the cluster?
>>
>> Is the storefunc you are using doing something like writing multiple
>> files?
>>
>> When running a cluster in EC2, what are you using for storage? S3, EBS...?
>>
>> D
>>
>> On Fri, Dec 10, 2010 at 2:53 AM, jr <[email protected]> wrote:
>>
>>> Hello Ashutosh,
>>>
>>> I'm running entirely on Amazon EC2, and while I get those errors, I
>>> seem to be able to access HDFS by using "hadoop fs" :/
>>>
>>> Regards,
>>> Johannes
>>>
>>> On Wednesday, 08.12.2010, 09:11 -0800, Ashutosh Chauhan wrote:
>>>>
>>>> From the logs it looks like the issue is not with Pig but with your
>>>> HDFS. Either your HDFS is running out of space, or some (or all) nodes
>>>> in your cluster can't talk to each other (network issue?).
>>>>
>>>> Ashutosh
>>>>
>>>> On Wed, Dec 8, 2010 at 06:09, jr <[email protected]> wrote:
>>>>>
>>>>> Hi guys,
>>>>> I'm having some trouble finishing jobs that run smoothly on a smaller
>>>>> dataset but always fail at 99% if I try to run the job on the whole
>>>>> set. I can see a few killed map and a few killed reduce tasks, but
>>>>> quite a lot of failed reduce tasks that all show the same exception
>>>>> at the end. Here is what I have in the logs:
>>>>>
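On the out-of-space hypothesis quoted above: since this HDFS lives on the instances' local disks, something like the following would settle it quickly. The `/mnt` path is an assumption (EC2 instance stores are commonly mounted there; substitute your dfs.data.dir volumes), and the dfsadmin command uses the 0.20-era CLI name:

```shell
# Free space on the local volume(s) backing dfs.data.dir (path is a guess).
df -h /mnt 2>/dev/null || df -h

# Cluster-wide view from the namenode: per-datanode capacity, DFS remaining,
# and dead nodes, which would also surface the can't-talk-to-each-other case.
command -v hadoop >/dev/null && hadoop dfsadmin -report
```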
