Johannes,

I wonder if something is putting enough pressure on the datanodes that they are unable to ack all the write requests fast enough, causing many tasks to give up due to what amounts to TCP throughput collapse.
The logs certainly seem to indicate something unhealthy happening at the DFS level.

Bunch of questions below... I am stabbing in the dark here, as I don't run clusters in EC2.

Do you have any stats on the network traffic in your cluster while this is happening?
Same, but for disk/CPU utilization and similar metrics on the data nodes?
I am curious why there's a loader being instantiated in the reducer. Can you send along a relevant portion of the explain plan?
How many map tasks and reduce tasks are you running? How big is the cluster?
Is the storefunc you are using doing something like writing multiple files?
When running a cluster in EC2, what are you using for storage? S3, EBS...?

D

On Fri, Dec 10, 2010 at 2:53 AM, jr <[email protected]> wrote:

> Hello Ashutosh,
>
> I'm running entirely on Amazon EC2, and while I get those errors, I seem
> to be able to access HDFS by using "hadoop fs" :/
>
> regards,
> Johannes
>
> On Wednesday, 08.12.2010, 09:11 -0800, Ashutosh Chauhan wrote:
> > From the logs it looks like the issue is not with Pig but with your HDFS.
> > Either your HDFS is running out of space or some (or all) nodes in
> > your cluster can't talk to each other (network issue?)
> >
> > Ashutosh
> > On Wed, Dec 8, 2010 at 06:09, jr <[email protected]> wrote:
> > > Hi guys,
> > > I'm having some trouble finishing jobs that run smoothly on a smaller
> > > dataset, but always fail at 99% if I try to run the job on the whole
> > > set.
> > > I can see a few killed map and a few killed reduce tasks, but quite a lot of
> > > failed reduce tasks that all show the same exception at the end.
> > > Here is what I have in the logs:
> > >
