It is my own code. I'm staring at my VertexInputFormat class right now; it extends TextVertexInputFormat<Text, DoubleWritable, NullWritable, DoubleWritable>. I cannot imagine why a value would not be set for these vertices, but I'll drop in some code to more stringently ensure that values are created.
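Something like the following is what I have in mind -- just a sketch, with a made-up helper class and an assumed tab-separated "id<TAB>value" line format, not my actual reader, but it shows the guard I mean: never let a parsed vertex value come back null.

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

/** Hypothetical parsing helper: every vertex gets a concrete, non-null value. */
public final class SafeVertexLineParser {

  private SafeVertexLineParser() { }

  /** Vertex id from an "id<TAB>value" line. */
  public static Text parseId(String line) {
    return new Text(line.split("\t")[0]);
  }

  /** Vertex value, falling back to 0.0 instead of ever returning null. */
  public static DoubleWritable parseValue(String line) {
    String[] tokens = line.split("\t");
    if (tokens.length < 2 || tokens[1].isEmpty()) {
      return new DoubleWritable(0.0);  // missing value -> explicit default, not null
    }
    try {
      return new DoubleWritable(Double.parseDouble(tokens[1]));
    } catch (NumberFormatException e) {
      return new DoubleWritable(0.0);  // malformed value -> explicit default, not null
    }
  }
}

If it really is edge.getValue() coming back null, I'll put the same sort of guard on edge creation; since my edge value type is NullWritable, that would mean making sure every edge is built with NullWritable.get() rather than a bare null.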
Why would this begin to fail on a distributed deployment (multiple workers) but not with a single worker? The dataset is identical between the two executions.

On Wed, Feb 13, 2013 at 2:35 PM, Alessandro Presta <[email protected]> wrote:

> Hi Zachary,
>
> Are you running one of the examples or your own code?
> It seems to me that a call to edge.getValue() is returning null, which
> should never happen.
>
> Alessandro
>
> From: Zachary Hanif <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Wednesday, February 13, 2013 11:29 AM
> To: "[email protected]" <[email protected]>
> Subject: Giraph/Netty issues on a cluster
>
> (How embarrassing! I forgot a subject header in a previous attempt to
> post this. Please reply to this thread, not the other.)
>
> Hi everyone,
>
> I am having some odd issues when trying to run a Giraph 0.2 job across my
> CDH 3u3 cluster. After building the jar and deploying it across the
> cluster, I start to notice a handful of my nodes reporting the following
> error:
>
>> 2013-02-13 17:47:43,341 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler:
>> exceptionCaught: Channel failed with remote address <EDITED_INTERNAL_DNS>/10.2.0.16:30001
>> java.lang.NullPointerException
>>   at org.apache.giraph.vertex.EdgeListVertexBase.write(EdgeListVertexBase.java:106)
>>   at org.apache.giraph.partition.SimplePartition.write(SimplePartition.java:169)
>>   at org.apache.giraph.comm.requests.SendVertexRequest.writeRequest(SendVertexRequest.java:71)
>>   at org.apache.giraph.comm.requests.WritableRequest.write(WritableRequest.java:127)
>>   at org.apache.giraph.comm.netty.handler.RequestEncoder.encode(RequestEncoder.java:96)
>>   at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:61)
>>   at org.jboss.netty.handler.execution.ExecutionHandler.handleDownstream(ExecutionHandler.java:185)
>>   at org.jboss.netty.channel.Channels.write(Channels.java:712)
>>   at org.jboss.netty.channel.Channels.write(Channels.java:679)
>>   at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:246)
>>   at org.apache.giraph.comm.netty.NettyClient.sendWritableRequest(NettyClient.java:655)
>>   at org.apache.giraph.comm.netty.NettyWorkerClient.sendWritableRequest(NettyWorkerClient.java:144)
>>   at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(NettyWorkerClientRequestProcessor.java:425)
>>   at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.sendPartitionRequest(NettyWorkerClientRequestProcessor.java:195)
>>   at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.flush(NettyWorkerClientRequestProcessor.java:365)
>>   at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:190)
>>   at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:58)
>>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>   at java.lang.Thread.run(Thread.java:722)
>
> What would be causing this? All other Hadoop jobs run well on the cluster,
> and when the Giraph job is run with only one worker, it completes without
> any issues. When run with any number of workers >1, the above error occurs.
> I have referenced this post
> <http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3ccaeq6y7shc4in-l73nr7abizspmrrfw9sfa8tmi3myqml8vk...@mail.gmail.com%3E>,
> where superficially similar issues were discussed, but the root cause
> appears to be different, and the suggested methods of resolution are not
> panning out.
>
> As extra background, the 'remote address' changes as the error cycles
> through my available cluster nodes, and the failing workers do not seem to
> favor one physical machine over another. Not all nodes present this issue,
> only a handful per job. Is there something simple that I am missing?
