You probably have a pretty small input that is all loaded by the same worker, so the other worker gets 0 input splits. This shouldn't be a problem.
From: Zachary Hanif <[email protected]>
Reply-To: [email protected]
Date: Wednesday, February 13, 2013 1:08 PM
To: [email protected]
Subject: Re: Giraph/Netty issues on a cluster

Well, this is a bit odd:

2013-02-13 20:58:45,740 INFO org.apache.giraph.worker.BspServiceWorker: loadInputSplits: Using 1 thread(s), originally 1 threads(s) for 14 total splits.
2013-02-13 20:58:45,742 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty without authentication.
2013-02-13 20:58:45,744 INFO org.apache.giraph.comm.SendPartitionCache: SendPartitionCache: maxVerticesPerTransfer = 10000
2013-02-13 20:58:45,744 INFO org.apache.giraph.comm.SendPartitionCache: SendPartitionCache: maxEdgesPerTransfer = 80000
2013-02-13 20:58:45,745 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty without authentication.
2013-02-13 20:58:45,755 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty without authentication.
2013-02-13 20:58:45,758 INFO org.apache.giraph.comm.netty.NettyServer: start: Using Netty without authentication.
2013-02-13 20:58:45,814 INFO org.apache.giraph.worker.InputSplitsCallable: call: Loaded 0 input splits in 0.07298644 secs, (v=0, e=0) 0.0 vertices/sec, 0.0 edges/sec
2013-02-13 20:58:45,817 INFO org.apache.giraph.comm.netty.NettyClient: waitAllRequests: Finished all requests. MBytes/sec sent = 0, MBytes/sec received = 0, MBytesSent = 0, MBytesReceived = 0, ave sent req MBytes = 0, ave received req MBytes = 0, secs waited = 8.303
2013-02-13 20:58:45,817 INFO org.apache.giraph.worker.BspServiceWorker: setup: Finally loaded a total of (v=0, e=0)

What would cause this? I imagine that it's related to my overall problem.

On Wed, Feb 13, 2013 at 3:31 PM, Zachary Hanif <[email protected]> wrote:

It is my own code. I'm staring at my VertexInputFormat class right now. It extends TextVertexInputFormat<Text, DoubleWritable, NullWritable, DoubleWritable>. I cannot imagine why a value would not be set for these vertices, but I'll drop in some code to more stringently ensure value creation. Why would this begin to fail on a distributed deployment (multiple workers) but not with a single worker? The dataset is identical between the two executions.

On Wed, Feb 13, 2013 at 2:35 PM, Alessandro Presta <[email protected]> wrote:

Hi Zachary,

Are you running one of the examples or your own code? It seems to me that a call to edge.getValue() is returning null, which should never happen.

Alessandro
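(Zachary's actual input format is not shown in the thread. As a hedged illustration only, a reader for his <Text, DoubleWritable, NullWritable> types might guarantee non-null values along these lines; the class name and the tab-separated line layout are assumptions, not his code.)

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    // Hypothetical parsed form of one input line ("id \t value \t t1,t2,...")
    // for a vertex with <Text, DoubleWritable, NullWritable> types.
    // The invariant that matters for the NPE in this thread: never hand
    // Giraph a null Writable, for vertex values or edge values.
    public class DefensiveVertexLine {
      public final Text id;
      public final DoubleWritable value;
      // With E = NullWritable, the edge value must be NullWritable.get(),
      // never a bare null reference.
      public final NullWritable edgeValue = NullWritable.get();
      public final List<Text> targets = new ArrayList<Text>();

      public DefensiveVertexLine(String line) {
        String[] cols = line.split("\t");
        id = new Text(cols[0]);
        double v = 0.0;  // default rather than null when the column is bad
        if (cols.length > 1) {
          try {
            v = Double.parseDouble(cols[1]);
          } catch (NumberFormatException e) {
            // keep the default instead of leaving the value unset
          }
        }
        value = new DoubleWritable(v);
        if (cols.length > 2 && !cols[2].isEmpty()) {
          for (String t : cols[2].split(",")) {
            targets.add(new Text(t));
          }
        }
      }
    }

The parsing itself is beside the point; what matters is that every value object the reader produces is a real Writable instance.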
From: Zachary Hanif <[email protected]>
Reply-To: [email protected]
Date: Wednesday, February 13, 2013 11:29 AM
To: [email protected]
Subject: Giraph/Netty issues on a cluster

(How embarrassing! I forgot a subject header in a previous attempt to post this. Please reply to this thread, not the other.)

Hi everyone,

I am having some odd issues when trying to run a Giraph 0.2 job across my CDH 3u3 cluster. After building the jar and deploying it across the cluster, I start to notice a handful of my nodes reporting the following error:

2013-02-13 17:47:43,341 WARN org.apache.giraph.comm.netty.handler.ResponseClientHandler: exceptionCaught: Channel failed with remote address <EDITED_INTERNAL_DNS>/10.2.0.16:30001
java.lang.NullPointerException
    at org.apache.giraph.vertex.EdgeListVertexBase.write(EdgeListVertexBase.java:106)
    at org.apache.giraph.partition.SimplePartition.write(SimplePartition.java:169)
    at org.apache.giraph.comm.requests.SendVertexRequest.writeRequest(SendVertexRequest.java:71)
    at org.apache.giraph.comm.requests.WritableRequest.write(WritableRequest.java:127)
    at org.apache.giraph.comm.netty.handler.RequestEncoder.encode(RequestEncoder.java:96)
    at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:61)
    at org.jboss.netty.handler.execution.ExecutionHandler.handleDownstream(ExecutionHandler.java:185)
    at org.jboss.netty.channel.Channels.write(Channels.java:712)
    at org.jboss.netty.channel.Channels.write(Channels.java:679)
    at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:246)
    at org.apache.giraph.comm.netty.NettyClient.sendWritableRequest(NettyClient.java:655)
    at org.apache.giraph.comm.netty.NettyWorkerClient.sendWritableRequest(NettyWorkerClient.java:144)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.doRequest(NettyWorkerClientRequestProcessor.java:425)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.sendPartitionRequest(NettyWorkerClientRequestProcessor.java:195)
    at org.apache.giraph.comm.netty.NettyWorkerClientRequestProcessor.flush(NettyWorkerClientRequestProcessor.java:365)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:190)
    at org.apache.giraph.worker.InputSplitsCallable.call(InputSplitsCallable.java:58)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)

What would be causing this? All other Hadoop jobs run well on the cluster, and when the Giraph job is run with only one worker, it completes without any issues. When run with any number of workers >1, the above error occurs. I have referenced this post <http://mail-archives.apache.org/mod_mbox/giraph-user/201209.mbox/%3ccaeq6y7shc4in-l73nr7abizspmrrfw9sfa8tmi3myqml8vk...@mail.gmail.com%3E>, where superficially similar issues were discussed, but the root cause appears to be different, and the suggested methods of resolution are not panning out.

As extra background, the 'remote address' changes as the error cycles through my available cluster nodes, and the failing workers do not seem to favor one physical machine over another. Not all nodes present this issue, only a handful per job. Is there something simple that I am missing?
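(The trace bottoms out in EdgeListVertexBase.write(), which serializes each out-edge's value, and that path only runs when a partition is shipped to a remote worker over Netty. With a single worker every partition stays local, which would explain why the job passes with one worker and fails with several. Below is a minimal sketch of the failure mode using plain Hadoop Writables, not Giraph code; a DoubleWritable stands in for the edge value here purely for illustration.)

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutput;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;

    // Serializing an edge the way a vertex write() method would: if the
    // value reference was never set, the NPE only surfaces at serialization
    // time, i.e. only when data is sent to a remote worker.
    public class NullEdgeValueRepro {
      public static void main(String[] args) throws IOException {
        Text targetId = new Text("v2");
        DoubleWritable edgeValue = null;  // the suspected bug: value never set

        DataOutput out = new DataOutputStream(new ByteArrayOutputStream());
        targetId.write(out);   // fine
        edgeValue.write(out);  // NullPointerException, as in the stack trace
      }
    }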
