I want to make a correction about the errors. The error should be as
follows. The errors in my previous email are from my added debug
message. But the problem is the same, somehow some connection was
reset by peer. I did more tries. Occasionally, my job can actually run
without a problem, then more times the job fails because of this
connection reset problem. I really don't have a clue what the problem
is.
Yuanyuan
java.lang.IllegalStateException: run: Caught an unrecoverable
exception flush: Got ExecutionException
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:859)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: flush: Got ExecutionException
at
org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1085)
at
org.apache.giraph.graph.BspServiceWorker.finishSuperstep(BspServiceWorker.java:1080)
at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:806)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:850)
... 7 more
Caused by: java.util.concurrent.ExecutionException:
java.lang.RuntimeException: java.io.IOException: Call to
idp33.almaden.ibm.com/172.16.0.33:30054 failed on local exception:
java.io.IOException: Connection reset by peer
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at
org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1080)
... 10 more
Caused by: java.lang.RuntimeException: java.io.IOException: Call to
idp33.almaden.ibm.com/172.16.0.33:30054 failed on local exception:
java.io.IOException: Connection reset by peer
at
org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:379)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Call to
idp33.almaden.ibm.com/172.16.0.33:30054 failed on local exception:
java.io.IOException: Connection reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
at org.apache.hadoop.ipc.Client.call(Client.java:1033)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
at $Proxy3.putVertexIdMessagesList(Unknown Source)
at
org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:339)
... 6 more
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
at sun.nio.ch.IOUtil.read(IOUtil.java:175)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:343)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:767)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:712)
From: Yuanyuan Tian/Almaden/IBM@IBMUS
To: [email protected]
Cc: [email protected]
Date: 06/27/2012 10:02 PM
Subject: Re: wierd communication errors
------------------------------------------------------------------------
What do you mean using netty? I am not aware that Giraph is using
netty. I am just using what ever the default giraph release 1.0 is using.
Yuanyuan
From: Avery Ching <[email protected]>
To: [email protected]
Date: 06/27/2012 07:57 PM
Subject: Re: wierd communication errors
------------------------------------------------------------------------
Same issue using netty as well?
On 6/27/12 6:14 PM, Yuanyuan Tian wrote:
Hi,
I was running a giraph job where I constantly got the following
communication related errors. The symptom is that in super step 0,
most of the workers succeeded but a few of the workers produced the
errors below, the machines that caused the connection reset are
different in each failed worker. To rule out the probability of the
cluster setup error, I also ran a different job and it worked fine.
So, the error must be caused by this particular giraph job. My giraph
job is just normal message propagation type of job, except that the
message is not a of a unique type. Therefore, I defined a special
message type (also copied in this email) that incorporates two
different types of messages: integer message and double array message.
I have tried all day but still couldn't ping point the source of the
bug. Can anyone give me some hints on what may have caused this error?
Thanks a lot,
java.lang.IllegalStateException: run: Caught an unrecoverable
exception flush: Got ExecutionException
at
org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:859)
at
org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native
Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.IllegalStateException: flush: Got ExecutionException
at
org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1082)
at
org.apache.giraph.graph.BspServiceWorker.finishSuperstep(BspServiceWorker.java:1080)
at
org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:806)
at
org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:850)
... 7 more
Caused by: java.util.concurrent.ExecutionException:
java.lang.reflect.UndeclaredThrowableException
at
java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at
org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:1077)
... 10 more
Caused by: java.lang.reflect.UndeclaredThrowableException
at $Proxy3.getName(Unknown Source)
at
org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:335)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Call to
idp35.almaden.ibm.com/172.16.0.35:30083 failed on local exception:
java.io.IOException: Connection reset by peer
at
org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
at org.apache.hadoop.ipc.Client.call(Client.java:1033)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
... 8 more
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202)
at sun.nio.ch.IOUtil.read(IOUtil.java:175)
at
sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at
org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at
java.io.FilterInputStream.read(FilterInputStream.java:116)
at
org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:343)
at
java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at
java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at
org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:767)
at
org.apache.hadoop.ipc.Client$Connection.run(Client.java:712)
My special messge type:*
public**class*MyMessageWritable *implements*Writable{
*public**byte*msgType=0;
*public**long*vertexID=-1;
*public**double*[] arrayMsg=*null*;
*public**int*intMsg=-1;
*public*MyMessageWritable ()
{
}
*public*MyMessageWritable (*long*id, *byte*tp, *int*msg)
{
vertexID=id;
msgType=tp;
intMsg=msg;
}
*public*MyMessageWritable (*long*id, *byte*tp, *double*[] arr)
{
vertexID=id;
msgType=tp;
arrayMsg=arr;
}
@Override
*public**void*readFields(DataInput in) *throws*IOException {
vertexID=in.readLong();
msgType=in.readByte();
*switch*(msgType)
{
*case*1:
*case*4:
intMsg=in.readInt();
*break*;
*case*2:
*case*3:
*if*(arrayMsg==*null*)
arrayMsg=*new**double*[MyVertex./K/];
*for*(*int*i=0; i<MyVertex./K/; i++)
arrayMsg[i]=in.readDouble();
*break*;
*default*:
*throw**new*IOException("message type invalid: "+msgType);
}
}
@Override
*public**void*write(DataOutput out) *throws*IOException {
out.writeLong(vertexID);
out.writeByte(msgType);
*switch*(msgType)
{
*case*1:
*case*4:
out.writeInt(intMsg);
*break*;
*case*2:
*case*3:
*if*(arrayMsg==*null*)
*throw**new*IOException("array message is null");
*for*(*int*i=0; i<MyVertex./K/; i++)
out.writeDouble(arrayMsg[i]);
*break*;
*default*:
*throw**new*IOException("message type invalid: "+msgType);
}
}