Worker ports are not synched properly with its peers
----------------------------------------------------
                 Key: GIRAPH-154
                 URL: https://issues.apache.org/jira/browse/GIRAPH-154
             Project: Giraph
          Issue Type: Bug
          Components: bsp
    Affects Versions: 0.2.0
            Reporter: Zhiwei Gu
            Assignee: Zhiwei Gu

When a worker tries multiple ports to set up the RPC server, the port it finally binds to is not synced properly with its peer workers, so the peers send messages to the default port. In the logs below, worker 161 starts its RPC server on port 36061 (bind attempt 1) but registers port=35061 in its health node, and every connection attempt to 35061 is refused.

Here are some logs:

############################################################################
Base port: 34900
############################################################################

############################################################################
log for worker 161:
############################################################################
IPC Server handler 98 on 36061: starting
BasicRPCCommunications: Started RPC communication server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:36061 with 100 handlers and 199 flush threads on bind attempt 1
IPC Server handler 99 on 36061: starting
setup: Registering health of this worker...
getJobState: Job state already exists (/_hadoopBsp/job_201203130609_14838/_masterJobState)
getApplicationAttempt: Node /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir already exists!
getApplicationAttempt: Node /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir already exists!
registerHealth: Created my health node for attempt=0, superstep=-1 with /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/gsta32085.tan.ygrid.yahoo.com_161 and workerInfo= Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061)
process: partitionAssignmentsReadyChanged (partitions are assigned)
startSuperstep: Ready for computation on superstep -1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201203130609_14838/_applicationAttemptsDir/0/_superstepDir/-1/_partitionAssignments
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 0 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 1 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 2 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 3 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 4 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 5 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 6 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 7 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 8 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 9 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 10 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 11 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 12 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 13 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 14 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 15 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 16 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 17 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 18 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 19 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 20 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 21 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 22 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 23 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 24 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 25 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 26 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 27 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 28 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 29 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 30 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 31 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 32 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 33 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 34 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 35 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 36 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 37 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 38 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 39 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 40 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 41 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 42 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 43 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 44 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 45 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 46 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 47 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 48 time(s).
Retrying connect to server: gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061. Already tried 49 time(s).
PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) cause:java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
connectAllRPCProxys: Failed on attempt 0 of 5 to connect to (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061),prev=null,ckpt_file=null)
java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
    at org.apache.hadoop.ipc.Client.call(Client.java:1071)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy8.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
    at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
    at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
    at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
    at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
    at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
    at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
    at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
    at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
    at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
    at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
    at org.apache.hadoop.ipc.Client.call(Client.java:1046)
    ... 25 more

############################################################################
log for worker 154
############################################################################
PriviledgedActionException as:job_201203130609_14838 (auth:SIMPLE) cause:java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
connectAllRPCProxys: Failed on attempt 4 of 5 to connect to (id=33,cur=Worker(hostname=gsta32085.tan.ygrid.yahoo.com, MRpartition=161, port=35061),prev=null,ckpt_file=null)
java.net.ConnectException: Call to gsta32085.tan.ygrid.yahoo.com/10.216.148.47:35061 failed on connection exception: java.net.ConnectException: Connection refused
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
    at org.apache.hadoop.ipc.Client.call(Client.java:1071)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy8.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
    at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:159)
    at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:155)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
    at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:153)
    at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:51)
    at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:599)
    at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:542)
    at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:513)
    at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:550)
    at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:458)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:630)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1082)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
    at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
    at org.apache.hadoop.ipc.Client.call(Client.java:1046)
    ... 25 more
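
The port mismatch reported in this issue (RPC server started on 36061 while the health node advertises port=35061) suggests that the worker publishes the port it first intended to use rather than the port it actually bound. The following is a minimal Java sketch of that idea, not Giraph's actual code. The class PortBindSketch and its methods are hypothetical, and both the "base port + MR partition" formula and the retry step of 1000 are only inferred from the numbers in the log (34900 + 161 = 35061, and 36061 - 35061 = 1000). The sketch binds with retries and reads the advertised port back from the bound socket, so the two values cannot drift apart.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical illustration of the port mismatch; this is not Giraph code. */
final class PortBindSketch {

    /** A bound server socket plus the zero-based bind attempt that succeeded. */
    static final class Bound {
        final ServerSocket socket;
        final int attempt;

        Bound(ServerSocket socket, int attempt) {
            this.socket = socket;
            this.attempt = attempt;
        }

        /** The port peers should be told about: the one actually bound. */
        int advertisedPort() {
            return socket.getLocalPort();
        }
    }

    /** Assumption inferred from the log: default port = base port + MR partition. */
    static int defaultPort(int basePort, int partition) {
        return basePort + partition;
    }

    /**
     * Tries firstPort, firstPort + step, firstPort + 2 * step, ... on the
     * loopback interface until one bind succeeds or maxAttempts is reached.
     */
    static Bound bindWithRetry(int firstPort, int step, int maxAttempts)
            throws IOException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int port = firstPort + attempt * step;
            ServerSocket socket = new ServerSocket();
            try {
                socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
                return new Bound(socket, attempt);
            } catch (BindException e) {
                socket.close(); // port is taken; move on to the next candidate
            }
        }
        throw new BindException(
            "No free port in " + maxAttempts + " attempts starting at " + firstPort);
    }

    public static void main(String[] args) throws IOException {
        int intended = defaultPort(34900, 161); // 35061, as in the report
        Bound bound = bindWithRetry(intended, 1000, 20);
        try {
            System.out.println("intended port:   " + intended);
            System.out.println("advertised port: " + bound.advertisedPort()
                + " (bind attempt " + bound.attempt + ")");
        } finally {
            bound.socket.close();
        }
    }
}
```

If the diagnosis above is right, a fix along these lines would register the value returned by advertisedPort() (or its RPC-server equivalent) in the worker's health node instead of the initially computed port.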