Hi all,

We've been able to run Giraph on a production cluster on small and moderate (1.7B edges) datasets. However, when trying a large dataset (10B+ edges), the workers start logging tons of Netty warnings, and eventually the job as a whole dies, usually with the master reporting missing workers and killing the job. All of this happens during superstep -1. Are there any obvious things to try here?
Thank you,
Piotr

2015-07-17 16:24:46,503 INFO [main] org.apache.giraph.comm.netty.NettyClient: checkRequestsForProblems: Re-issuing request (reqId=29,destAddr=ds0701.liveramp.net:30105,elapsedNanos=61867763,started=Tue Feb 10 18:47:11 PST 1970)
2015-07-17 16:24:46,503 INFO [main] org.apache.giraph.comm.netty.NettyClient: checkRequestsForProblems: Re-issuing request (reqId=55,destAddr=ds0640.liveramp.net:30020,elapsedNanos=61877851,started=Tue Feb 10 18:47:11 PST 1970)
2015-07-17 16:24:46,503 INFO [main] org.apache.giraph.comm.netty.NettyClient: checkRequestsForProblems: Re-issuing request (reqId=61,destAddr=ds0689.liveramp.net:30103,elapsedNanos=61887313,started=Tue Feb 10 18:47:11 PST 1970)
2015-07-17 16:24:46,503 INFO [main] org.apache.giraph.comm.netty.NettyClient: checkRequestsForProblems: Re-issuing request (reqId=51,destAddr=ds0665.liveramp.net:30044,elapsedNanos=61895473,started=Tue Feb 10 18:47:11 PST 1970)
2015-07-17 16:24:54,965 INFO [netty-client-worker-0] org.apache.giraph.comm.netty.handler.ResponseClientHandler: messageReceived: Already received response for (taskId = 93, requestId = 18)
2015-07-17 16:25:03,778 WARN [main] org.apache.giraph.comm.netty.NettyClient: checkRequestsForProblems: Problem with request id (destTask=90,reqId=30) connected = true, future done = false, success = false, cause = null, elapsed time = 609563, destination = ds0619.liveramp.net/10.100.132.111:30090 (reqId=30,destAddr=ds0619.liveramp.net:30090,elapsedNanos=609563278574,started=Tue Feb 10 18:37:18 PST 1970,writeDone=false,writeSuccess=false)
2015-07-17 16:25:03,778 WARN [main] org.apache.giraph.comm.netty.NettyClient: checkRequestsForProblems: Problem with request id (destTask=58,reqId=30) connected = true, future done = false, success = false, cause = null, elapsed time = 609563, destination = ds0649.liveramp.net/10.100.132.141:30058 (reqId=30,destAddr=ds0649.liveramp.net:30058,elapsedNanos=609563350498,started=Tue Feb 10 18:37:18 PST 1970,writeDone=false,writeSuccess=false)
2015-07-17 16:25:03,778 WARN [main] org.apache.giraph.comm.netty.NettyClient: checkRequestsForProblems: Problem with request id (destTask=42,reqId=31) connected = true, future done = false, success = false, cause = null, elapsed time = 609563, destination = ds0671.liveramp.net/10.100.132.163:30042 (reqId=31,destAddr=ds0671.liveramp.net:30042,elapsedNanos=609563356568,started=Tue Feb 10 18:37:18 PST 1970,writeDone=false,writeSuccess=false)
