Hi all,

I am using Giraph 1.3 (not a release version) on top of Hadoop 3.1.1. The cluster has 8 workers with 60GB of heap space on each worker. I am trying to run some long-running graph jobs on a billion-edge graph.
The job runs for about 20 supersteps and then fails with a "Connection reset by peer" error:

2021-01-25 22:52:31,811 FATAL [netty-server-worker-2] graph.GraphTaskManager (GraphTaskManager.java:uncaughtException(1124)) - uncaughtException: OverrideExceptionHandler on thread netty-server-worker-2, msg = Connection reset by peer, exiting...
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:871)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:118)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
    at java.lang.Thread.run(Thread.java:748)

I am running the job with the following configuration (some of the options are taken from the Facebook configuration example):

-ca giraph.useNettyPooledAllocator=true \
-ca giraph.nettyAutoRead=false \
-ca giraph.nettyClientThreads=14 \
-ca giraph.channelsPerServer=7 \
-ca giraph.useBigDataIOForMessages=true \
-ca giraph.messageEncodeAndStoreType=EXTRACT_BYTEARRAY_PER_PARTITION \
-ca giraph.numComputeThreads=14 \

What other configuration options can I tune to make sure the job completes?

Thanks,
Animesh