Hi all,

I am using Giraph 1.3 (a non-release build) on top of Hadoop 3.1.1.
The cluster has 8 workers, each with 60 GB of heap space.
I am trying to run some long-running graph jobs on a billion-edge graph.

The job runs for about 20 supersteps and then fails with a
"Connection reset by peer" error:
2021-01-25 22:52:31,811 FATAL [netty-server-worker-2] graph.GraphTaskManager (GraphTaskManager.java:uncaughtException(1124)) - uncaughtException: OverrideExceptionHandler on thread netty-server-worker-2, msg = Connection reset by peer, exiting...
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
    at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:871)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:208)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:118)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
    at java.lang.Thread.run(Thread.java:748)

I am running the job with the following configuration (some of the
options are taken from the Facebook configuration example):
-ca giraph.useNettyPooledAllocator=true \
-ca giraph.nettyAutoRead=false \
-ca giraph.nettyClientThreads=14 \
-ca giraph.channelsPerServer=7 \
-ca giraph.useBigDataIOForMessages=true \
-ca giraph.messageEncodeAndStoreType=EXTRACT_BYTEARRAY_PER_PARTITION \
-ca giraph.numComputeThreads=14 \

What other configuration options can I tune to make sure that the job
completes?

Thanks,
Animesh.
