Hi,

I need inputs on a couple of issues that I am facing in my production
environment.

I am using Spark Streaming on Spark version 1.4.0.

1) When a worker is lost on a machine, its executor still shows up in the
Executors tab of the UI.

Even when I kill a worker using the kill -9 command, the worker and executor
both die on that machine, but the executor still shows up in the Executors
tab of the UI. The number of active tasks sometimes shows as negative on that
executor, and my job keeps failing with the following exception.

This usually happens only when a job is running. When no computation is
taking place on the cluster, i.e. suppose a 1-minute batch completes in 20
seconds and I then kill the worker, the executor entry is also removed from
the UI. But when I kill the worker while a job is still running, I always run
into this issue.
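
For reference, a minimal sketch of the kind of streaming setup involved (the
source and logic here are placeholders, not my actual code), just to show the
1-minute batch interval mentioned above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder app config; the real job runs on a standalone cluster.
    val conf = new SparkConf().setAppName("streaming-sketch")
    // 1-minute batch interval, as in the scenario described above.
    val ssc = new StreamingContext(conf, Minutes(1))

    // Placeholder source; the real job consumes from an external system.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}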


16/04/01 23:54:20 WARN TaskSetManager: Lost task 141.0 in stage 19859.0
(TID 190333, 192.168.33.96): java.io.IOException: Failed to connect to /
192.168.33.97:63276
        at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
        at
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
        at
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
        at
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
        at
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
        at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /
192.168.33.97:63276
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
        at
io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
        at
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
        at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
        ... 1 more



 When I relaunch the worker, new executors are added, but the dead one's
entry remains there until the application is killed.

 2) Another issue: when the disk becomes full on one of the workers, the
executor becomes unresponsive and the job gets stuck at a particular stage.
The exception that I can see in the executor logs is:


 16/03/29 10:49:00 ERROR DiskBlockObjectWriter: Uncaught exception while
reverting partial writes to file
/data/spark-e2fc248f-a212-4a99-9d6c-4e52d6a69070/executor-37679a6c-cb96-451e-a284-64d6b4fe9910/blockmgr-f8ca72f4-f329-468b-8e65-ef97f8fb285c/38/temp_shuffle_8f266d70-3fc6-41e5-bbaa-c413a7b08ea4
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:315)
at
org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:58)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:274)


As a workaround I have to kill the executor and clear the space on disk; a
new executor is then relaunched by the worker and the failed stages are
recomputed. But is it really the case that when disk space is full on one
machine, my whole application gets stuck?
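
For reference, the shuffle and spill files in question (the blockmgr-*
directories in the path above) land under spark.local.dir. A rough sketch of
how that is set on my side (the path is a placeholder, not my actual layout):

import org.apache.spark.SparkConf

// Sketch only: spark.local.dir controls where shuffle/spill files are
// written on each node. Pointing it at a larger volume is the only
// mitigation I can think of, short of manual cleanup.
val conf = new SparkConf()
  .setAppName("streaming-sketch")
  .set("spark.local.dir", "/data/spark-tmp")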




This is really becoming a bottleneck and leads to instability in my
production stack.

Please share your insights on this.


Thanks,
Abhi
