Hi, after checking the YARN logs, all the error stacks look like the one below:
15/09/15 19:58:23 ERROR shuffle.OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)

It seems that an error occurs while trying to fetch the blocks, and after several retries the executor just dies with this error. As for your question, I did not see any executor restart during the job.

PS: the operator I am using during that stage is rdd.glom().mapPartitions().

java8964 <java8...@hotmail.com> wrote on Tue, Sep 15, 2015 at 11:44 PM:

> When you saw this error, did any executor die due to whatever error?
>
> Did you check whether any executor restarted during your job?
>
> It is hard to help you with just the stack trace. You need to tell us the
> whole picture of how your jobs are running.
> Yong
>
> ------------------------------
> From: qhz...@apache.org
> Date: Tue, 15 Sep 2015 15:02:28 +0000
> Subject: Re: application failed on large dataset
> To: user@spark.apache.org
>
> Has anyone met the same problems?
>
> 周千昊 <qhz...@apache.org> wrote on Mon, Sep 14, 2015 at 9:07 PM:
>
> Hi, community
> I am facing a strange problem:
> all executors do not respond, and then all of them fail with
> ExecutorLostFailure.
> When I look into the YARN logs, they are full of this exception:
>
> 15/09/14 04:35:33 ERROR shuffle.RetryingBlockFetcher: Exception while
> beginning fetch of 1 outstanding blocks (after 3 retries)
> java.io.IOException: Failed to connect to host/ip:port
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>         at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>         at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>         at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>         at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>         at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: host/ip:port
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>         at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
>         at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
>         ... 1 more
>
> The strange thing is that if I reduce the input size, the problem just
> disappears. I have found a similar issue in the mail archive (
> http://mail-archives.us.apache.org/mod_mbox/spark-user/201502.mbox/%3CCAOHP_tHRtuxDfWF0qmYDauPDhZ1=MAm5thdTfgAhXDN=7kq...@mail.gmail.com%3E
> ), but I didn't see a solution there. So I am wondering if anyone could
> help with that?
>
> My env is:
> hdp 2.2.6
> spark (1.4.1)
> mode: yarn-client
> spark-conf:
> spark.driver.extraJavaOptions -Dhdp.version=2.2.6.0-2800
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.6.0-2800
> spark.executor.memory 6g
> spark.storage.memoryFraction 0.3
> spark.dynamicAllocation.enabled true
> spark.shuffle.service.enabled true
>
> --
> Best Regards,
> ZhouQianhao
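Since the quoted config enables dynamic allocation together with the external shuffle service, shuffle fetches like the failing ones above are served by the shuffle service rather than by executors. If the failures turn out to be transient network errors, one thing to try is raising the shuffle-fetch retry settings; these property names exist in Spark 1.4, but the values below are only an illustrative sketch, not a recommendation from this thread:

```properties
# Illustrative values only; Spark 1.4 defaults are 3 retries, 5s wait, 120s timeout.
spark.shuffle.io.maxRetries  6
spark.shuffle.io.retryWait   10s
spark.network.timeout        300s
```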
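For reference on the rdd.glom().mapPartitions() operator mentioned in the PS above: a minimal plain-Python sketch of its semantics, assuming f receives an iterator whose single element is the whole partition as a list. The helper names here are hypothetical stand-ins, not Spark API, and this runs without Spark:

```python
# Plain-Python analogy of rdd.glom().mapPartitions(f) on a toy "RDD"
# represented as a list of partitions (hypothetical helpers, not Spark API).
partitions = [[1, 2, 3], [4, 5]]  # two partitions

def glom(parts):
    # glom: collapse each partition into a single list element
    return [[part] for part in parts]

def map_partitions(parts, f):
    # mapPartitions: apply f to an iterator over each partition's elements
    return [list(f(iter(part))) for part in parts]

def summarize(it):
    # after glom, the iterator yields exactly one list: the whole partition
    whole = next(it)
    yield sum(whole)

result = map_partitions(glom(partitions), summarize)
# result == [[6], [9]]
```

The point of the glom step is that f sees each partition materialized as one in-memory list, which also means a very large partition must fit in memory at once.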