Hi Alexander,
Thanks a lot for your reply.
Yes, it was submitted via YARN. Do you mean the executor log files retrieved with yarn logs -applicationId id? In those files, some containers' stdout and stderr show the following:
16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to ip-172-31-20-104/172.31.20.104:49991
<------ could this be because Spark is unstable, and Spark may recover from this kind of error by itself? (I saw some of these in a successful run.)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
    ............
Caused by: java.net.ConnectException: Connection refused: ip-172-31-20-104/172.31.20.104:49991
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
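For the fetch failure above, I wonder whether the job would survive these transient errors if shuffle fetches retried more before giving up. A minimal sketch of what I mean (spark.shuffle.io.maxRetries and spark.shuffle.io.retryWait are standard Spark shuffle settings; the values here are only illustrative, not a confirmed fix):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: allow more retries before a block fetch is declared failed
    val conf = new SparkConf()
      .set("spark.shuffle.io.maxRetries", "10") // default is 3
      .set("spark.shuffle.io.retryWait", "10s") // pause between retries; default is 5s
    val sc = new SparkContext(conf)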
16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 16777216 bytes, TID = 100323
<------ would it be a memory leak issue? Though no GC exception was thrown, unlike other normal kinds of out-of-memory errors.
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 112.0 (TID 100323)
java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)
    at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
    ...........
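One guess about the "Filesystem closed" error above: tasks in the same executor JVM share a cached HDFS FileSystem instance, so if any one of them closes it, the others fail like this. If that is the cause here, disabling the cache might be worth a try. A minimal sketch, assuming the standard Hadoop property fs.hdfs.impl.disable.cache (this is only an experiment, not a confirmed fix):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf())
    // Sketch: each FileSystem.get call now returns a fresh instance
    // instead of the shared cached one that another task might close
    sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)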
Sorry, there is some of this information in the middle of the log file, but everything looks okay at the end of the log. In the run log file (log_file), generated by the command:

nohup spark-submit --driver-memory 20g --num-executors 20 --class com.dianrong.Main --master yarn-client dianrong-retention_2.10-1.0.jar doAnalysisExtremeLender /tmp/drretention/test/output 0.96 /tmp/drretention/evaluation/test_karthik/lgmodel /tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live 50 > log_file

the log shows:
executor 40 lost
<------ could this be the cause? Sometimes the job may fail for this reason.
    ..........
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
    at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
    ..........
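Following your point about YARN killing containers over memory: maybe the executors exceed their YARN memory limit and are killed, which would show up as "executor lost" with no error in my own log. If so, raising the executor memory overhead might help. A minimal sketch, assuming the Spark-on-YARN setting spark.yarn.executor.memoryOverhead from this Spark version (the value is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: reserve more off-heap headroom per executor
    // so the YARN container stays under its memory limit
    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "4096") // in MB; default is max(384, 0.10 * executor memory)
    val sc = new SparkContext(conf)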
Thanks in advance!
On Friday, June 17, 2016 3:52 PM, Alexander Kapustin <kp...@hotmail.com>
wrote:
Hi,
Did you submit the Spark job via YARN? In some cases (probably memory configuration), YARN can kill the containers where Spark tasks are executed. In this situation, please check the YARN userlogs for more information…

--
WBR, Alexander
From: Zhiliang Zhu
Sent: 17 June 2016, 9:36
To: Zhiliang Zhu; User
Subject: Re: spark job automatically killed without rhyme or reason

Has anyone ever met a similar problem? It is quite strange ...
On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID>
wrote:
Hi All,
I have a big job which takes more than one hour to run in full. However, it inexplicably exits midway (almost 80% of the job actually finished, but not all of it), without any apparent error or exception in the log.
I have submitted the same job many times, and it behaves the same way each time. The last line of the run log is just the single word "killed", or sometimes there is no error message at all; everything seems okay, yet the job should not have finished there.
What is the way to solve this problem? Has anyone else ever met a similar issue ...
Thanks in advance!