Re: Re: Flink on YARN job fails at startup: The assigned slot container_e10_1579661300080_0005_01_000002_0 was removed.

2020-01-23 Thread tison
What you posted above is taskmanager.err; what we need is taskmanager.log.

Best,
tison.
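
For readers following along: the slot name in the subject line appears to follow the pattern of a TaskManager resource ID plus a slot index, and on YARN the resource ID is the container ID. The sketch below (Python, not part of the original thread) splits the name and derives the YARN application ID, so that the aggregated logs, which should include taskmanager.log, can be requested with `yarn logs -applicationId`. The function name, the regular expression, and the assumed layout `container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>_<slotIndex>` are illustrative assumptions.

```python
import re

# Assumed layout of the slot name (an assumption, see the lead-in):
# container_[e<epoch>_]<clusterTs>_<appSeq>_<attempt>_<containerSeq>_<slotIndex>
SLOT_PATTERN = re.compile(
    r"^(container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+)_(\d+)$"
)


def describe_slot(slot_name):
    """Split a slot name into container ID, slot index, and application ID."""
    match = SLOT_PATTERN.match(slot_name)
    if match is None:
        raise ValueError(f"unexpected slot name: {slot_name!r}")
    container_id, cluster_ts, app_seq, slot_index = match.groups()
    application_id = f"application_{cluster_ts}_{app_seq}"
    return {
        "container_id": container_id,
        "slot_index": int(slot_index),
        "application_id": application_id,
        # Only -applicationId is used here; further options differ by Hadoop version.
        "log_command": f"yarn logs -applicationId {application_id}",
    }


info = describe_slot("container_e10_1579661300080_0005_01_000002_0")
print(info["log_command"])
```

The output of the printed command covers every container of the application, so the TaskManager section still has to be located by its container ID.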


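郑 洁锋 wrote on Thu, Jan 23, 2020 at 10:22 PM:

> So the job failed earlier, and when it was restarted later the checkpoint files were missing? Is that what you mean?
>
>
> zjfpla...@hotmail.com
>
> From: zhisheng
> Sent: 2020-01-22 16:45
> To: user-zh
> Subject: Re: Flink on YARN job fails at startup: The assigned slot
> container_e10_1579661300080_0005_01_02_0 was removed.
> Your job most likely failed at some point before this.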
>
> 郑 洁锋 wrote on Wed, Jan 22, 2020 at 11:16 AM:
>
> > Hi all,
> > When starting a Flink on YARN job, I get the error: The assigned slot
> > container_e10_1579661300080_0005_01_02_0 was removed.
> > Environment: Flink 1.8.1, CDH 5.14.2, Kafka 0.10, JDK 1.8.0_241
> >
> > The Flink version is 1.8.1. Logs from YARN:
> >
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  Starting YarnJobClusterEntrypoint (Version: , Rev:7297bac, Date:24.06.2019 @ 23:04:28 CST)
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  OS current user: cloudera-scm
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  Current Hadoop/Kerberos user: root
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.241-b07
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  Maximum heap size: 406 MiBytes
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  JAVA_HOME: /usr/java/default
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  Hadoop version: 2.6.5
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  JVM Options:
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint: -Xms424m
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint: -Xmx424m
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  Program Arguments: (none)
> > 20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  Classpath:
> >

Re: Re: Flink on YARN job fails at startup: The assigned slot container_e10_1579661300080_0005_01_000002_0 was removed.

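2020-01-23 Thread 郑 洁锋
So the job failed earlier, and when it was restarted later the checkpoint files were missing? Is that what you mean?


zjfpla...@hotmail.com

From: zhisheng
Sent: 2020-01-22 16:45
To: user-zh
Subject: Re: Flink on YARN job fails at startup: The assigned slot
container_e10_1579661300080_0005_01_02_0 was removed.
Your job most likely failed at some point before this.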


Re: Re: Flink on YARN job fails at startup: The assigned slot container_e10_1579661300080_0005_01_000002_0 was removed.

2020-01-23 Thread 郑 洁锋
The logs are already in my earlier email.


zjfpla...@hotmail.com

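From: tison
Sent: 2020-01-22 12:10
To: user-zh
Subject: Re: Re: Flink on YARN job fails at startup: The assigned slot
container_e10_1579661300080_0005_01_02_0 was removed.
Then take a look at the TM log on the machine where the TM ran. From the JM side, the TM did start successfully at one point and registered itself, so check how the TM died, or whatever else happened to it.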

Best,
tison.


郑 洁锋 wrote on Wed, Jan 22, 2020 at 11:54 AM:

> The TM did not come up. The server itself has enough memory and CPU and is mostly idle.
>
> 
> zjfpla...@hotmail.com
>
> From: tison
> Sent: 2020-01-22 11:25
> To: user-zh
> Subject: Re: Flink on YARN job fails at startup: The assigned slot
> container_e10_1579661300080_0005_01_02_0 was removed.
> 20/01/22 11:08:49 INFO yarn.YarnResourceManager: Closing TaskExecutor connection container_e10_1579661300080_0005_01_02 because: The heartbeat of TaskManager with id container_e10_1579661300080_0005_01_02 timed out.
>
> When you requested resources, the slot request was sent to this machine, and then its heartbeat timed out. Check whether the TM started up properly, or whether it ran out of resources or crashed.
>
> Best,
> tison.
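
A rough way to sanity-check the heartbeat explanation against the quoted log lines is to compare the time between the first ClusterEntrypoint line (11:07:53) and the "Closing TaskExecutor connection" line (11:08:49) with the configured heartbeat timeout. The sketch below is in Python and is not part of the original thread. It assumes the default `heartbeat.timeout` of 50000 ms, which to my knowledge applies unless flink-conf.yaml overrides it, and the helper name is hypothetical. Because the TM registration time is not in the quoted log, the result is only an upper bound on how long the TM was silent.

```python
from datetime import datetime

# Timestamp layout used by the quoted log lines, e.g. "20/01/22 11:07:53".
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"

# Assumption: default heartbeat.timeout in milliseconds; check flink-conf.yaml.
ASSUMED_HEARTBEAT_TIMEOUT_MS = 50_000


def seconds_between(first_line, second_line):
    """Return the seconds elapsed between the timestamps of two log lines."""
    start = datetime.strptime(first_line[:17], LOG_TIME_FORMAT)
    end = datetime.strptime(second_line[:17], LOG_TIME_FORMAT)
    return (end - start).total_seconds()


elapsed = seconds_between(
    "20/01/22 11:07:53 INFO entrypoint.ClusterEntrypoint:  Starting",
    "20/01/22 11:08:49 INFO yarn.YarnResourceManager: Closing TaskExecutor",
)
# True means the gap is long enough for one full timeout period to have passed.
consistent_with_timeout = elapsed * 1000 >= ASSUMED_HEARTBEAT_TIMEOUT_MS
print(elapsed, consistent_with_timeout)
```

A gap only slightly longer than the timeout would suggest that the TM registered shortly after the JM started and then never sent another heartbeat, but the taskmanager.log is still needed to confirm that.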

Re: java.lang.StackOverflowError

2020-01-23 Thread Piotr Nowojski
Hi,

Thanks for reporting the issue. Could you first try to upgrade to Flink 1.6.4? 
This might be a known issue fixed in a later bug fix release [1].

Also, are you sure you are using (unmodified) Flink 1.6.2? The stack traces do not seem to match the 1.6.2 release tag in the repository. For example, this frame:

at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)

does not match the code [2]. (But first, try upgrading Flink.)

Piotrek


[1] https://issues.apache.org/jira/browse/FLINK-10367 

[2] https://github.com/apache/flink/blob/3456ad0dcacb3163e8aecf8fbe2dd684404cf37d/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteInputChannel.java#L380
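
The "Caused by" section of the trace quoted below repeats the same group of frames, which is the usual signature of unbounded recursion. As a reading aid, here is a minimal Python sketch (not part of the original thread) that finds the shortest block of frames repeated at the end of a frame list. The function name and the shortened frame labels are illustrative; the frames themselves are copied from the quoted trace, cut off after the second occurrence of notifyBufferAvailable at line 380.

```python
def shortest_repeating_suffix(frames, min_repeats=2):
    """Return (block, repeats) for the shortest block repeated at the end of frames."""
    n = len(frames)
    for period in range(1, n // min_repeats + 1):
        block = frames[n - period:]
        repeats = 1
        # Walk backwards in steps of `period` while the preceding slice equals the block.
        while (repeats + 1) * period <= n and \
                frames[n - (repeats + 1) * period:n - repeats * period] == block:
            repeats += 1
        if repeats >= min_repeats:
            return block, repeats
    return [], 0


frames = [
    "SingleInputGate.notifyChannelNonEmpty:656",
    "InputChannel.notifyChannelNonEmpty:125",
    "InputChannel.setError:203",
    "RemoteInputChannel.notifyBufferAvailable:403",
    "LocalBufferPool.recycle:282",
    "NetworkBuffer.deallocate:172",
    "AbstractReferenceCountedByteBuf.release0:95",
    "AbstractReferenceCountedByteBuf.release:84",
    "NetworkBuffer.recycleBuffer:147",
    "RemoteInputChannel.notifyBufferAvailable:380",
    "LocalBufferPool.recycle:282",
    "NetworkBuffer.deallocate:172",
    "AbstractReferenceCountedByteBuf.release0:95",
    "AbstractReferenceCountedByteBuf.release:84",
    "NetworkBuffer.recycleBuffer:147",
    "RemoteInputChannel.notifyBufferAvailable:380",
]

block, repeats = shortest_repeating_suffix(frames)
for frame in block:
    print(frame)
```

This only makes the loop visible (recycling a buffer notifies the channel, which recycles another buffer, and so on); it does not by itself show why the recursion fails to terminate.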
 


> On 22 Jan 2020, at 03:24, 刘建刚 wrote:
> 
> I am using Flink 1.6.2 on YARN. The state backend is RocksDB.
> 
>> On 22 Jan 2020, at 10:15 AM, 刘建刚 wrote:
>> 
>>   I have a Flink job that fails occasionally, and I would like to avoid this
>> problem. Can anyone help? The error stack trace is as follows:
>> java.io.IOException: java.lang.StackOverflowError
>>  at org.apache.flink.runtime.io.network.partition.consumer.InputChannel.checkError(InputChannel.java:191)
>>  at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.getNextBuffer(RemoteInputChannel.java:194)
>>  at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:589)
>>  at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:546)
>>  at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:175)
>>  at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:236)
>>  at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
>>  at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:335)
>>  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:754)
>>  at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.lang.StackOverflowError
>>  at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.notifyChannelNonEmpty(SingleInputGate.java:656)
>>  at org.apache.flink.runtime.io.network.partition.consumer.InputChannel.notifyChannelNonEmpty(InputChannel.java:125)
>>  at org.apache.flink.runtime.io.network.partition.consumer.InputChannel.setError(InputChannel.java:203)
>>  at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:403)
>>  at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
>>  at org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
>>  at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
>>  at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
>>  at org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
>>  at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
>>  at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
>>  at org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
>>  at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
>>  at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:84)
>>  at org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:147)
>>  at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.notifyBufferAvailable(RemoteInputChannel.java:380)
>>  at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:282)
>>  at org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:172)
>>  at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release0(AbstractReferenceCountedByteBuf.java:95)
>>  at 
>>