Re: Job manager crash

houssem Thu, 09 Sep 2021 14:25:16 -0700

Hello,

With respect to the API server, I don't re


On 2021/09/09 11:37:49, Yang Wang <danrtsey...@gmail.com> wrote: 
> I think @Robert Metzger <rmetz...@apache.org> is right. You need to check
> whether your Kubernetes APIServer is working properly or not (e.g.
> overloaded).
> 
> Another hint is about full GC. Please use the following config option to
> enable the GC logs and check the full GC time.
> env.java.opts.jobmanager: -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/flink/log/jobmanager-gc.log
> 
> Simply increasing the renew-deadline might help, but it will not solve the
> problem completely.
> high-availability.kubernetes.leader-election.lease-duration: 120 s
> high-availability.kubernetes.leader-election.renew-deadline: 120 s
> 
> 
> Best,
> Yang
> 
> On Thu, Sep 9, 2021 at 6:52 PM Robert Metzger <rmetz...@apache.org> wrote:
> 
> > Is the Kubernetes server you are using particularly busy? Maybe these
> > issues occur because the server is overloaded?
> >
> > "Triggering checkpoint 2193 (type=CHECKPOINT) @ 1630681482667 for job
> > 00000000000000000000000000000000."
> > "Completed checkpoint 2193 for job 00000000000000000000000000000000 (474
> > bytes in 195 ms)."
> > "Triggering checkpoint 2194 (type=CHECKPOINT) @ 1630681492667 for job
> > 00000000000000000000000000000000."
> > "Completed checkpoint 2194 for job 00000000000000000000000000000000 (474
> > bytes in 161 ms)."
> > "Renew deadline reached after 60 seconds while renewing lock
> > ConfigMapLock: myNs - myJob-dispatcher-leader
> > (1bcda6b0-8a5a-4969-b9e4-2257c4478572)"
> > "Stopping SessionDispatcherLeaderProcess."
> >
> > At some point, the leader election mechanism in fabric8 seems to give up.
> >
> >
> > On Tue, Sep 7, 2021 at 10:05 AM mejri houssem <mejrihousse...@gmail.com>
> > wrote:
> >
> >> Hello,
> >>
> >> Here are more logs from the latest JM crash.
> >>
> >>
> >> On Mon, Sep 6, 2021 at 2:18 PM houssem <mejrihousse...@gmail.com> wrote:
> >>
> >>> Hello,
> >>>
> >>> I have three jobs running on my Kubernetes cluster, and each job has its
> >>> own cluster id.
> >>>
> >>> On 2021/09/06 03:28:10, Yangze Guo <karma...@gmail.com> wrote:
> >>> > Hi,
> >>> >
> >>> > The root cause is not "java.lang.NoClassDefFound". The job has been
> >>> > running but could not edit the config map
> >>> > "myJob-00000000000000000000000000000000-jobmanager-leader", and it
> >>> > seems it was finally disconnected from the API server. Is there another
> >>> > job with the same cluster id (myJob)?
> >>> >
> >>> > I would also pull in Yang Wang.
> >>> >
> >>> > Best,
> >>> > Yangze Guo
> >>> >
> >>> > On Mon, Sep 6, 2021 at 10:10 AM Caizhi Weng <tsreape...@gmail.com>
> >>> wrote:
> >>> > >
> >>> > > Hi!
> >>> > >
> >>> > > There is a message saying "java.lang.NoClassDefFoundError:
> >>> org/apache/hadoop/hdfs/HdfsConfiguration" in your log file. Are you
> >>> accessing HDFS in your job? If so, it seems that your Flink distribution or
> >>> your cluster is missing the Hadoop classes. Please make sure that the
> >>> Hadoop jars are in the lib directory of Flink, or that your cluster has set
> >>> the HADOOP_CLASSPATH environment variable.
> >>> > >
> >>> > > On Sat, Sep 4, 2021 at 12:15 AM mejri houssem <mejrihousse...@gmail.com> wrote:
> >>> > >>
> >>> > >>
> >>> > >> Hello,
> >>> > >>
> >>> > >> I have been facing JM crashes lately. I am deploying a Flink application
> >>> cluster on Kubernetes.
> >>> > >>
> >>> > >> When I install my chart using Helm, everything works fine, but after
> >>> some time the JM starts to crash
> >>> > >>
> >>> > >> and then it eventually gets deleted after 5 restarts.
> >>> > >>
> >>> > >> Flink version: 1.12.5 (upgraded recently from 1.12.2)
> >>> > >> HA mode: k8s
> >>> > >>
> >>> > >> The full JM log is in the attached file.
> >>> >
> >>>
> >>
>
