Hello, with respect to the api-server I don't re
On 2021/09/09 11:37:49, Yang Wang <danrtsey...@gmail.com> wrote:
> I think @Robert Metzger <rmetz...@apache.org> is right. You need to check
> whether your Kubernetes APIServer is working properly or not(e.g.
> overloaded).
>
> Another hint is about the fullGC. Please use the following config option to
> enable the GC logs and check the full gc time.
> env.java.opts.jobmanager: -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -Xloggc:/opt/flink/log/jobmanager-gc.log
>
> Simply increasing the renew-deadline might help. But it could not solve the
> problem completely.
> high-availability.kubernetes.leader-election.lease-duration: 120 s
> high-availability.kubernetes.leader-election.renew-deadline: 120 s
>
>
> Best,
> Yang
>
> Robert Metzger <rmetz...@apache.org> wrote on Thu, Sep 9, 2021 at 6:52 PM:
>
> > Is the kubernetes server you are using particularly busy? Maybe these
> > issues occur because the server is overloaded?
> >
> > "Triggering checkpoint 2193 (type=CHECKPOINT) @ 1630681482667 for job
> > 00000000000000000000000000000000."
> > "Completed checkpoint 2193 for job 00000000000000000000000000000000 (474
> > bytes in 195 ms)."
> > "Triggering checkpoint 2194 (type=CHECKPOINT) @ 1630681492667 for job
> > 00000000000000000000000000000000."
> > "Completed checkpoint 2194 for job 00000000000000000000000000000000 (474
> > bytes in 161 ms)."
> > "Renew deadline reached after 60 seconds while renewing lock
> > ConfigMapLock: myNs - myJob-dispatcher-leader
> > (1bcda6b0-8a5a-4969-b9e4-2257c4478572)"
> > "Stopping SessionDispatcherLeaderProcess."
> >
> > At some point, the leader election mechanism in fabric8 seems to give up.
> >
> >
> > On Tue, Sep 7, 2021 at 10:05 AM mejri houssem <mejrihousse...@gmail.com>
> > wrote:
> >
> >> hello,
> >>
> >> Here's other logs of the latest jm crash.
> >>
> >>
> >> On Mon, Sep 6, 2021 at 14:18, houssem <mejrihousse...@gmail.com>
> >> wrote:
> >>
> >>> hello,
> >>>
> >>> I have three jobs running on my kubernetes cluster and each job has his
> >>> own cluster id.
> >>>
> >>> On 2021/09/06 03:28:10, Yangze Guo <karma...@gmail.com> wrote:
> >>> > Hi,
> >>> >
> >>> > The root cause is not "java.lang.NoClassDefFound". The job has been
> >>> > running but could not edit the config map
> >>> > "myJob-00000000000000000000000000000000-jobmanager-leader" and it
> >>> > seems finally disconnected with the API server. Is there another job
> >>> > with the same cluster id (myJob) ?
> >>> >
> >>> > I would also pull Yang Wang.
> >>> >
> >>> > Best,
> >>> > Yangze Guo
> >>> >
> >>> > On Mon, Sep 6, 2021 at 10:10 AM Caizhi Weng <tsreape...@gmail.com>
> >>> > wrote:
> >>> > >
> >>> > > Hi!
> >>> > >
> >>> > > There is a message saying "java.lang.NoClassDefFound Error:
> >>> > > org/apache/hadoop/hdfs/HdfsConfiguration" in your log file. Are you
> >>> > > visiting HDFS in your job? If yes it seems that your Flink
> >>> > > distribution or your cluster is lacking hadoop classes. Please make
> >>> > > sure that there are hadoop jars in the lib directory of Flink, or
> >>> > > your cluster has set the HADOOP_CLASSPATH environment variable.
> >>> > >
> >>> > > mejri houssem <mejrihousse...@gmail.com> wrote on Sat, Sep 4, 2021
> >>> > > at 12:15 AM:
> >>> > >>
> >>> > >>
> >>> > >> Hello ,
> >>> > >>
> >>> > >> I am facing a JM crash lately. I am deploying a flink application
> >>> > >> cluster on kubernetes.
> >>> > >>
> >>> > >> When i install my chart using helm everything works fine but after
> >>> > >> some time ,the Jm starts to crash
> >>> > >>
> >>> > >> and then it gets deleted eventually after 5 restarts.
> >>> > >>
> >>> > >> flink version: 1.12.5 (upgraded recently from 1.12.2)
> >>> > >> HA mode : k8s
> >>> > >>
> >>> > >> Here's the full log of the JM attached file.
> >>> > >
> >>> >