I already replied to this in another email thread, so I am copying that reply here.

I have created a JIRA ticket to track the "too old resource version" issue [1].
Flink uses a Watcher to monitor Pod status changes. When the Watcher is closed abnormally, it triggers a fatal error, which in turn causes the JobManager to restart.

I have run some concrete tests on minikube, a self-built K8s cluster, and an Alibaba Cloud ACK cluster. In each case the job ran stably for more than a week without problems. I could only reproduce the issue by restarting the K8s APIServer. So I suspect that the network between your Pod and the APIServer is unstable, which would explain why you hit this problem so often.

[1]. https://issues.apache.org/jira/browse/FLINK-20417

Best,
Yang

lichunguang <lcg3234...@163.com> wrote on Monday, December 21, 2020 at 3:51 PM:

> A Flink 1.11.1 job is running in Application Mode on a K8s cluster, and the
> JobManager restarts once every hour with the error "Fatal error occurred in
> ResourceManager. io.fabric8.kubernetes.client.KubernetesClientException: too
> old resource version".
>
> Pod restarts:
> <http://apache-flink.147419.n8.nabble.com/file/t1176/11.jpg>
>
> Cause of the restart:
> 2020-12-10 07:21:19,290 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 247468999 (248117930)
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
> 2020-12-10 07:21:19,291 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 247468999 (248117930)
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
>
> From what I found online, the cause is at line 212 of the class
> org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient:
>
>     @Override
>     public KubernetesWatch watchPodsAndDoCallback(Map<String, String> labels,
>             PodCallbackHandler podCallbackHandler) {
>         return new KubernetesWatch(
>             this.internalClient.pods()
>                 .withLabels(labels)
>                 .watch(new KubernetesPodsWatcher(podCallbackHandler)));
>     }
>
> However, etcd only retains version information for a limited period of time:
> "I think it's standard behavior of Kubernetes to give 410 after some time
> during watch. It's usually client's responsibility to handle it. In the
> context of a watch, it will return HTTP_GONE when you ask to see changes for
> a resourceVersion that is too old - i.e. when it can no longer tell you what
> has changed since that version, since too many things have changed. In that
> case, you'll need to start again, by not specifying a resourceVersion in
> which case the watch will send you the current state of the thing you are
> watching and then send updates from that point."
>
> Has anyone run into the same problem, and how did you handle it? I have a
> few possible approaches in mind and would like to discuss them with everyone.
>
> --
> Sent from: http://apache-flink.147419.n8.nabble.com/
>
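
For reference, the "start again without a resourceVersion" handling described in the quoted explanation above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the fix tracked in FLINK-20417 and not the fabric8 API: `WatchSession`, `WatchClosedException`, and `watchWithRestart` are hypothetical names, and the API server is simulated so the example runs on its own.

```java
import java.util.function.Supplier;

// Hypothetical sketch: none of these types come from Flink or fabric8.
final class WatchRestartSketch {
    static final int HTTP_GONE = 410;

    /** Hypothetical stand-in for a watch failure that carries an HTTP status code. */
    static final class WatchClosedException extends RuntimeException {
        final int code;

        WatchClosedException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Hypothetical stand-in for one watch session; run() returns on a normal close. */
    interface WatchSession {
        void run();
    }

    /**
     * Runs a watch and opens a fresh one whenever it ends with HTTP 410.
     * The supplier is assumed to create the watch without a resourceVersion,
     * so each new session starts from the current state.
     * Returns the number of restarts performed before a normal close.
     */
    static int watchWithRestart(Supplier<WatchSession> newWatch, int maxRestarts) {
        int restarts = 0;
        while (true) {
            try {
                newWatch.get().run();
                return restarts;
            } catch (WatchClosedException e) {
                // Anything other than 410, or too many restarts, is still treated as fatal.
                if (e.code != HTTP_GONE || restarts >= maxRestarts) {
                    throw e;
                }
                restarts++;
            }
        }
    }

    public static void main(String[] args) {
        int[] sessions = {0};
        // Simulated API server: the first two sessions end with 410, the third closes normally.
        int restarts = watchWithRestart(() -> () -> {
            if (sessions[0]++ < 2) {
                throw new WatchClosedException(HTTP_GONE, "too old resource version");
            }
        }, 5);
        System.out.println("restarts = " + restarts);
    }
}
```

A real implementation would probably also need a backoff between restarts and some way to reconcile Pod events that were missed while the watch was down.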