Thanks Peter, we're looking into it...

On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io> wrote:
> Hi,
>
> sorry for the late reply. It took me quite some time to get the logs out
> of the system. I have attached them now.
>
> These are the logs of 2 jobmanagers and 2 taskmanagers. It can be seen on
> jm 1 that the job starts crashing and recovering a few times. This happens
> until 2022-04-20 12:12:14,607. After that, the behavior described above
> can be seen.
>
> I hope this helps.
>
> Best, Peter
>
> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <matth...@ververica.com>
> wrote:
>
>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect
>> to the RM while shutting down the JobMaster.
>>
>> This doesn't explain your issue though, Peter. It's still unclear why
>> the JobMaster is still around, as stated in my previous email.
>>
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>
>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <matth...@ververica.com>
>> wrote:
>>
>>> Just by looking through the code, it appears that these logs could be
>>> produced while stopping the job. The ResourceManager sends a
>>> confirmation of the JobMaster being disconnected back to the JobMaster
>>> at the end. If the JobMaster is still around to process that request,
>>> it tries to reconnect (I'd consider that a bug, because the JobMaster
>>> is already in shutdown mode and wouldn't need to re-establish the
>>> connection). If the JobMaster had already terminated, the message would
>>> have been swallowed instead.
>>>
>>> The only explanation I can come up with right now (without having any
>>> logs) is that stopping the JobMaster didn't finish for some reason. For
>>> that, it would be helpful to look at the logs to see whether there is
>>> some other issue that prevents the JobMaster from stopping entirely.
>>>
>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <matth...@ververica.com>
>>> wrote:
>>>
>>>> ...if possible, it would be good to get debug rather than only info
>>>> logs. Did you encounter anything odd in the TaskManager logs as well?
>>>> Sharing those might be of value too.
>>>>
>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <matth...@ververica.com>
>>>> wrote:
>>>>
>>>>> Hi Peter,
>>>>> thanks for sharing. That doesn't sound right. Could you provide the
>>>>> entire jobmanager logs?
>>>>>
>>>>> Best,
>>>>> Matthias
>>>>>
>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>>>>> wrote:
>>>>>
>>>>>> Hi Flink-Users,
>>>>>>
>>>>>> I am not sure whether this does something to my cluster or not, but
>>>>>> since updating to Flink 1.15 (atm rc4) I find the following logs:
>>>>>>
>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>> @akka.tcp://
>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
>>>>>> 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>
>>>>>> as many times as the job's parallelism (here 10 times). These log
>>>>>> entries are triggered every 5 minutes.
>>>>>>
>>>>>> They are then followed by:
>>>>>>
>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>> @akka.tcp://
>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>
>>>>>> also 10 log entries.
>>>>>>
>>>>>> I followed the lifetime of the job
>>>>>> (5566648d9b1aac6c1a1b78187fd56975); it was a long-running SQL
>>>>>> streaming job, started on Apr 13th on a standalone cluster. After
>>>>>> some recovery attempts it finally failed for good (using the
>>>>>> failover strategy) on Apr 20th (yesterday). Then those logs started
>>>>>> to appear. Although no other job was running on my cluster anymore,
>>>>>> the logs kept appearing every 5 minutes until I restarted the
>>>>>> jobmanager service.
>>>>>>
>>>>>> This job was just an example; it happens with other jobs too.
>>>>>>
>>>>>> These are just INFO logs, but it does not look healthy either.
>>>>>>
>>>>>> Thanks & Best
>>>>>> Peter
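[Editor's note] The reconnect-during-shutdown behavior Matthias describes (tracked in FLINK-27354) amounts to a missing shutdown guard: the JobMaster still reacts to the ResourceManager's disconnect confirmation by re-registering, even though it is already stopping. A minimal sketch of that guard follows; the class and method names (`MiniJobMaster`, `onResourceManagerDisconnect`) are illustrative only, not actual Flink APIs.

```java
// Hypothetical illustration of the guard suggested by FLINK-27354.
// This is NOT Flink code; names are invented for the sketch.
public class MiniJobMaster {
    private boolean shuttingDown = false;
    private int reconnectAttempts = 0;

    // Entered when the job is being torn down.
    void startShutdown() {
        shuttingDown = true;
    }

    // Invoked when the RM confirms the JobMaster has been disconnected.
    void onResourceManagerDisconnect() {
        if (shuttingDown) {
            // Already stopping: re-registering would only produce the
            // repeating "Registration of job manager ... failed" loop
            // seen in Peter's logs.
            return;
        }
        // Normal path: try to re-establish the RM connection.
        reconnectAttempts++;
    }

    int getReconnectAttempts() {
        return reconnectAttempts;
    }

    public static void main(String[] args) {
        MiniJobMaster jm = new MiniJobMaster();
        jm.onResourceManagerDisconnect(); // healthy: triggers a reconnect
        jm.startShutdown();
        jm.onResourceManagerDisconnect(); // shutting down: ignored
        System.out.println(jm.getReconnectAttempts());
    }
}
```

Under this sketch, only the first disconnect (before shutdown) counts as a reconnect attempt; the confirmation arriving during shutdown is dropped instead of restarting the registration loop.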