Thanks Peter, we're looking into it...

On Mon, Apr 25, 2022 at 11:54 AM Peter Schrott <pe...@bluerootlabs.io>
wrote:

> Hi,
>
> sorry for the late reply. It took me quite some time to get the logs out
> of the system. I have attached them now.
>
> It's the logs of 2 jobmanagers and 2 taskmanagers. On jm 1 it can be seen
> that the job starts crashing and recovering a few times. This goes on
> until 2022-04-20 12:12:14,607. After that, the behavior described above can
> be seen.
>
> I hope this helps.
>
> Best, Peter
>
> On Fri, Apr 22, 2022 at 12:06 PM Matthias Pohl <matth...@ververica.com>
> wrote:
>
>> FYI: I created FLINK-27354 [1] to cover the issue of retrying to connect
>> to the RM while shutting down the JobMaster.
>>
>> This doesn't explain your issue though, Peter. It's still unclear why the
>> JobMaster is still around as stated in my previous email.
>>
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-27354
>>
>> On Fri, Apr 22, 2022 at 11:54 AM Matthias Pohl <matth...@ververica.com>
>> wrote:
>>
>>> Just by looking through the code, it appears that these logs could be
>>> produced while stopping the job. At the end of the shutdown, the
>>> ResourceManager sends a confirmation of the JobMaster being disconnected
>>> back to the JobMaster. If the JobMaster is still around to process that
>>> message, it tries to reconnect (I'd consider that a bug, because the
>>> JobMaster is already in shutdown mode and wouldn't need to re-establish
>>> the connection). If the JobMaster had already terminated, the message
>>> would simply have been swallowed.
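>>>
>>> To illustrate the race described above, here is a purely hypothetical toy
>>> sketch in Java (the class and method names are made up and are not Flink's
>>> actual classes or signatures):
>>>
>>> // Toy sketch of the behavior described above -- hypothetical names, not
>>> // Flink's real API.
>>> class ToyJobMaster {
>>>
>>>     private volatile boolean shuttingDown;
>>>
>>>     // Invoked when the ResourceManager confirms that this JobMaster has
>>>     // been disconnected (the last message of the disconnect handshake).
>>>     void onDisconnectConfirmation(Exception cause) {
>>>         // Today the confirmation is handled like any other lost
>>>         // connection, i.e. the JobMaster immediately tries to register
>>>         // again -- which is what produces the repeated
>>>         // "Registering job manager ..." INFO logs.
>>>         // A guard on the shutdown state would avoid the pointless
>>>         // reconnect:
>>>         if (shuttingDown) {
>>>             return; // already going away, no need to reconnect
>>>         }
>>>         reconnectToResourceManager(cause);
>>>     }
>>>
>>>     private void reconnectToResourceManager(Exception cause) {
>>>         // retry the registration with the ResourceManager
>>>     }
>>> }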
>>>
>>> The only explanation I can come up with right now (without having any
>>> logs) is that stopping the JobMaster didn't finish for some reason. To
>>> verify that, it would be helpful to look at the logs and see whether there
>>> is some other issue that prevents the JobMaster from shutting down
>>> entirely.
>>>
>>> On Fri, Apr 22, 2022 at 10:14 AM Matthias Pohl <matth...@ververica.com>
>>> wrote:
>>>
>>>> ...if possible, it would be good to get debug rather than only info
>>>> logs. Did you encounter anything odd in the TaskManager logs as well?
>>>> Sharing those might be of value, too.
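>>>>
>>>> In case it helps: assuming the default log4j 2 properties setup that
>>>> ships with Flink (a sketch only, your deployment may use a different
>>>> logging backend or appender layout), raising the root logger level in
>>>> conf/log4j.properties should be enough:
>>>>
>>>>   # conf/log4j.properties -- change the root logger level from INFO to DEBUG
>>>>   rootLogger.level = DEBUG
>>>>
>>>> and then restart the job-/taskmanager processes so the change takes
>>>> effect.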
>>>>
>>>> On Fri, Apr 22, 2022 at 8:57 AM Matthias Pohl <matth...@ververica.com>
>>>> wrote:
>>>>
>>>>> Hi Peter,
>>>>> thanks for sharing. That doesn't sound right. Could you provide the
>>>>> entire jobmanager logs?
>>>>>
>>>>> Best,
>>>>> Matthias
>>>>>
>>>>> On Thu, Apr 21, 2022 at 6:08 PM Peter Schrott <pe...@bluerootlabs.io>
>>>>> wrote:
>>>>>
>>>>>> Hi Flink-Users,
>>>>>>
>>>>>> I am not sure whether this actually affects my cluster or not, but
>>>>>> since updating to Flink 1.15 (atm rc4) I have been seeing the following
>>>>>> logs:
>>>>>>
>>>>>> INFO: Registering job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>> @akka.tcp://
>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 for job
>>>>>> 5566648d9b1aac6c1a1b78187fd56975.
>>>>>>
>>>>>> as many times as the parallelism of the job (here 10 times). These
>>>>>> logs are triggered every 5 minutes.
>>>>>>
>>>>>> Then they are followed by:
>>>>>>
>>>>>> INFO: Registration of job manager ab7db9ff0ebd26b3b89c3e2e56684762
>>>>>> @akka.tcp://
>>>>>> fl...@flink-jobmanager-xxx.com:40015/user/rpc/jobmanager_2 failed.
>>>>>>
>>>>>> also 10 log entries.
>>>>>>
>>>>>> I followed the lifetime of the job (5566648d9b1aac6c1a1b78187fd56975):
>>>>>> it was a long-running SQL streaming job, started on Apr 13th on a
>>>>>> standalone cluster. After some recovery attempts it finally failed for
>>>>>> good (per the failover strategy) on Apr 20th (yesterday). Then those
>>>>>> logs started to appear. At that point there was no other job running on
>>>>>> my cluster anymore, but the logs kept appearing every 5 minutes until I
>>>>>> restarted the jobmanager service.
>>>>>>
>>>>>> This job is just one example; it happens to other jobs too.
>>>>>>
>>>>>> These are just INFO logs, but it does not look healthy either.
>>>>>>
>>>>>> Thanks & Best
>>>>>> Peter
>>>>>>
>>>>>
