Hi Anyang and Till,

I think we agreed on making the interval configurable in this case. Let me
revise the current PR. You can review it after that.
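
For context, here is a rough sketch of the kind of configurable failure-rate interval I have in mind, assuming the interval in question is the sliding window over which failed containers are counted against MAXIMUM_WORKERS_FAILURE_RATE. The class and parameter names below are illustrative only, not the actual names in the PR:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch (assumption): count container failures inside a configurable
// sliding time window and compare against a maximum, instead of using a
// hard-coded interval. Names are hypothetical, not Flink's real classes.
public class FailureRateTracker {
    private final long intervalMs;  // the configurable window
    private final int maxFailures;  // analogue of MAXIMUM_WORKERS_FAILURE_RATE

    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    public FailureRateTracker(long intervalMs, int maxFailures) {
        this.intervalMs = intervalMs;
        this.maxFailures = maxFailures;
    }

    /** Records a failure at {@code nowMs}; returns true if the rate is exceeded. */
    public boolean recordFailure(long nowMs) {
        failureTimestamps.addLast(nowMs);
        // Evict failures that have fallen out of the configured interval.
        while (!failureTimestamps.isEmpty()
                && nowMs - failureTimestamps.peekFirst() > intervalMs) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() > maxFailures;
    }
}
```

With a larger interval the same number of failures trips the limit more easily; with a smaller one, bursts are tolerated. The PR would expose the interval through the usual Flink configuration mechanism rather than a constructor argument.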



Best regards,
Peter Huang

On Thu, Sep 12, 2019 at 12:53 AM Anyang Hu <huanyang1...@gmail.com> wrote:

> Thanks Till, I will continue to follow this issue and see what we can do.
>
> Best regards,
> Anyang
>
> Till Rohrmann <trohrm...@apache.org> wrote on Wed, Sep 11, 2019 at 5:12 PM:
>
>> Suggestion 1 makes sense. For the quick termination, I think we need to
>> think a bit more about it to find a good solution that also supports
>> strict SLA requirements.
>>
>> Cheers,
>> Till
>>
>> On Wed, Sep 11, 2019 at 11:11 AM Anyang Hu <huanyang1...@gmail.com>
>> wrote:
>>
>>> Hi Till,
>>>
>>> Some of our online batch jobs have strict SLA requirements and must not
>>> be stuck for a long time. Therefore, we took a blunt approach and made
>>> the job exit immediately. Waiting for the connection to recover is a
>>> better solution. Maybe we need to add a timeout while waiting for the JM
>>> to restore the connection?
>>>
>>> As for suggestion 1 (making the interval configurable), we have already
>>> done this internally, and we would like to contribute it back to the
>>> community if possible.
>>>
>>> Best regards,
>>> Anyang
>>>
>>> Till Rohrmann <trohrm...@apache.org> wrote on Mon, Sep 9, 2019 at 3:09 PM:
>>>
>>>> Hi Anyang,
>>>>
>>>> I think we cannot take your proposal, because it means that whenever we
>>>> want to call notifyAllocationFailure while there is a connection problem
>>>> between the RM and the JM, we fail the whole cluster. This is something
>>>> a robust and resilient system should not do, because connection problems
>>>> are expected and need to be handled gracefully. Instead, if one deems
>>>> the notifyAllocationFailure message to be very important, one would need
>>>> to keep it and deliver it to the JM once it has reconnected.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Sun, Sep 8, 2019 at 11:26 AM Anyang Hu <huanyang1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Peter,
>>>>>
>>>>> For our online batch jobs, there is a scenario where the number of
>>>>> failed containers reaches MAXIMUM_WORKERS_FAILURE_RATE but the client
>>>>> does not exit immediately (the probability of losing the JM increases
>>>>> greatly when thousands of containers are being started). We found that
>>>>> a JM disconnection (the reason for the JM loss is unknown) causes
>>>>> notifyAllocationFailure to have no effect.
>>>>>
>>>>> Since FLINK-13184
>>>>> <https://jira.apache.org/jira/browse/FLINK-13184> introduced
>>>>> multi-threaded container startup, the JM disconnection situation has
>>>>> been alleviated. To make the client exit immediately in a reliable way,
>>>>> we use the following code to decide whether to call onFatalError when
>>>>> MaximumFailedTaskManagerExceedingException occurs:
>>>>>
>>>>> @Override
>>>>> public void notifyAllocationFailure(JobID jobId, AllocationID allocationId, Exception cause) {
>>>>>    validateRunsInMainThread();
>>>>>
>>>>>    JobManagerRegistration jobManagerRegistration = jobManagerRegistrations.get(jobId);
>>>>>    if (jobManagerRegistration != null) {
>>>>>       jobManagerRegistration.getJobManagerGateway().notifyAllocationFailure(allocationId, cause);
>>>>>    } else {
>>>>>       if (exitProcessOnJobManagerTimedout) {
>>>>>          ResourceManagerException exception = new ResourceManagerException(
>>>>>             "Job Manager is lost, can not notify allocation failure.");
>>>>>          onFatalError(exception);
>>>>>       }
>>>>>    }
>>>>> }
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Anyang
>>>>>
>>>>>