You are welcome :)

Piotrek

Wed, 30 Jun 2021 at 08:34 Thomas Wang <w...@datability.io> wrote:

> Thanks Piotr. This is helpful.
>
> Thomas
>
> On Mon, Jun 28, 2021 at 8:29 AM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hi,
>>
>> You should still be able to get the Flink logs via:
>>
>> > yarn logs -applicationId application_1623861596410_0010
>>
>> And it should give you more answers about what has happened.
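>>
>> For example, you could dump the aggregated logs to a file and search them for the failure (this assumes YARN log aggregation is enabled on the cluster; the grep pattern is just a starting point):
>>
>> ```shell
>> # Fetch the aggregated YARN logs for the failed application,
>> # save them locally, then look for errors and exceptions.
>> yarn logs -applicationId application_1623861596410_0010 > flink-app.log
>> grep -iE "error|exception" flink-app.log | head -n 20
>> ```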
>>
>> About the Flink and YARN behaviour, have you seen the documentation? [1]
>> Especially this part:
>>
>> > Failed containers (including the JobManager) are replaced by YARN. The
>> maximum number of JobManager container restarts is configured via
>> yarn.application-attempts (default 1). The YARN Application will fail once
>> all attempts are exhausted.
>>
>> ?
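>>
>> As a sketch, raising that limit when starting the session could look like the following (the -Dkey=value form of passing options to yarn-session.sh and the value 4 are illustrative; note that YARN's own yarn.resourcemanager.am.max-attempts caps the effective number of attempts):
>>
>> ```shell
>> # Illustrative: allow up to 4 JobManager (ApplicationMaster) attempts
>> # instead of the default 1, so a lost container does not immediately
>> # fail the whole YARN application.
>> export HADOOP_CLASSPATH=`hadoop classpath`
>> /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached \
>>   -Dyarn.application-attempts=4
>> ```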
>>
>> Best,
>> Piotrek
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/yarn/#flink-on-yarn-reference
>>
>> Mon, 28 Jun 2021 at 02:26 Thomas Wang <w...@datability.io> wrote:
>>
>>> Just found some additional info. It looks like one of the EC2 instances
>>> got terminated at the time the crash happened and this job had 7 Task
>>> Managers running on that EC2 instance. Now I suspect that when YARN
>>> tried to migrate the Task Managers, there were no idle containers, as this
>>> job was using roughly 99% of the entire cluster. However, in that case,
>>> shouldn't YARN wait for containers to become available? I'm not quite sure
>>> how Flink would behave in this case. Could someone provide some insights
>>> here? Thanks.
>>>
>>> Thomas
>>>
>>> On Sun, Jun 27, 2021 at 4:24 PM Thomas Wang <w...@datability.io> wrote:
>>>
>>>> Hi,
>>>>
>>>> I recently experienced a job crash due to the underlying Yarn
>>>> application failing for some reason. Here is the only error message I saw.
>>>> It seems I can no longer see any of the Flink job logs.
>>>>
>>>> Application application_1623861596410_0010 failed 1 times (global limit
>>>> =2; local limit is =1) due to ApplicationMaster for attempt
>>>> appattempt_1623861596410_0010_000001 timed out. Failing the application.
>>>>
>>>> I was running the Flink job using the Yarn session mode with the
>>>> following command.
>>>>
>>>> export HADOOP_CLASSPATH=`hadoop classpath` &&
>>>> /usr/lib/flink/bin/yarn-session.sh -jm 7g -tm 7g -s 4 --detached
>>>>
>>>> I didn't have HA set up, but I believe the underlying YARN application
>>>> failure caused the crash, because if, for some reason, only the Flink job
>>>> had failed, the YARN application should still have survived. Please
>>>> correct me if this is not the right assumption.
>>>>
>>>> My question is: how should I find the root cause in this case, and what
>>>> is the recommended way to avoid this going forward?
>>>>
>>>> Thanks.
>>>>
>>>> Thomas
>>>>
>>>
