Re: Flink job server with HA

Xintong Song Mon, 03 Jun 2019 20:19:40 -0700

If that is the case, then I would suggest you to check the following two
things:
1. Is the HA mode configured properly in Flink configuration? There should
be a config option "high-availability" in your flink-conf.yarml. If not
configured, the default value would be "NONE".
2. It "ClassPathJobGraphRetriever#retrieveJobGraph" actually invoked, and
is there any exceptions thrown from it. This is to verify whether the
correct code path for job cluster is invoked.


Thank you~

Xintong Song



On Tue, Jun 4, 2019 at 10:48 AM Boris Lublinsky <
boris.lublin...@lightbend.com> wrote:

> I am running on k8
> Job master runs as a deployment of 1, so just killing a pod restarts it
>
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>
> On Jun 3, 2019, at 9:46 PM, Xintong Song <tonysong...@gmail.com> wrote:
>
> So here are my questions:
> 1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?
> 2. How do you trigger "restart a Job Master"?
>
> Thank you~
> Xintong Song
>
>
>
> On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky <
> boris.lublin...@lightbend.com> wrote:
>
>> Thanks,
>> Thats what I thought initially.
>> The issue is that because of this, during restart, it does not know which
>> job was running before (it is obtained from submitted job graph store).
>> Because this is empty, there is no restarted jobs and the cluster does
>> not even try to restore checkpoints.
>> I can see that checkpoints are stored correctly, but they are never
>> accessed.
>>
>> Boris Lublinsky
>> FDP Architect
>> boris.lublin...@lightbend.com
>> https://www.lightbend.com/
>>
>> On Jun 3, 2019, at 9:23 PM, Xintong Song <tonysong...@gmail.com> wrote:
>>
>> Hi Boris,
>>
>> I think what you described that putJobGraph is not invoked in Flink job
>> cluster is by design and should not cause a failure of job recovering. For
>> a Flink job cluster, there is only one job graph to execute. Instead of
>> uploading job graph to an already running cluster (like in a session
>> cluster), the job graph in a Flink job cluster is uploaded before the
>> cluster is started, together with the Flink framework jars. Please refer to
>> MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.
>>
>> I think we need more information to find the root cause of your problem.
>> For example, can you explain what are the detailed operation steps do you
>> perform when you say "trying to restart a Job Master".
>>
>> Thank you~
>> Xintong Song
>>
>>
>>
>> On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <
>> boris.lublin...@lightbend.com> wrote:
>>
>>> I am trying to experiment with Flink Job server with HA and I am
>>> noticing, that in this case
>>> method putJobGraph in the class SubmittedJobGraphStore Is never
>>> invoked. (I can see that it is invoked in the case of session cluster when
>>> a job is added)
>>> As a result, when I am trying to restart a Job Master, it finds no
>>> running jobs and is not trying to restore it.
>>> Am I missing something?
>>>
>>>
>>>
>>> Boris Lublinsky
>>> FDP Architect
>>> boris.lublin...@lightbend.com
>>> https://www.lightbend.com/
>>>
>>>
>>
>

Re: Flink job server with HA

Reply via email to