Hi Ufuk,

I've read the documentation <https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/jobmanager_high_availability.html> and it's exactly as you say, thanks for the clarification.

Assuming one wants to run several jobs in parallel with different users on a secure cluster in HA mode, would you consider setting recovery.zookeeper.path.root from the startup script to be good practice?
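For context, a minimal sketch of how the keys in question would sit in flink-conf.yaml; the quorum address and storage directory below are placeholders, and only the non-default recovery.zookeeper.path.root is the point under discussion:

    recovery.mode: zookeeper
    recovery.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
    recovery.zookeeper.storageDir: hdfs:///flink/recovery
    # non-default root, so a second cluster does not try to recover this cluster's jobs
    recovery.zookeeper.path.root: /flink/cluster-a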
On Thu, Mar 17, 2016 at 11:51 AM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:
> Hi Ufuk,
>
> does the recovery.zookeeper.path.root property need to be set independently for each job that is run? Doesn't Flink take care of assigning some sort of identification to each job and storing their checkpoints independently?
>
> On Thu, Mar 17, 2016 at 11:43 AM, Ufuk Celebi <u...@apache.org> wrote:
>> Hey Simone! Did you set different recovery.zookeeper.path.root keys? The default is /flink, and if you don't change it for the 2nd cluster, it will try to recover the jobs of the first one. Can you gather the job manager logs as well, please?
>>
>> – Ufuk
>>
>> On Thu, Mar 17, 2016 at 11:31 AM, Simone Robutti <simone.robu...@radicalbit.io> wrote:
>>> Ok, I ran another test.
>>>
>>> I launched two identical jobs, one after the other, on YARN (without the long-running session). I then killed a job manager; both jobs ran into problems and then resumed their work after a few seconds. The problem is that the first job restored the state of the second job and vice versa.
>>>
>>> Here are the logs: https://gist.github.com/chobeat/38f8eee753aeaca51acc
>>>
>>> At line 141 of the first job and at line 131 of the second job I killed the job manager. As you can see, the first stopped at 48 and resumed at 39, while the second stopped at 38 and resumed at 48. I hope there's something wrong with my configuration, because otherwise this really looks like a bug.
>>>
>>> Thanks in advance,
>>>
>>> Simone
>>>
>>> 2016-03-16 18:55 GMT+01:00 Simone Robutti <simone.robu...@radicalbit.io>:
>>>> Actually the test was intended for a single job. The fact that there are more jobs is unexpected and it will be the first thing to verify. Considering these problems, we will go for deeper tests with multiple jobs.
>>>>
>>>> The logs are collected with "yarn logs", but log aggregation is not properly configured, so I wouldn't rely too much on that. Before doing the tests tomorrow I will clear all the existing logs just to be sure.
>>>>
>>>> 2016-03-16 18:19 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>>>>> OK, so you are submitting multiple jobs, but you submit them with -m yarn-cluster and therefore expect them to start separate YARN clusters. Makes sense, and I would expect the same.
>>>>>
>>>>> I think you can check in the client logs printed to stdout which cluster the job is submitted to.
>>>>>
>>>>> PS: The logs you have shared are out of order; how did you gather them? Do you have an idea why they are out of order? Maybe something is mixed up in the way we gather the logs and we only think that something is wrong because of this.
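As a sketch of what separate per-job submissions with distinct ZooKeeper roots could look like; the jar name, parallelism and the -yD dynamic-property flag are assumptions to verify against this CLI version, not something taken from the logs above:

    # each invocation is expected to bring up its own YARN application / JobManager,
    # each pointing at its own ZooKeeper root
    bin/flink run -m yarn-cluster -yn 2 \
        -yD recovery.zookeeper.path.root=/flink/job-a \
        ./streaming-job.jar

    bin/flink run -m yarn-cluster -yn 2 \
        -yD recovery.zookeeper.path.root=/flink/job-b \
        ./streaming-job.jar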
>>>>> On Wed, Mar 16, 2016 at 6:11 PM, Simone Robutti <simone.robu...@radicalbit.io> wrote:
>>>>>> I didn't resubmit the job. Also, the jobs are submitted one by one with -m yarn-master, not with a long-running YARN session, so I don't really know if they could mix up.
>>>>>>
>>>>>> I will repeat the test with a cleaned state, because we saw that killing the job with yarn application -kill left the "flink run" process alive, so that may be the problem. We just noticed it a few minutes ago.
>>>>>>
>>>>>> If the problem persists, I will eventually come back with a full log.
>>>>>>
>>>>>> Thanks for now,
>>>>>>
>>>>>> Simone
>>>>>>
>>>>>> 2016-03-16 18:04 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>>>>>>> Hey Simone,
>>>>>>>
>>>>>>> from the logs it looks like multiple jobs have been submitted to the cluster, not just one. The different files correspond to different jobs recovering. The filtered logs show three jobs running/recovering (with IDs 10d8ccae6e87ac56bf763caf4bc4742f, 124f29322f9026ac1b35435d5de9f625, 7f280b38065eaa6335f5c3de4fc82547).
>>>>>>>
>>>>>>> Did you manually re-submit the job after killing a job manager?
>>>>>>>
>>>>>>> Regarding the counts, it can happen that they are rolled back to a previous consistent state if the checkpoint was not completed yet (including the write to ZooKeeper). In that case the job state will be rolled back to an earlier consistent state.
>>>>>>>
>>>>>>> Can you please share the complete job manager logs of your program? The most helpful thing would be to have a log for each started job manager container. I don't know if that is easily possible.
>>>>>>>
>>>>>>> – Ufuk
>>>>>>>
>>>>>>> On Wed, Mar 16, 2016 at 4:12 PM, Simone Robutti <simone.robu...@radicalbit.io> wrote:
>>>>>>>> This is the log filtered to check messages from ZooKeeperCompletedCheckpointStore:
>>>>>>>>
>>>>>>>> https://gist.github.com/chobeat/0222b31b87df3fa46a23
>>>>>>>>
>>>>>>>> It looks like it finds only one checkpoint, but I'm not sure whether the different hashes and IDs of the checkpoints are meaningful or not.
>>>>>>>>
>>>>>>>> 2016-03-16 15:33 GMT+01:00 Ufuk Celebi <u...@apache.org>:
>>>>>>>>> Can you please have a look into the JobManager log file and report which checkpoints are restored? You should see messages from ZooKeeperCompletedCheckpointStore like:
>>>>>>>>> - Found X checkpoints in ZooKeeper
>>>>>>>>> - Initialized with X. Removing all older checkpoints
>>>>>>>>>
>>>>>>>>> You can share the complete job manager log file as well if you like.
>>>>>>>>>
>>>>>>>>> – Ufuk
>>>>>>>>>
>>>>>>>>> On Wed, Mar 16, 2016 at 2:50 PM, Simone Robutti <simone.robu...@radicalbit.io> wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I'm testing the checkpointing functionality with HDFS as a backend.
>>>>>>>>>>
>>>>>>>>>> From what I can see, it uses different checkpoint files and resumes the computation from different points, not from the latest available one. This is unexpected behaviour to me.
>>>>>>>>>>
>>>>>>>>>> I log every second, for every worker, a counter that is increased by 1 at each step.
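(For illustration only: a counter job of that shape could look roughly like the sketch below. This is an assumed reconstruction using the Checkpointed interface, not the actual test code; class and job names are made up.)

    import org.apache.flink.streaming.api.checkpoint.Checkpointed
    import org.apache.flink.streaming.api.functions.source.SourceFunction
    import org.apache.flink.streaming.api.scala._

    // Hypothetical source: emits an increasing counter once per second and keeps it
    // in checkpointed state, so a restore should continue close to the last value.
    class CounterSource extends SourceFunction[Long] with Checkpointed[java.lang.Long] {
      @volatile private var running = true
      private var count: Long = 0L

      override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
        while (running) {
          // emit and update under the checkpoint lock so snapshots see a consistent value
          ctx.getCheckpointLock.synchronized {
            count += 1
            ctx.collect(count)
          }
          Thread.sleep(1000)
        }
      }

      override def cancel(): Unit = running = false

      override def snapshotState(checkpointId: Long, timestamp: Long): java.lang.Long = count

      override def restoreState(state: java.lang.Long): Unit = count = state
    }

    object CheckpointedCounterJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.enableCheckpointing(1000) // checkpoint every second
        env.addSource(new CounterSource).print() // the printed value is what the logs show
        env.execute("checkpointed counter")
      }
    }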
>>>>>>>>>> So for example, on node-1 the count goes up to 5, then I kill a job manager or task manager and it resumes from 5 or 4, which is fine. The next time I kill a job manager the count is at 15 and it resumes at 14 or 15. Sometimes it may happen that at a third kill the work resumes at 4 or 5, as if the checkpoint resumed the second time wasn't there.
>>>>>>>>>>
>>>>>>>>>> Once I even saw it jump forward: the first kill is at 10 and it resumes at 9, the second kill is at 70 and it resumes at 9, and the third kill is at 15 but it resumes at 69, as if it resumed from the second kill's checkpoint.
>>>>>>>>>>
>>>>>>>>>> This is clearly inconsistent.
>>>>>>>>>>
>>>>>>>>>> Also, in the logs I can see that sometimes it uses a checkpoint file different from the one of the previous, consistent resume.
>>>>>>>>>>
>>>>>>>>>> What am I doing wrong? Is it a known bug?

> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit

--
BR,
Stefano Baghino

Software Engineer @ Radicalbit
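P.S. For completeness: which checkpoint a JobManager actually restored shows up in its log, so once log aggregation works, something along these lines narrows it down quickly (the application id is a placeholder):

    # fetch the aggregated container logs of the killed/finished YARN application
    yarn logs -applicationId <application-id> > containers.log

    # keep only the checkpoint-store lines ("Found X checkpoints in ZooKeeper", ...)
    grep ZooKeeperCompletedCheckpointStore containers.log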