This is the log filtered to check messages from ZooKeeperCompletedCheckpointStore.
https://gist.github.com/chobeat/0222b31b87df3fa46a23

It looks like it finds only a checkpoint but I'm not sure if the different
hashes and IDs of the checkpoints are meaningful or not.

2016-03-16 15:33 GMT+01:00 Ufuk Celebi <u...@apache.org>:

> Can you please have a look into the JobManager log file and report
> which checkpoints are restored? You should see messages from
> ZooKeeperCompletedCheckpointStore like:
> - Found X checkpoints in ZooKeeper
> - Initialized with X. Removing all older checkpoints
>
> You can share the complete job manager log file as well if you like.
>
> – Ufuk
>
> On Wed, Mar 16, 2016 at 2:50 PM, Simone Robutti
> <simone.robu...@radicalbit.io> wrote:
> > Hello,
> >
> > I'm testing the checkpointing functionality with hdfs as a backend.
> >
> > For what I can see it uses different checkpointing files and resume the
> > computation from different points and not from the latest available.
> > This is to me an unexpected behaviour.
> >
> > I log every second, for every worker, a counter that is increased by 1
> > at each step.
> >
> > So for example on node-1 the count goes up to 5, then I kill a job
> > manager or task manager and it resumes from 5 or 4 and it's ok. The next
> > time I kill a job manager the count is at 15 and it resumes at 14 or 15.
> > Sometimes it may happen that at a third kill the work resumes at 4 or 5
> > as if the checkpoint resumed the second time wasn't there.
> >
> > Once I even saw it jump forward: the first kill is at 10 and it resumes
> > at 9, the second kill is at 70 and it resumes at 9, the third kill is at
> > 15 but it resumes at 69 as if it resumed from the second kill checkpoint.
> >
> > This is clearly inconsistent.
> >
> > Also, in the logs I can find that sometimes it uses a checkpoint file
> > different from the previous, consistent resume.
> >
> > What am I doing wrong? Is it a known bug?
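
For anyone who wants to reproduce the filtering on their own JobManager log, something like the following should be roughly equivalent. It is only a sketch: the function name `filter_log` and the command-line usage are made up for illustration, and the only thing taken from this thread is the class name being matched.

```python
import sys
from typing import Iterable, Iterator

# Class name taken from the thread; everything else here is illustrative.
STORE_CLASS = "ZooKeeperCompletedCheckpointStore"


def filter_log(lines: Iterable[str], needle: str = STORE_CLASS) -> Iterator[str]:
    """Yield only the log lines that mention `needle`, without trailing newlines."""
    for line in lines:
        if needle in line:
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python filter_log.py <path-to-jobmanager-log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for hit in filter_log(log):
            print(hit)
```

Comparing the filtered lines across restarts should show which checkpoint the store reports each time.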