Hi all,

We tried following the recovery instructions from
http://aurora.apache.org/documentation/latest/operations/backup-restore/

After our change from the Twitter commons ZK library to Apache Curator,
these instructions are no longer valid.

In order for Aurora to carry out a leader election in the current state,
Aurora first has to connect to a Mesos master. What we ended up doing was
connecting to a Mesos master that had nothing on it, to bypass this new
requirement.

Next, wiping away the directory given by -native_log_file_path did not seem
to be enough to recover from a corrupted Mesos replicated log. We had to
manually wipe away entries in ZK and move the snapshot backup directory so
that the leader would not fall back on either a snapshot or the Mesos
replicated log to rehydrate itself.
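
For anyone repeating this, here is a minimal Python sketch of the "move
aside rather than delete" part, so the old data stays around for later
inspection. The move_aside helper and the paths in the usage note are
hypothetical placeholders, not Aurora defaults. It assumes the scheduler
is stopped first, and it does not touch the ZK entries.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def move_aside(path: str) -> Optional[str]:
    """Rename `path` to `<path>.bak-<timestamp>` instead of deleting it.

    Returns the new location, or None if `path` does not exist.
    Hypothetical helper for illustration; not part of Aurora.
    """
    src = Path(path)
    if not src.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = src.with_name(f"{src.name}.bak-{stamp}")
    if dest.exists():
        # Refuse to merge into an earlier backup made in the same second.
        raise FileExistsError(f"backup target already exists: {dest}")
    shutil.move(str(src), str(dest))
    return str(dest)
```

Usage would look like move_aside("/path/to/native_log") followed by
move_aside("/path/to/backup_dir"), substituting whatever
-native_log_file_path and the backup directory are set to on your
scheduler hosts.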

Finally, triggering a manual snapshot somehow generated a snapshot with an
invalid entry, which then caused the scheduler to fail after a failover
while trying to catch up on current state.

We are trying to investigate why this took place (it could have been that
we didn't give the system enough time to finish hydrating the snapshot).
The invalid entry, which looked something like a Task with all null or 0
values, caused our leaders to fail, which necessitated restoring from an
earlier snapshot. Note that this happened only after we triggered the
manual snapshot and BEFORE we tried to restore.

Will report more details as they become available and will provide some doc
updates based on our experience.

-Renan
