Hi all,

We tried following the recovery instructions from http://aurora.apache.org/documentation/latest/operations/backup-restore/
After our change from the Twitter commons ZK library to Apache Curator, these instructions are no longer valid. For Aurora to carry out a leader election in its current state, it first has to connect to a Mesos master. What we ended up doing was connecting to a Mesos master that had nothing on it to bypass this new requirement.

Next, wiping away -native_log_file_path did not seem to be enough to recover from a corrupted Mesos replicated log. We had to manually wipe away entries in ZK and move the snapshot backup directory so that the leader would not fall back on either a snapshot or the Mesos log to rehydrate.

Finally, triggering a manual snapshot somehow generated a snapshot with an invalid entry, which then caused the scheduler to fail after a failover while trying to catch up on current state. We are still investigating why this took place (it could be that we didn't give the system enough time to finish hydrating the snapshot), but the invalid entry, which looked something like a Task with all null or 0 values, caused our leaders to fail, which necessitated restoring from an earlier snapshot. Note that this happened only after we triggered the manual snapshot and BEFORE we tried to restore.

Will report more details as they become available and will provide some doc updates based on our experience.

-Renan