Repository: accumulo
Updated Branches:
  refs/heads/1.7 6b2e430dc -> ddc6203ad
  refs/heads/1.8 cf6c0ff09 -> bca75d356
  refs/heads/master 05496627c -> 92c45a896

ACCUMULO-4627 Add corrupt WAL recovery instructions to user manual

Signed-off-by: Josh Elser <els...@apache.org>

Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/ddc6203a
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/ddc6203a
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/ddc6203a

Branch: refs/heads/1.7
Commit: ddc6203ad0e5ca9bbe553b5bad1f2498af634a7e
Parents: 6b2e430
Author: Sean Busbey <bus...@cloudera.com>
Authored: Thu Apr 20 22:39:56 2017 -0400
Committer: Josh Elser <els...@apache.org>
Committed: Thu Apr 20 22:42:42 2017 -0400

----------------------------------------------------------------------
 .../main/asciidoc/chapters/troubleshooting.txt | 30 +++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/accumulo/blob/ddc6203a/docs/src/main/asciidoc/chapters/troubleshooting.txt
----------------------------------------------------------------------
diff --git a/docs/src/main/asciidoc/chapters/troubleshooting.txt b/docs/src/main/asciidoc/chapters/troubleshooting.txt
index cd2923c..359ed67 100644
--- a/docs/src/main/asciidoc/chapters/troubleshooting.txt
+++ b/docs/src/main/asciidoc/chapters/troubleshooting.txt
@@ -666,6 +666,35 @@ original and the new instances, but it can serve as a reference.
 rfiles to allow references in the metadata table and in the tablet servers to be resolved. Rebuild the metadata table if the corrupt files are metadata files.
+*Write-Ahead Log (WAL) File Corruption*
+
+In certain versions of Accumulo, a corrupt WAL file (caused by HDFS corruption
+or a bug in Accumulo that created the file) can block the successful recovery
+of one or more tablets. Accumulo can be stuck in a loop trying to recover the
+WAL file, never being able to succeed.
+
+In cases where the WAL file's original contents are unrecoverable or some degree
+of data loss is acceptable (beware if the WAL file contains updates to the Accumulo
+metadata table!), the following process can be followed to create a valid, empty
+WAL file. Run the following commands as the Accumulo Unix user (to ensure the
+proper file permissions in HDFS):
+
+  $ echo -n -e '--- Log File Header (v2) ---\x00\x00\x00\x00' > empty.wal
+
+The above creates a file with the text "--- Log File Header (v2) ---" followed by
+four null bytes. You should verify the contents of the file with a hexdump tool.
+
+Then, place this empty WAL in HDFS and replace the corrupt WAL file in HDFS
+with the empty WAL.
+
+  $ hdfs dfs -moveFromLocal empty.wal /user/accumulo/empty.wal
+  $ hdfs dfs -mv /user/accumulo/empty.wal /accumulo/wal/tserver-4.example.com+10011/26abec5b-63e7-40dd-9fa1-b8ad2436606e
+
+After the corrupt WAL file has been replaced, the system should automatically recover.
+It may be necessary to restart the Accumulo Master process, as an exponential
+backoff policy is used, which could lead to a long wait before Accumulo will
+try to re-load the WAL file.
+
 [[zookeeper_failure]]
 #### ZooKeeper Failure
 *Q*: I lost my ZooKeeper quorum (hardware failure), but HDFS is still intact. How can I recover my Accumulo instance?
@@ -765,4 +794,3 @@ For example, if you see multiple files with +M+ prefixes, the tablet is, or was,
 maximum file limit, so it began merging memory updates with files to keep the file count reasonable. This slows down ingest performance, so knowing there are many files like this tells you that the system is struggling to keep up with ingest vs the compaction strategy which reduces the number of files.
-
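
The added manual text asks the reader to check the empty WAL with a hexdump tool but does not show how. The following is a minimal sketch of one way to do it, not part of the commit. It assumes a POSIX shell with printf, wc, tail, tr, and od available. The WAL_FILE variable is a hypothetical name introduced here, and printf is swapped in for the commit's `echo -n -e` because echo's handling of those flags differs between shells.

```shell
#!/bin/sh
set -eu

# Hypothetical override; the commit writes to empty.wal in the current directory.
WAL_FILE="${WAL_FILE:-empty.wal}"
HEADER='--- Log File Header (v2) ---'

# The header is passed as an argument rather than placed in the format string,
# because a format string starting with "-" can be mistaken for an option.
printf '%s\000\000\000\000' "$HEADER" > "$WAL_FILE"

# 28 header characters plus 4 null bytes should give 32 bytes.
size=$(wc -c < "$WAL_FILE" | tr -d ' ')
if [ "$size" -ne 32 ]; then
  echo "unexpected size: $size bytes" >&2
  exit 1
fi

# The last four bytes must all be 0x00.
trailer=$(tail -c 4 "$WAL_FILE" | od -An -tx1 | tr -d ' \n')
if [ "$trailer" != "00000000" ]; then
  echo "unexpected trailer: $trailer" >&2
  exit 1
fi

# Full character dump for a visual check.
od -c "$WAL_FILE"
```

If both checks pass, the file matches what the manual text describes and can be moved into HDFS with the hdfs commands from the diff.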