What kind of documentation can we put in the user manual about this? Should we recommend decommissioning only one rack at a time until we get the issue sorted out in Hadoop-land?

dlmar...@comcast.net wrote:
BLUF: There is a possibility of data loss when performing DataNode 
decommissioning while Accumulo is running. This note applies to installations of 
Accumulo 1.5.0+ and Hadoop 2.5.0+.

DETAILS: During DataNode decommissioning, the NameNode can report stale block 
locations (HDFS-8208). If Accumulo is running during this process, files that 
are currently being written may not close properly. Accumulo is affected in 
two ways:

1. During compactions, temporary rfiles are created, written, closed, and then 
renamed into place. If a failure happens during the close, the compaction fails 
(see the sketch after this list).
2. Write-ahead log files are created, written to, and then closed. If a failure 
happens during the close, the NameNode is left with a walog file that has no 
finalized blocks.
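
For reference, the create/close/rename pattern in case 1 looks roughly like the 
following against the HDFS FileSystem API. This is only a sketch with made-up 
paths and data, not Accumulo's actual compaction code; the point is that if 
close() throws, the temporary file is never finalized and the rename never 
happens:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CompactionCloseSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Hypothetical paths, for illustration only.
        Path tmp = new Path("/accumulo/tables/1/t-0001/C0001.rf_tmp");
        Path dst = new Path("/accumulo/tables/1/t-0001/C0001.rf");

        FSDataOutputStream out = fs.create(tmp);
        out.writeBytes("compacted key/value data");

        // If close() fails here (e.g., the pipeline includes a
        // decommissioning DataNode and the NameNode reported stale
        // block locations), the file is left open for write and the
        // compaction fails; the rename below never runs.
        out.close();

        // Only after a successful close is the temporary rfile
        // renamed into place.
        fs.rename(tmp, dst);
      }
    }

The walog case (2) is the same pattern minus the rename: if the close fails, 
the file stays open for write with no finalized blocks.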

If either of these cases happens, decommissioning of the DataNode can hang 
(HDFS-3599, HDFS-5579) because the files are left in an open-for-write state. 
If Accumulo needs the write-ahead log for recovery, it will be unable to read 
the file and will not recover.
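
If you suspect you have hit this, one way to see which files are still open for 
write is fsck, e.g. "hdfs fsck /accumulo -openforwrite -files" (the path is 
just an example; point it at wherever your instance keeps its files). Files 
stuck open for write are the ones that can keep decommissioning from finishing.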

RECOMMENDATION: Assuming the replication pipeline for the write-ahead log is 
working properly, you should not run into this issue if you decommission only 
one rack at a time.
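
In practice that means decommissioning in batches no larger than a rack: add 
that rack's DataNodes to the exclude file referenced by dfs.hosts.exclude, run 
"hdfs dfsadmin -refreshNodes", wait for the NameNode to report those nodes as 
Decommissioned, and only then start on the next rack. Assuming rack awareness 
is configured and the default 3x replication is in use, the block placement 
policy keeps at least one replica of each block (including walog blocks) on a 
rack other than the one being decommissioned.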
