Just to close the loop on this ordeal… I started by clearing /hbase/splitWAL in ZK and restarting all the RS and the HM. This didn’t change anything.
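In case it's useful to anyone retracing this, that first attempt amounted to roughly the following. This is a sketch, not the exact commands: it only prints the steps (so nothing here needs a live cluster), /hbase/splitWAL assumes the default zookeeper.znode.parent of /hbase, and the daemon script names assume a stock tarball install.

```shell
# Hedged sketch of the first recovery attempt: emit the commands so they can
# be reviewed (or piped to sh on a real cluster) instead of running them here.
emit_recovery_cmds() {
  local znode="${1:-/hbase/splitWAL}"   # default split-WAL task znode
  # Drop the stale WAL-splitting task znodes (rmr deletes recursively).
  echo "echo 'rmr ${znode}' | hbase zkcli"
  # Then bounce every region server, and finally the HMaster.
  echo "hbase-daemon.sh restart regionserver   # on each RS"
  echo "hbase-daemon.sh restart master"
}

emit_recovery_cmds
```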
On Wed, Jan 27, 2016 at 8:42 AM, tsuna <[email protected]> wrote:
> 16/01/27 16:33:39 INFO namenode.FSNamesystem: Recovering [Lease.
> Holder: DFSClient_NONMAPREDUCE_174538359_1, pendingcreates: 2],
> src=/hbase/WALs/r12s1.sjc.aristanetworks.com,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
> 16/01/27 16:33:39 WARN BlockStateChange: BLOCK*
> BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
> removed.

I then ran `hdfs fsck -move` to make sure that all the files with lost blocks were moved to /lost+found. On its own this obviously didn't help HBase because, as I stated earlier, only one WAL had actually lost a block; 94% of the lost blocks affected the HFile of one of the regions. Yet somehow the error above appeared for every single region server, and I ended up having to move more WAL files to /lost+found manually:

foo@r12s3:~/hadoop-2.7.1$ ./bin/hdfs dfs -ls /lost+found
Found 15 items
drwxr--r--   - foo supergroup          0 2016-01-28 05:56 /lost+found/hbase
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:33 /lost+found/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:29 /lost+found/r12s10.sjc.aristanetworks.com%2C9104%2C1452811286704.default.1453728581434
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:31 /lost+found/r12s11.sjc.aristanetworks.com%2C9104%2C1452811286222.default.1453728710303
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:30 /lost+found/r12s13.sjc.aristanetworks.com%2C9104%2C1452811287287.default.1453728621698
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:25 /lost+found/r12s14.sjc.aristanetworks.com%2C9104%2C1452811286288.default.1453728336644
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:25 /lost+found/r12s15.sjc.aristanetworks.com%2C9104%2C1453158959800.default.1453728342559
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:26 /lost+found/r12s16.sjc.aristanetworks.com%2C9104%2C1452811286456.default.1453728374800
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:22 /lost+found/r12s2.sjc.aristanetworks.com%2C9104%2C1452811286448.default.1453728137282
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:26 /lost+found/r12s3.sjc.aristanetworks.com%2C9104%2C1452811286093.default.1453728393926
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:35 /lost+found/r12s4.sjc.aristanetworks.com%2C9104%2C1452811289547.default.1453728949397
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:30 /lost+found/r12s5.sjc.aristanetworks.com%2C9104%2C1452811125084.default.1453728624262
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:28 /lost+found/r12s6.sjc.aristanetworks.com%2C9104%2C1452811286154.default.1453728483550
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:28 /lost+found/r12s7.sjc.aristanetworks.com%2C9104%2C1452811287528.default.1453728528180
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:22 /lost+found/r12s8.sjc.aristanetworks.com%2C9104%2C1452811287196.default.1453728125912

After doing this and restarting the HMaster, everything came back up fine. I don't know whether doing this caused any additional data loss. This is a dev cluster, so data loss isn't a big deal here, but if I were to run into this issue in production, I would certainly be very nervous about the whole situation.

This might turn into more of an HDFS question at this point, so I'm Cc'ing hdfs-user@ just in case anybody has anything to say there. We're going to upgrade to Hadoop 2.7.2 soon, just in case.

-- 
Benoit "tsuna" Sigoure
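P.S. for the archives: the manual moves boiled down to something like this little helper. It's a hypothetical sketch, not what I literally typed: it reads one HDFS path per line and prints the corresponding `hdfs dfs -mv` into /lost+found, so the commands can be reviewed (and piped to sh on the cluster) rather than executed blind; the example path is reconstructed from the listing above.

```shell
# Hedged sketch of the manual cleanup: for each WAL path with missing blocks
# (one per line on stdin), emit the command that moves it aside. Printing
# instead of executing keeps this reviewable and keeps it cluster-free here.
move_wals_to_lost_found() {
  while IFS= read -r wal; do
    [ -n "$wal" ] || continue
    printf 'hdfs dfs -mv %s /lost+found/\n' "$wal"
  done
}

# Example: the r12s2 WAL from the listing above, still in its -splitting dir.
printf '%s\n' \
  '/hbase/WALs/r12s2.sjc.aristanetworks.com,9104,1452811286448-splitting/r12s2.sjc.aristanetworks.com%2C9104%2C1452811286448.default.1453728137282' \
  | move_wals_to_lost_found
```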
