Just to close the loop on this ordeal… I started by clearing /hbase/splitWAL in ZK and restarting all the RS and the HM. This didn’t change anything.
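In case it's useful to anyone retracing this, that first attempt amounted to roughly the following. This is a sketch, not the exact commands: it only prints the steps (so nothing here needs a live cluster), /hbase/splitWAL assumes the default zookeeper.znode.parent of /hbase, and the daemon script names assume a stock tarball install.

```shell
# Hedged sketch of the first recovery attempt: emit the commands so they can
# be reviewed (or piped to sh on a real cluster) instead of running them here.
emit_recovery_cmds() {
  local znode="${1:-/hbase/splitWAL}"   # default split-WAL task znode
  # Drop the stale WAL-splitting task znodes (rmr deletes recursively).
  echo "echo 'rmr ${znode}' | hbase zkcli"
  # Then bounce every region server, and finally the HMaster.
  echo "hbase-daemon.sh restart regionserver   # on each RS"
  echo "hbase-daemon.sh restart master"
}

emit_recovery_cmds
```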
On Wed, Jan 27, 2016 at 8:42 AM, tsuna <[email protected]> wrote:
> 16/01/27 16:33:39 INFO namenode.FSNamesystem: Recovering [Lease.
> Holder: DFSClient_NONMAPREDUCE_174538359_1, pendingcreates: 2],
> src=/hbase/WALs/r12s1.sjc.aristanetworks.com,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
> 16/01/27 16:33:39 WARN BlockStateChange: BLOCK*
> BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
> removed.

I then ran `hdfs fsck -move` to make sure that all the files with lost blocks were moved to /lost+found. On its own this obviously didn't help HBase because, as I stated earlier, only one WAL had actually lost a block; 94% of the lost blocks affected the HFile of one of the regions. Yet somehow the error above appeared for every single region server, and I ended up having to move more WAL files to /lost+found manually:

foo@r12s3:~/hadoop-2.7.1$ ./bin/hdfs dfs -ls /lost+found
Found 15 items
drwxr--r--   - foo supergroup          0 2016-01-28 05:56 /lost+found/hbase
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:33 /lost+found/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:29 /lost+found/r12s10.sjc.aristanetworks.com%2C9104%2C1452811286704.default.1453728581434
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:31 /lost+found/r12s11.sjc.aristanetworks.com%2C9104%2C1452811286222.default.1453728710303
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:30 /lost+found/r12s13.sjc.aristanetworks.com%2C9104%2C1452811287287.default.1453728621698
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:25 /lost+found/r12s14.sjc.aristanetworks.com%2C9104%2C1452811286288.default.1453728336644
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:25 /lost+found/r12s15.sjc.aristanetworks.com%2C9104%2C1453158959800.default.1453728342559
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:26 /lost+found/r12s16.sjc.aristanetworks.com%2C9104%2C1452811286456.default.1453728374800
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:22 /lost+found/r12s2.sjc.aristanetworks.com%2C9104%2C1452811286448.default.1453728137282
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:26 /lost+found/r12s3.sjc.aristanetworks.com%2C9104%2C1452811286093.default.1453728393926
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:35 /lost+found/r12s4.sjc.aristanetworks.com%2C9104%2C1452811289547.default.1453728949397
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:30 /lost+found/r12s5.sjc.aristanetworks.com%2C9104%2C1452811125084.default.1453728624262
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:28 /lost+found/r12s6.sjc.aristanetworks.com%2C9104%2C1452811286154.default.1453728483550
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:28 /lost+found/r12s7.sjc.aristanetworks.com%2C9104%2C1452811287528.default.1453728528180
-rw-r--r--   3 foo supergroup         83 2016-01-25 13:22 /lost+found/r12s8.sjc.aristanetworks.com%2C9104%2C1452811287196.default.1453728125912

After doing this and restarting the HMaster, everything came back up fine. I don't know whether doing this caused any additional data loss. This is a dev cluster, so data loss isn't a big deal here, but if I were to run into this issue in production, I would certainly be very nervous about the whole situation.

This might turn into more of an HDFS question at this point, so I'm Cc'ing hdfs-user@ just in case anybody has anything to say there. We're going to upgrade to Hadoop 2.7.2 soon, just in case.

-- 
Benoit "tsuna" Sigoure
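P.S. for the archives: the manual moves boiled down to something like this little helper. It's a hypothetical sketch, not what I literally typed: it reads one HDFS path per line and prints the corresponding `hdfs dfs -mv` into /lost+found, so the commands can be reviewed (and piped to sh on the cluster) rather than executed blind; the example path is reconstructed from the listing above.

```shell
# Hedged sketch of the manual cleanup: for each WAL path with missing blocks
# (one per line on stdin), emit the command that moves it aside. Printing
# instead of executing keeps this reviewable and keeps it cluster-free here.
move_wals_to_lost_found() {
  while IFS= read -r wal; do
    [ -n "$wal" ] || continue
    printf 'hdfs dfs -mv %s /lost+found/\n' "$wal"
  done
}

# Example: the r12s2 WAL from the listing above, still in its -splitting dir.
printf '%s\n' \
  '/hbase/WALs/r12s2.sjc.aristanetworks.com,9104,1452811286448-splitting/r12s2.sjc.aristanetworks.com%2C9104%2C1452811286448.default.1453728137282' \
  | move_wals_to_lost_found
```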
