Resending to [email protected] now that I’m subscribed to that list.
On Thu, Jan 28, 2016 at 10:50 AM, tsuna <[email protected]> wrote:
> Just to close the loop on this ordeal…
>
> I started by clearing /hbase/splitWAL in ZK and restarting all the RS and
> the HM.  This didn’t change anything.
>
> On Wed, Jan 27, 2016 at 8:42 AM, tsuna <[email protected]> wrote:
> > 16/01/27 16:33:39 INFO namenode.FSNamesystem: Recovering [Lease.  Holder:
> > DFSClient_NONMAPREDUCE_174538359_1, pendingcreates: 2],
> > src=/hbase/WALs/r12s1.sjc.aristanetworks.com,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
> > 16/01/27 16:33:39 WARN BlockStateChange: BLOCK*
> > BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease removed.
>
> I ran hdfs fsck -move to make sure that all the files that had lost blocks
> were moved to /lost+found.  This obviously didn’t help HBase because, as I
> stated earlier, only one WAL had lost a block and 94% of the lost blocks
> affected the HFile of one of the regions.
>
> Yet, somehow, the error above appeared for every single one of the region
> servers, and I ended up having to move more WAL files manually to
> /lost+found:
>
> foo@r12s3:~/hadoop-2.7.1$ ./bin/hdfs dfs -ls /lost+found
> Found 15 items
> drwxr--r--   - foo supergroup          0 2016-01-28 05:56 /lost+found/hbase
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:33 /lost+found/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:29 /lost+found/r12s10.sjc.aristanetworks.com%2C9104%2C1452811286704.default.1453728581434
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:31 /lost+found/r12s11.sjc.aristanetworks.com%2C9104%2C1452811286222.default.1453728710303
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:30 /lost+found/r12s13.sjc.aristanetworks.com%2C9104%2C1452811287287.default.1453728621698
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:25 /lost+found/r12s14.sjc.aristanetworks.com%2C9104%2C1452811286288.default.1453728336644
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:25 /lost+found/r12s15.sjc.aristanetworks.com%2C9104%2C1453158959800.default.1453728342559
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:26 /lost+found/r12s16.sjc.aristanetworks.com%2C9104%2C1452811286456.default.1453728374800
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:22 /lost+found/r12s2.sjc.aristanetworks.com%2C9104%2C1452811286448.default.1453728137282
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:26 /lost+found/r12s3.sjc.aristanetworks.com%2C9104%2C1452811286093.default.1453728393926
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:35 /lost+found/r12s4.sjc.aristanetworks.com%2C9104%2C1452811289547.default.1453728949397
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:30 /lost+found/r12s5.sjc.aristanetworks.com%2C9104%2C1452811125084.default.1453728624262
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:28 /lost+found/r12s6.sjc.aristanetworks.com%2C9104%2C1452811286154.default.1453728483550
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:28 /lost+found/r12s7.sjc.aristanetworks.com%2C9104%2C1452811287528.default.1453728528180
> -rw-r--r--   3 foo supergroup         83 2016-01-25 13:22 /lost+found/r12s8.sjc.aristanetworks.com%2C9104%2C1452811287196.default.1453728125912
>
> After doing this and restarting the HMaster, everything came back up fine.
> I don’t know whether doing this caused any additional data loss.  This is a
> dev cluster, so data loss isn’t a big deal, but if I were to run into this
> issue in production, I would certainly be very nervous about this whole
> situation.
>
> This might turn more into an HDFS question at this point, so I’m Cc’ing
> hdfs-user@ just in case anybody has anything to say there.
>
> We’re going to upgrade to Hadoop 2.7.2 soon, just in case.
>
> --
> Benoit "tsuna" Sigoure
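
For anyone who hits the same thing: the /hbase/splitWAL znode mentioned
above can be cleared with the ZooKeeper shell that ships with HBase.  This
is only a sketch and assumes the default zookeeper.znode.parent of /hbase;
adjust the path if your cluster overrides it.

  # open the ZooKeeper CLI bundled with HBase
  ./bin/hbase zkcli
  # at the zkCli prompt, recursively delete the split-WAL coordination znode
  rmr /hbase/splitWAL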
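
The fsck step was along these lines (run against the filesystem root here
purely for illustration).  Note that -move only relocates files with lost
blocks under /lost+found; it does not recover the missing data.

  # list files that still have missing or corrupt blocks
  ./bin/hdfs fsck / -list-corruptfileblocks
  # move any file with lost blocks to /lost+found
  ./bin/hdfs fsck / -move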
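
And moving a stuck WAL out of the way by hand looked roughly like this; the
server name, port, startcode and WAL file name below are placeholders, not
the actual paths from my cluster.

  # move the offending WAL out of the dead server's -splitting directory
  ./bin/hdfs dfs -mv \
      '/hbase/WALs/<regionserver>,<port>,<startcode>-splitting/<wal-file>' \
      /lost+found/

Moving a WAL aside like this skips log splitting for that server, so any
edits that had not yet been flushed to HFiles are discarded, which is where
the additional data loss I mentioned above could come from.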
