I need to study better but it looks like stuck on the NN trying to recover
lease. The WAL is missing blocks and NN lease recovery is skipping out.
St.Ack

On Wed, Jan 27, 2016 at 12:44 PM, tsuna <[email protected]> wrote:

> Some NN logs attached.
>
> On Wed, Jan 27, 2016 at 8:42 AM, tsuna <[email protected]> wrote:
> > On Wed, Jan 27, 2016 at 3:31 AM, Stack <[email protected]> wrote:
> >> Can you try what Samir suggests? Can replay moved-aside logs using
> >> WALPlayer after cluster comes up.
> >
> > OK I’ll shut down the HM and all RS’es and clear the data in ZK and
> > restart everything and see how things go.
> >
> >> I see that at first we are making progress through the WALs ("... total
> >> tasks = 7..." ... it was 9 in your master log) but maybe they are
> getting
> >> stuck on particular WALs.  The stuck Master has same task list over
> last 24
> >> hours?
> >
> > It’s progressing veeerrryyyy slooowly.  I now have 34 regions online
> > (that’s 3 more than 10h ago! :P)
> >
> >> On the RS log, I see you are stuck trying to recover 1453728374800
> >>
> >> We are trying to recover the lease on the file from NN. Below looks bad.
> >> Over a minute and still not lease recovery.
> >>
> >> 140 2016-01-27 05:53:10,338 INFO  [RS_LOG_REPLAY_OPS-r12s4:9104-0]
> >> util.FSHDFSUtils: recoverLease=false, attempt=2 on file=hdfs://
> >>
> 0.namenode.sjc.aristanetworks.com:8820/hbase/WALs/r12s16.sjc.aristanetworks.com,9104,1452811286456-splitting/r12s16.sjc.aristanetworks.com%2C910
> >>    4%2C1452811286456.default.1453728374800 after 68218ms
> >>
> >> NN showing us stuck on lease recovery?
> >>
> >> NN went down during power outage?
> >
> > Yes, everything went down.
> >
> > The NN regularly shows this:
> >
> > 16/01/27 16:33:39 INFO namenode.FSNamesystem: recoverLease: [Lease.
> > Holder: DFSClient_NONMAPREDUCE_174538359_1, pendingcreates: 2],
> > src=/hbase/WALs/r12s1.sjc.aristanetworks.com
> ,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com
> %2C9104%2C1452811288618.default.1453728791276
> > from client DFSClient_NONMAPREDUCE_174538359_1
> > 16/01/27 16:33:39 INFO namenode.FSNamesystem: Recovering [Lease.
> > Holder: DFSClient_NONMAPREDUCE_174538359_1, pendingcreates: 2],
> > src=/hbase/WALs/r12s1.sjc.aristanetworks.com
> ,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com
> %2C9104%2C1452811288618.default.1453728791276
> > 16/01/27 16:33:39 WARN BlockStateChange: BLOCK*
> > BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease
> > removed.
> > 16/01/27 16:33:39 WARN hdfs.StateChange: DIR*
> > NameSystem.internalReleaseLease: File
> > /hbase/WALs/r12s1.sjc.aristanetworks.com,9104,1452811288618-splitting/
> r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
> > has not been closed. Lease recovery is in progress. RecoveryId =
> > 1122295 for block blk_1074847505_1108430{UCState=UNDER_RECOVERY,
> > truncateBlock=null, primaryNodeIndex=-1, replicas=[]}
> >
> > I should perhaps mention at this point that as a result of the power
> > outage we lost a bunch of hard disk drives, and this caused us to
> > permanently lose approximately 0.3% of the blocks in HDFS.  I know
> > that at least one of WALs was affected by data loss
> > (/hbase/WALs/r12s12.sjc.aristanetworks.com,9104,1452811287315-splitting/
> r12s12.sjc.aristanetworks.com
> %2C9104%2C1452811287315.default.1453724924825).
> > Might be worth mentioning also that 94% of the 0.3% blocks we lost are
> > all from one unlucky region.
> >
> >> The BadVersion is probably because task has been assumed by another
> because
> >> lease recovery taking too long.
> >>
> >> I see 1453728374800 is still in the list you paste from Master later in
> >> this thread.
> >>
> >> Try moving it aside. Might have to restart master for it to notice the
> >> move-aside... Try not restarting first.
> >
> > --
> > Benoit "tsuna" Sigoure
>
>
>
> --
> Benoit "tsuna" Sigoure
>

Reply via email to