Thanks for the detailed scenario and analysis. I'm going to have a look. I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com times out); could you please send them directly to me?
Thanks,
Nicolas

On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <[email protected]> wrote:
> Hi Nicolas,
>
> Here is the failure scenario; I have dug up the logs.
>
> A machine fails and stops accepting/transmitting traffic. The HMaster
> starts the distributed split with 13 tasks. There are 12 region servers;
> 12 tasks succeed but the 13th one takes a very long time.
>
> The ZooKeeper timeout is set to 30 seconds and the stale node timeout to
> 20 seconds. Both patches are in place.
>
> a) The machine fails around 27:30.
>
> b) The master starts the split around 27:40 and submits the tasks. The
> one task that fails seems to be the one containing the WAL currently
> being written to:
>
> 2013-04-19 00:27:44,325 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141, length=0
>
> Basically this region server picks up the task but finds the length of
> this file to be 0 and drops it. This happens again.
>
> c) Finally another region server picks up the task, but it ends up going
> to the bad datanode (which should not happen because of the stale node
> timeout). Unfortunately it hits the 45 retries, with a connect timeout of
> 20 seconds, every time. This delays recovery significantly. I guess
> reducing the number of retries to 1 is one possibility, but the namenode
> logs are very interesting.
>
> d) The namenode seems to be stuck in a cyclic lease recovery loop until
> the node is marked dead. There is one last block which exhibits this:
>
> 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK* blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW], ReplicaUnderConstruction[10.156.192.106:50010|RBW], ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started, primary=10.156.192.106:50010
> 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141 has not been closed. Lease recovery is in progress. RecoveryId = 174413 for block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1, replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW], ReplicaUnderConstruction[10.156.192.106:50010|RBW], ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>
> I see this over and over again in the logs until the datanode is marked
> dead. It seems to be cycling through the replicas for this WAL block and
> trying to add it to the recovery list. I looked at the code and it says:
>
>     // Cannot close file right now, since the last block requires recovery.
>     // This may potentially cause infinite loop in lease recovery
>     // if there are no valid replicas on data-nodes.
>     NameNode.stateChangeLog.warn(
>         "DIR* NameSystem.internalReleaseLease: " +
>         "File " + src + " has not been closed." +
>         " Lease recovery is in progress. " +
>         "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
>     break;
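For what it's worth, the client-side way out of that loop is to keep asking the NameNode to recover the lease until the last block is finalized. A minimal sketch using DistributedFileSystem#recoverLease, assuming a CDH4-era HDFS client on the classpath; the class name, poll interval and error handling are illustrative, not the actual HBase splitter code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class WalLeaseRecovery {
      // Ask the NameNode to recover the lease on the WAL until the file is
      // closed and the length of its last block is finalized.
      // recoverLease() returns false while the last block is still
      // UNDER_RECOVERY, which is exactly the state logged above.
      static void recoverWalLease(Configuration conf, Path wal)
          throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) wal.getFileSystem(conf);
        while (!dfs.recoverLease(wal)) {
          Thread.sleep(1000); // illustrative back-off between attempts
        }
      }
    }

Until that call returns true, a splitter opening the file will keep seeing length=0, which would match the HLogSplitter line quoted above.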
" + > "RecoveryId = " + blockRecoveryId + " for block " + > lastBlock); > break; > > Eventually for this block, we get > > 2013-04-19 00:41:20,736 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > > commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_- > *5723958680970112840_174056*, newgenerationstamp=174413, > newlength=119148648, newtargets=[10.156.192.106:50010, 10.156.195.38:50010 > ], > closeFile=true, deleteBlock=false) > 2013-04-19 00:41:20,736 ERROR > org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block > (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056) > not found > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server > handler 35 on 8020, call > > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization > from 10.156.192.106:53271: error: java.io.IOException: Block > (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056) > not found > > On the datanode side, i see a call for recover blocks - I see that a write > pipeline is there, which gets terminated with some socket timeouts... > > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: NameNode > at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020 calls > > recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056, > targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010], > newGenerationStamp=174413) > > Not sure but this looks like a case where data could be lost ? > > Varun > > > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <[email protected]> > wrote: > > > Hey Varun, > > > > Could you please share the logs and the configuration (hdfs / hbase > > settings + cluster description). What's the failure scenario? > > From an HDFS pov, HDFS 3703 does not change the dead node status. But > these > > node will be given the lowest priority when reading. > > > > > > Cheers, > > > > Nicolas > > > > > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <[email protected]> > wrote: > > > > > Hi, > > > > > > We are facing problems with really slow HBase region server recoveries > ~ > > 20 > > > minuted. Version is hbase 0.94.3 compiled with hadoop.profile=2.0. > > > > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and > stale > > > node timeouts configured correctly. Time for dead node detection is > still > > > 10 minutes. > > > > > > We see that our region server is trying to read an HLog is stuck there > > for > > > a long time. Logs here: > > > > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed > to > > > connect to /10.156.194.251:50010 for file > > > > > > > > > /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599 > > > for block > > > > > > > > > BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException: > > > 15000 millis timeout while waiting for channel to be ready for read. > ch : > > > java.nio.channels.SocketChannel[connected local=/10.156.192.173:52818 > > > remote=/ > > > 10.156.194.251:50010] > > > > > > I would think that HDFS 3703 would make the server fail fast and go to > > the > > > third datanode. Currently, the recovery seems way too slow for > production > > > usage... > > > > > > Varun > > > > > >
