Slow recovery on lost data node?

Juhani Connolly Wed, 08 Dec 2010 00:55:28 -0800

Hi there,

We're currently running a cluster under expected load and testing various hardware failure cases. Among them is a lost regionServer/dataNode, which results in our writer process (in our case a servlet under Tomcat) just waiting indefinitely on put flushes until the region becomes available again (in the process the threads stack up until the server limit). I've included logs of the relevant time period from one of my regionservers at http://pastie.org/1358217 .

During the 15 minutes from around 16:12 to 16:27 all writes failed. Incidentally, during this time I am still able to read data fine with another process which is only reading from hbase.

Is this 15-minute period of being unable to write working as intended, or is something wrong with the way I'm trying to access hbase? The main access code I'm using can be seen at http://pastie.org/1358224 . tPool is an initialised HTablePool, and the general idea is to store puts without flushing until they have been held onto for a while (to batch the flushes a little bit).
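
For readers without access to the pastie link, the batching idea described above can be sketched roughly as follows. This is not the code from the link: it is a minimal illustration in plain Java (8 or later) with no HBase dependency, and the names TimedBatcher, maxHoldMs and flusher are hypothetical. The flusher callback stands in for whatever actually writes the buffered puts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers items and hands them to a flusher once the
// oldest buffered item has been held for at least maxHoldMs.
public class TimedBatcher<T> {
    private final long maxHoldMs;
    private final Consumer<List<T>> flusher; // stand-in for the real write/flush call
    private final List<T> pending = new ArrayList<>();
    private long oldestMs = -1;

    public TimedBatcher(long maxHoldMs, Consumer<List<T>> flusher) {
        this.maxHoldMs = maxHoldMs;
        this.flusher = flusher;
    }

    // nowMs is passed in explicitly so the behaviour is easy to test.
    public synchronized void add(T item, long nowMs) {
        if (pending.isEmpty()) {
            oldestMs = nowMs; // remember when the current batch started
        }
        pending.add(item);
        if (nowMs - oldestMs >= maxHoldMs) {
            flush();
        }
    }

    public synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        flusher.accept(batch); // runs while holding the lock
    }

    public synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        TimedBatcher<String> batcher =
            new TimedBatcher<>(100, batch -> System.out.println("flushing " + batch.size() + " items"));
        batcher.add("put-1", 0);
        batcher.add("put-2", 50);
        batcher.add("put-3", 100); // oldest item is now 100 ms old, so this triggers a flush
    }
}
```

One thing this sketch makes visible: because the flusher runs while the lock is held, a flush that blocks will also block every other thread that tries to add, which matches the thread pile-up described in this mail.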

If it is working as intended, what would be the correct steps to reduce it (perhaps reducing configuration for region sizes)?

Is there anything I can do to just make the writes fail when the region isn't available for writing? As is, threads keep getting generated till the container max is reached, waiting for something (presumably the region to become available again?). I expected that hbase.client.retries.number would be appropriate, but based on the lack of any logs for failed writes, the current writes simply aren't aborting.
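
To make the "fail instead of wait" behaviour concrete, here is a minimal sketch of bounding a blocking call with a timeout on the caller's side. It uses only java.util.concurrent (Java 8 or later) and does not touch HBase; BoundedFlush and flushWithTimeout are hypothetical names, and the Callable stands in for the real flush. Whether the underlying client call actually stops when its thread is interrupted is an assumption that would need checking, so this bounds how long the servlet thread waits, not necessarily the work itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedFlush {
    // Daemon worker threads, so a stuck flush cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "flush-worker");
        t.setDaemon(true);
        return t;
    });

    // Runs flushCall on a worker thread and waits at most timeoutMs for it.
    // Returns true if it completed in time, false if it timed out or threw.
    public static boolean flushWithTimeout(Callable<Void> flushCall, long timeoutMs)
            throws InterruptedException {
        Future<Void> future = POOL.submit(flushCall);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Request interruption of the worker; the wrapped call may or may not honour it.
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the flush itself failed
        }
    }

    public static void main(String[] args) throws Exception {
        // A flush that returns immediately.
        boolean fast = flushWithTimeout(() -> null, 1000);
        // A flush that hangs, standing in for a region that is unavailable.
        boolean stuck = flushWithTimeout(() -> { Thread.sleep(60_000); return null; }, 200);
        System.out.println("fast=" + fast + " stuck=" + stuck); // prints fast=true stuck=false
    }
}
```

With something like this the request thread gets a failure after a fixed wait and can return an error to the client. The worker pool here is unbounded, so a real deployment would presumably also want a cap on it, otherwise the pile-up just moves from the container threads to the worker threads.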

Everything is running off the latest CDH3 (hbase-0.89.20100924+28, hadoop-0.20.2+737-core) and works well under normal conditions.


Any advice/information would be appreciated.
Thanks,
 Juhani
