Question about WAL writes after region server "soft failures"

Nick Puz Fri, 07 Sep 2012 12:19:39 -0700

I'm new to HBase and HDFS and have a question about what happens when a failure is detected and a new region server takes over a region. If the old region server hasn't really failed and "comes back", will it still accept writes?

Here's a specific sequence of events:

1) Region R is currently being served by region server RS1.

2) RS1 hangs for some reason (long GC, network hiccup, etc.).

3) The master gets notified that RS1 is down, so it splits the logs and reassigns the region. Looking at the code, splitting the logs renames the log directory, so if RS1 tries to create a new log file it will fail.

4) Region server RS2 is assigned the region, replays the log, and all is well.

5) RS1 comes back to life.

After step 5 happens:

- If RS1 had in-flight requests, will it write them to the WAL and eventually flush the memtables?

- If it gets new requests, will it service them as long as it is still appending to the same block in the WAL file?
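To make the concern in these two questions concrete, here is a rough local-filesystem analogy in Python. It is only a sketch: it assumes POSIX rename semantics, the directory and file names (`rs1-logs`, `wal.0`, the `-splitting` suffix) are made up for illustration, and it does not model HDFS leases or whatever fencing HBase actually does. It shows that after a directory rename, creating a new file under the old path fails, while a handle opened before the rename can still append.

```python
import os
import tempfile


def simulate_soft_failure():
    """Local-filesystem analogy only (POSIX assumed); not HDFS or HBase behavior."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical log directory for RS1.
        log_dir = os.path.join(root, "rs1-logs")
        os.mkdir(log_dir)

        # RS1 has its current WAL file open and has written one edit.
        wal = open(os.path.join(log_dir, "wal.0"), "a")
        wal.write("edit-1\n")
        wal.flush()

        # Step 3: the master renames the log directory before splitting.
        split_dir = log_dir + "-splitting"
        os.rename(log_dir, split_dir)

        # Step 5: RS1 wakes up and tries to roll to a new log file.
        try:
            open(os.path.join(log_dir, "wal.1"), "a").close()
            new_log_created = True
        except FileNotFoundError:
            new_log_created = False

        # Appending through the handle opened before the rename still works.
        wal.write("edit-2\n")
        wal.flush()
        wal.close()

        with open(os.path.join(split_dir, "wal.0")) as f:
            contents = f.read().splitlines()

    return new_log_created, contents
```

In this analogy the late edit ends up in the renamed directory, which is exactly the case I'm unsure about for the real WAL.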

One way to prevent clients from getting acks would be to set the client timeout lower than the ZooKeeper session timeout (zookeeper.session.timeout), which seems like a logical thing to do.

But even if the timeouts were set so that the client timed out first, are there scenarios in which the edits would become readable by other clients (say, if that log file was rescanned)?
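The timeout reasoning can be sketched as a simplified timeline. This assumes the request is sent and RS1 hangs at the same moment (t = 0), that RS1 is declared dead exactly when the ZooKeeper session timeout elapses, and that RS1 acks as soon as it wakes up. The function name and the millisecond values are illustrative, not HBase settings or APIs, and the model only covers acks, not whether the edit reaches the WAL.

```python
def stale_ack_possible(hang_ms, client_timeout_ms, zk_session_timeout_ms):
    """Return True if a client could receive an ack from RS1 after RS1
    has already been declared dead (simplified model, see assumptions)."""
    # RS1 is considered dead once its hang outlasts the session timeout.
    rs1_declared_dead = hang_ms >= zk_session_timeout_ms
    # The client only sees the ack if it has not given up yet.
    client_still_waiting = hang_ms < client_timeout_ms
    return rs1_declared_dead and client_still_waiting


# Client timeout above the session timeout: a 40 s hang can yield a stale ack.
print(stale_ack_possible(40_000, 60_000, 30_000))

# Client timeout below the session timeout: the client gives up first.
print(stale_ack_possible(40_000, 20_000, 30_000))
```

Under these assumptions, keeping the client timeout at or below the session timeout rules out a stale ack for any hang length, since the two conditions can no longer hold together.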

Thanks,

-Nick
