WAL, failover questions

Dmitriy Lyubimov Mon, 27 Sep 2010 19:52:53 -0700

Hi,

I would be very grateful if somebody could clarify the following for me.
(We are running 0.20.5.)


Yesterday we lost a small table (~100 rows) in production without a trace.
No matter how deeply I looked in the logs of the regionservers and the
master, I could not find a clue as to how it might have happened.

When I looked at the table, though, I did not find any files, which may mean
that it was never flushed from the WAL or compacted. When I reconstituted
it and ran a compaction, the file finally did appear.

Also, in the .log (hlog) directory, I noticed some duplicate entries with
both long and short host names (i.e. something like 'data4,...' and
'data4.foo.bar,...'), which may date from the moment we decided to switch
to /etc/hosts name resolution instead of DNS (just to see whether that would
help with our networking issues).

So I have the following questions:

1 -- When is a flush from the WAL (or hlog, if that's the same thing?)
triggered and the actual tablet file built? Is it a size-based threshold? Is
there a max age threshold?
2 -- If the region server crashes, its hlog is supposed to be split and
recovered, right? Is there a situation in which the hlog can be lost? I
suppose it doesn't matter if a region server with the same name never comes
online again (e.g. if it starts using the short name instead of the FQDN)?

Thanks.

-Dmitriy
