Re: WAL, failover questions

Dmitriy Lyubimov Tue, 28 Sep 2010 10:53:20 -0700

Thank you, St. Ack.

By 'reconstituting' I meant that the table had no records, so we had to
refill it with the data we kept elsewhere. I couldn't find or figure out
any technique that might help me scavenge it from the hbase files.


So it sounds like a migration is in order.

-Dmitriy


On Tue, Sep 28, 2010 at 9:15 AM, Stack <[email protected]> wrote:

> On Mon, Sep 27, 2010 at 7:52 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
> > Hi,
> >
> > i would be very grateful if somebody could clarify the following for me
> > please.  (0.20.5)
> >
> > yesterday we lost a short table (~100 rows) in production without a
> > trace. no matter how deep i looked in the logs of regionservers and the
> > master, i haven't got a clue how it might have happened.
> >
> > When i looked at the table though, i did not find any files, which may
> > mean that it never got flushed from WAL and got compacted. When i
> > reconstituted it and ran compaction, the file did finally appear.
> >
>
> What did you do to 'reconstitute'?
>
> FYI, edits go first to WAL and then to memstore.  A file will not
> appear in the filesystem until memstore flushes.  100 rows is probably
> not enough to bring on a flush.
>
>
> > also, in the .log (hlog) directory, i noticed some duplicate entries with
> > both long and short host names (i.e. something like 'data4,...' and
> > 'data4.foo.bar,...') which may result from the moment we decided to
> > switch to /etc/hosts name resolution instead of dns (just to see if
> > that'll improve our networking issues).
> >
>
> Yeah, probably.
>
> Our hostname lookup is done once in 0.90.x and forever after we keep
> on w/ that name regardless so this should not happen going forward.
>
>
> > So i have the following questions :
> >
> > 1 -- when is the WAL (or hlog, if it's the same?) triggered and the actual
> > tablet file built? is it a size-based threshold? is there a max age
> > threshold?
>
> Yes, the hbase WAL is also referred to as hlog and our WAL is
> implemented by the o.a.h.h.regionserver.wal.HLog class.
>
> Every edit is first appended to the WAL that each regionserver keeps.
>
> But, in 0.20 hbase, our WAL is mostly ineffective given that there is no
> append support in hadoop 0.20 hdfs; basically only if the file is
> successfully closed will edits be preserved. (This state has changed in
> 0.89.x in that the WAL append now works. Make sure you are running
> with the hadoop 0.20-append branch or CDH3b2 to ensure your 0.89.x
> install WAL works properly.)
>
>
> > 2 -- if the region server crashes, its hlog is supposed to be split and
> > recovered, right?
>
> Yes
>
>
> > is there a situation when hlog can be lost? I suppose it
> > doesn't matter that the region server with the same name never goes
> > online again (e.g. if it starts using short name instead of FQDN)?
> >
>
> I'd have to check the code but I think the messing w/ hostnames would
> not be a reason to lose edits.  I think we go ahead and split the logs
> for files even if they are not associated with a particular host; i.e.
> though you "changed" hostnames, and our WALs are associated with
> hosts, we pick up any straggler log files anyways.  Check your master
> logs to confirm.
>
> St.Ack
>
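
For illustration, the write path described in the quoted reply (each edit is appended to the WAL, then applied to the memstore, and a file only appears once the memstore flushes) can be sketched as a toy model. This is not HBase code: `ToyRegion`, `flush_threshold_bytes` and `crash_and_recover` are hypothetical names, and the 64 KB threshold is an arbitrary assumption chosen only so that 100 small rows stay below it.

```python
class ToyRegion:
    """Toy model (not HBase code): WAL append, then memstore, flush on size."""

    def __init__(self, flush_threshold_bytes):
        # Hypothetical knob standing in for a size-based flush threshold.
        self.flush_threshold_bytes = flush_threshold_bytes
        self.wal = []          # stands in for the hlog
        self.memstore = {}     # in-memory edits not yet written to a file
        self.store_files = []  # each flush produces one immutable "file"

    def memstore_size(self):
        # Rough size estimate: characters in keys plus characters in values.
        return sum(len(k) + len(v) for k, v in self.memstore.items())

    def put(self, row, value):
        self.wal.append((row, value))  # 1. the edit goes to the WAL first
        self.memstore[row] = value     # 2. then to the memstore
        if self.memstore_size() >= self.flush_threshold_bytes:
            self.flush()

    def flush(self):
        # Only here does a file show up "on disk" in this model.
        if self.memstore:
            self.store_files.append(dict(self.memstore))
            self.memstore.clear()
            self.wal.clear()  # flushed edits no longer need the log

    def crash_and_recover(self, wal_survives):
        """Lose the memstore; replay the WAL only if it survived the crash."""
        recovered = ToyRegion(self.flush_threshold_bytes)
        recovered.store_files = list(self.store_files)
        if wal_survives:
            for row, value in self.wal:
                recovered.put(row, value)
        return recovered


# 100 small rows stay well below the assumed 64 KB threshold: no file appears.
region = ToyRegion(flush_threshold_bytes=64 * 1024)
for i in range(100):
    region.put(f"row{i:03d}", "x" * 20)
```

In this model, `region.crash_and_recover(wal_survives=False)` comes back with no files and no rows, which is the shape of the symptom reported above; with `wal_survives=True` the rows are replayed into the memstore. Whether that is what actually happened in the 0.20.5 cluster is not established by this sketch.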
