Hey Asaf,

It looks like you only need HBASE-7122. Either upgrade, or you could patch it in yourself.
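In case it helps, rough versions of the relevant commands (the WAL path and timestamps below are copied from your listing; the table name and peer ZK quorum are placeholders you would need to fill in):

```shell
# Confirm the running HBase version (HBASE-7122 shipped in 0.94.8)
hbase version

# Check whether the 0-length WAL really carries no edits
hadoop fs -cat /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369925171871 | head

# Re-ship the affected time range to the peer with CopyTable
# (<table> and <peer-zk> are placeholders; times are epoch millis,
#  here taken from the lost WAL's timestamp and the RS restart)
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --starttime=1369925171871 --endtime=1370149320299 \
  --peer.adr=<peer-zk>:2181:/hbase <table>
```

These need a live cluster, so treat them as a sketch rather than something to paste verbatim.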
Syncing up the master and slave clusters is also advised, though that mainly applies if you are using master-master replication.

bq. 172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369925171871
bq. Meaning I lost data

Did the log whose znode you deleted have any data? You can cat it to see whether there is any data in it. You could run CopyTable for that time range (or, as a hacky alternative, re-create a znode for that log under the region server, matching the format of its current log znodes, and let the ReplicationSource pick it up in its normal run).

Thanks,
Himanshu

On Sun, Jun 2, 2013 at 12:38 PM, Ted Yu <[email protected]> wrote:

> bq. Is 0.94.8 production ready?
>
> I think so. Lars released 0.94.8 Friday evening.
>
> On Sun, Jun 2, 2013 at 12:26 PM, Asaf Mesika <[email protected]> wrote:
>
> > I use 0.94.7.
> > Is 0.94.8 production ready?
> >
> > So in summary I have two issues:
> > 1. Clocks are out of sync
> > 2. I need to upgrade to 0.94.8 to avoid seeing these WARN messages?
> >
> > On Jun 2, 2013, at 5:46 PM, Ted Yu <[email protected]> wrote:
> >
> > > What is the HBase version you're using ?
> > >
> > > In another thread, I mentioned this:
> > >
> > > There was a recently integrated JIRA (0.94.8):
> > > HBASE-7122 Proper warning message when opening a log file with no entries (idle cluster)
> > >
> > > Does the HBase you're using contain HBASE-7122 ?
> > >
> > > Cheers
> > >
> > > On Sat, Jun 1, 2013 at 10:20 PM, Asaf Mesika <[email protected]> wrote:
> > >
> > >> Hi,
> > >>
> > >> I have a weird error in a cluster I'm checking Replication with.
> > >>
> > >> I have two clusters set up, each on its own DC (different continents).
> > >> Each has 1 master, and 3 RS.
> > >>
> > >> I've done all required setup, started replication and pushed in some
> > >> data into the master. I had an issue where the slave (peer) cluster
> > >> went dead (all RS failed contacting the master), thus replication
> > >> couldn't work.
> > >> This happened right before the weekend, so it was out for 3 days.
> > >>
> > >> Now I'm back in the office - got the slave cluster back up (just the
> > >> RS), and I got some nasty exception in one of the RS of the master
> > >> cluster:
> > >>
> > >> 2013-06-02 04:40:45,903 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 0 rs from peer cluster # c
> > >> 2013-06-02 04:40:45,903 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: c has 0 region servers
> > >> 2013-06-02 04:40:46,903 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since we are unable to replicate, sleeping 1000 times 10
> > >> 2013-06-02 04:40:57,019 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 0 rs from peer cluster # c
> > >> 2013-06-02 04:40:57,019 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: c has 0 region servers
> > >> 2013-06-02 04:40:58,019 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since we are unable to replicate, sleeping 1000 times 10
> > >> 2013-06-02 04:41:08,134 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 1 rs from peer cluster # c
> > >> 2013-06-02 04:41:08,134 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer a72-246-95-86,60020,1370147274693
> > >> 2013-06-02 04:41:08,672 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 1
> > >> 2013-06-02 04:41:08,971 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for position 49882 in hdfs://172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
> > >> 2013-06-02 04:41:08,973 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 1
> > >> 2013-06-02 04:41:08,973 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at 49882
> > >> 2013-06-02 04:41:08,977 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:4 and size: 0
> > >> 2013-06-02 04:41:08,977 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for position 69368 in hdfs://172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
> > >> 2013-06-02 04:41:08,978 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
> > >> 2013-06-02 04:41:08,978 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing to replicate, sleeping 1000 times 1
> > >> 2013-06-02 04:41:09,978 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at 69368
> > >> 2013-06-02 04:41:09,980 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:0 and size: 0
> > >> 2013-06-02 04:41:09,980 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
> > >> 2013-06-02 04:41:09,981 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
> > >> java.io.EOFException
> > >>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
> > >>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
> > >>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
> > >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
> > >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
> > >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
> > >>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> > >>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
> > >>         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
> > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
> > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
> > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
> > >> 2013-06-02 04:41:09,981 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 1
> > >> 2013-06-02 04:41:10,981 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
> > >>
> > >> Ok, so I took a look in HDFS for these logs, and here's what I had in
> > >> store for RS 74:
> > >>
> > >> [email protected]:~$ hadoop fs -ls /hbase/.logs
> > >> Found 3 items
> > >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.74,60020,1369903540894
> > >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/172.25.98.75,60020,1369903540900
> > >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.76,60020,1369903540898
> > >> [email protected]:~$ hadoop fs -ls /hbase/.logs//hbase/.logs/172.25.98.74,60020,1369903540894
> > >> [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1369903540894
> > >> Found 5 items
> > >> -rw-r--r--   3 hadoop supergroup       1380 2013-05-30 08:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369903571677
> > >> -rw-r--r--   3 hadoop supergroup      59110 2013-05-30 09:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369907171734
> > >> -rw-r--r--   3 hadoop supergroup      80508 2013-05-30 12:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369917971811
> > >> -rw-r--r--   3 hadoop supergroup      69368 2013-05-30 13:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
> > >> -rw-r--r--   3 hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369925171871
> > >>
> > >> So I saw the log ending with 871 is of 0 length - so I've decided to
> > >> delete it from ZK and restart RS 74, so its in-memory list will get
> > >> refreshed from ZK.
> > >>
> > >> After that:
> > >>
> > >> - I lost the logs for /hbase/.logs/172.25.98.74,60020,1369903540894 -
> > >> Meaning I lost data
> > >>
> > >> This is HDFS afterwards:
> > >>
> > >> [email protected]:~$ hadoop fs -ls /hbase/.logs/
> > >> Found 3 items
> > >> drwxr-xr-x   - hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/172.25.98.74,60020,1370149320299
> > >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/172.25.98.75,60020,1369903540900
> > >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.76,60020,1369903540898
> > >> [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1370149320299
> > >> Found 1 items
> > >> -rw-r--r--   3 hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/172.25.98.74,60020,1370149320299/172.25.98.74%2C60020%2C1370149320299.1370149321515
> > >>
> > >> - RS 74 is now stuck at the same exception, but on the new WAL which
> > >> just opened but is empty (since no data has been pushed yet):
> > >>
> > >> 2013-06-02 05:19:18,475 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
> > >> 2013-06-02 05:19:18,475 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
> > >> 2013-06-02 05:19:28,475 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515 at 0
> > >> 2013-06-02 05:19:28,477 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
> > >> java.io.EOFException
> > >>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
> > >>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
> > >>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
> > >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
> > >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
> > >>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
> > >>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> > >>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
> > >>         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
> > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
> > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
> > >>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
> > >> 2013-06-02 05:19:28,477 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
> > >> 2013-06-02 05:19:28,477 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
> > >>
> > >> So, after this long explanation - is this a bug?
