Re: Weird Replication exception

Ted Yu Sun, 02 Jun 2013 07:47:57 -0700

What is the HBase version you're using ?

In another thread, I mentioned this:


There was a recently integrated JIRA (0.94.8):
HBASE-7122 Proper warning message when opening a log file with no entries
(idle cluster)

Does the HBase you're using contain HBASE-7122 ?

Cheers

On Sat, Jun 1, 2013 at 10:20 PM, Asaf Mesika <[email protected]> wrote:

> Hi,
>
> I have a weird error in a cluster I'm checking Replication with.
>
> I have two clusters set up, each on its own DC (different continents). Each
> has 1 master, and 3 RS.
>
> I've done all required setup, started replication and pushed in some data
> into the master. I had an issue where the slave (peer) cluster went dead
> (all RS failed contacting the master), thus replication couldn't work. This
> happened right before the weekend, so it was out for 3 days.
>
> Now I'm back in the office - got slave cluster back up (just the RS), and I
> got some nasty exception in one of the RS of the master cluster:
>
> 2013-06-02 04:40:45,903 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
> 0 rs from peer cluster # c
> 2013-06-02 04:40:45,903 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave
> cluster looks down: c has 0 region servers
> 2013-06-02 04:40:46,903 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since
> we are unable to replicate, sleeping 1000 times 10
> 2013-06-02 04:40:57,019 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
> 0 rs from peer cluster # c
> 2013-06-02 04:40:57,019 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave
> cluster looks down: c has 0 region servers
> 2013-06-02 04:40:58,019 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since
> we are unable to replicate, sleeping 1000 times 10
> 2013-06-02 04:41:08,134 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
> 1 rs from peer cluster # c
> 2013-06-02 04:41:08,134 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Choosing peer a72-246-95-86,60020,1370147274693
> 2013-06-02 04:41:08,672 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Replicating 1
> 2013-06-02 04:41:08,971 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for
> position
>  49882 in hdfs://
>
> 172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
> 2013-06-02 04:41:08,973 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Replicated in total: 1
> 2013-06-02 04:41:08,973 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at
> 49882
> 2013-06-02 04:41:08,977 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> currentNbOperations:0 and seenEntries:4 and size: 0
> 2013-06-02 04:41:08,977 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for
> position
>  69368 in hdfs://
>
> 172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
> 2013-06-02 04:41:08,978 INFO
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> Removing 0 logs in the list: []
> 2013-06-02 04:41:08,978 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing
> to replicate, sleeping 1000 times 1
> 2013-06-02 04:41:09,978 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at
> 69368
> 2013-06-02 04:41:09,980 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> currentNbOperations:0 and seenEntries:0 and size: 0
> 2013-06-02 04:41:09,980 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
> 2013-06-02 04:41:09,981 WARN
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>         at
>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>         at
>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>         at
> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
>         at
>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>         at
>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
>         at
>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
> 2013-06-02 04:41:09,981 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
> to open a reader, sleeping 1000 times 1
> 2013-06-02 04:41:10,981 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
>
>
> Ok, so I took a look in HDFS for this logs, and here's what I had in store
> for RS 74:
>
> [email protected]:~$ hadoop fs -ls /hbase/.logs
> Found 3 items
> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/
> 172.25.98.74,60020,1369903540894
> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/
> 172.25.98.75,60020,1369903540900
> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/
> 172.25.98.76,60020,1369903540898
> [email protected]:~$ hadoop fs -ls /hbase/.logs//hbase/.logs/
> 172.25.98.74
> ,60020,1369903540894
> [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74
> ,60020,1369903540894
> Found 5 items
> -rw-r--r--   3 hadoop supergroup       1380 2013-05-30 08:46 /hbase/.logs/
> 172.25.98.74,60020,1369903540894/172.25.98.74
> %2C60020%2C1369903540894.1369903571677
> -rw-r--r--   3 hadoop supergroup      59110 2013-05-30 09:46 /hbase/.logs/
> 172.25.98.74,60020,1369903540894/172.25.98.74
> %2C60020%2C1369903540894.1369907171734
> -rw-r--r--   3 hadoop supergroup      80508 2013-05-30 12:46 /hbase/.logs/
> 172.25.98.74,60020,1369903540894/172.25.98.74
> %2C60020%2C1369903540894.1369917971811
> -rw-r--r--   3 hadoop supergroup      69368 2013-05-30 13:46 /hbase/.logs/
> 172.25.98.74,60020,1369903540894/172.25.98.74
> %2C60020%2C1369903540894.1369921571842
> -rw-r--r--   3 hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/
> 172.25.98.74,60020,1369903540894/172.25.98.74
> %2C60020%2C1369903540894.1369925171871
>
> So I saw the log ending with 871 is of 0 length - so I've decided to delete
> it from ZK and restart RS 74, so its in-memory list will get refreshed from
> ZK.
>
> After that:
>
>    - I lost the logs for /hbase/.logs/172.25.98.74,60020,1369903540894 -
>    Meaning I lost data
>
> This is HDFS afterwards:
>
> [email protected]:~$ hadoop fs -ls /hbase/.logs/
>
> Found 3 items
> drwxr-xr-x   - hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/
> 172.25.98.74,60020,1370149320299
> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/
> 172.25.98.75,60020,1369903540900
> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/
> 172.25.98.76,60020,1369903540898
> [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74
> ,60020,1370149320299
> Found 1 items
> -rw-r--r--   3 hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/
> 172.25.98.74,60020,1370149320299/172.25.98.74
> %2C60020%2C1370149320299.1370149321515
>
>
>    - RS 74 is now stuck at same exception, but on the new WAL which just
>    opened but it empty (since no data has been pushed yet):
>
> 2013-06-02 05:19:18,475 WARN
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
> too long for this file, considering dumping
> 2013-06-02 05:19:18,475 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
> to open a reader, sleeping 1000 times 10
> 2013-06-02 05:19:28,475 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
> log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515 at 0
> 2013-06-02 05:19:28,477 WARN
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
> java.io.EOFException
>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>         at
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>         at
>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>         at
>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>         at
> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
>         at
>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>         at
>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
>         at
>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
> 2013-06-02 05:19:28,477 WARN
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
> too long for this file, considering dumping
> 2013-06-02 05:19:28,477 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
> to open a reader, sleeping 1000 times 10
>
>
> So, after this long explanation - is this a bug?
>

Re: Weird Replication exception

Reply via email to