What is the HBase version you're using ? In another thread, I mentioned this:
There was a recently integrated JIRA (0.94.8): HBASE-7122 Proper warning message when opening a log file with no entries (idle cluster) Does the HBase you're using contain HBASE-7122 ? Cheers On Sat, Jun 1, 2013 at 10:20 PM, Asaf Mesika <[email protected]> wrote: > Hi, > > I have a weird error in a cluster I'm checking Replication with. > > I have two clusters set up, each on its own DC (different continents). Each > has 1 master, and 3 RS. > > I've done all required setup, started replication and pushed in some data > into the master. I had an issue where the slave (peer) cluster went dead > (all RS failed contacting the master), thus replication couldn't work. This > happened right before the weekend, so it was out for 3 days. > > Now I'm back in the office - got slave cluster back up (just the RS), and I > got some nasty exception in one of the RS of the master cluster: > > 2013-06-02 04:40:45,903 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting > 0 rs from peer cluster # c > 2013-06-02 04:40:45,903 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave > cluster looks down: c has 0 region servers > 2013-06-02 04:40:46,903 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since > we are unable to replicate, sleeping 1000 times 10 > 2013-06-02 04:40:57,019 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting > 0 rs from peer cluster # c > 2013-06-02 04:40:57,019 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave > cluster looks down: c has 0 region servers > 2013-06-02 04:40:58,019 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since > we are unable to replicate, sleeping 1000 times 10 > 2013-06-02 04:41:08,134 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting > 1 rs from peer cluster # c > 2013-06-02 04:41:08,134 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Choosing peer a72-246-95-86,60020,1370147274693 > 2013-06-02 04:41:08,672 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Replicating 1 > 2013-06-02 04:41:08,971 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: > Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for > position > 49882 in hdfs:// > > 172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842 > 2013-06-02 04:41:08,973 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Replicated in total: 1 > 2013-06-02 04:41:08,973 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening > log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at > 49882 > 2013-06-02 04:41:08,977 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > currentNbOperations:0 and seenEntries:4 and size: 0 > 2013-06-02 04:41:08,977 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: > Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for > position > 69368 in hdfs:// > > 172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842 > 2013-06-02 04:41:08,978 INFO > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: > Removing 0 logs in the list: [] > 2013-06-02 04:41:08,978 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing > to replicate, sleeping 1000 times 1 > 2013-06-02 04:41:09,978 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening > log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at > 69368 > 2013-06-02 04:41:09,980 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > currentNbOperations:0 and seenEntries:0 and size: 0 > 2013-06-02 04:41:09,980 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening > log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0 > 2013-06-02 04:41:09,981 WARN > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got: > java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:180) > at java.io.DataInputStream.readFully(DataInputStream.java:152) > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470) > at > > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55) > at > > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177) > at > org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742) > at > > org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67) > at > > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563) > at > > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369) > 2013-06-02 04:41:09,981 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable > to open a reader, sleeping 1000 times 1 > 2013-06-02 04:41:10,981 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening > log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0 > > > Ok, so I took a look in HDFS for this logs, and here's what I had in store > for RS 74: > > [email protected]:~$ hadoop fs -ls /hbase/.logs > Found 3 items > drwxr-xr-x - hadoop supergroup 0 2013-05-30 14:46 /hbase/.logs/ > 172.25.98.74,60020,1369903540894 > drwxr-xr-x - hadoop supergroup 0 2013-05-30 12:46 /hbase/.logs/ > 172.25.98.75,60020,1369903540900 > drwxr-xr-x - hadoop supergroup 0 2013-05-30 14:46 /hbase/.logs/ > 172.25.98.76,60020,1369903540898 > [email protected]:~$ hadoop fs -ls /hbase/.logs//hbase/.logs/ > 172.25.98.74 > ,60020,1369903540894 > [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74 > ,60020,1369903540894 > Found 5 items > -rw-r--r-- 3 hadoop supergroup 1380 2013-05-30 08:46 /hbase/.logs/ > 172.25.98.74,60020,1369903540894/172.25.98.74 > %2C60020%2C1369903540894.1369903571677 > -rw-r--r-- 3 hadoop supergroup 59110 2013-05-30 09:46 /hbase/.logs/ > 172.25.98.74,60020,1369903540894/172.25.98.74 > %2C60020%2C1369903540894.1369907171734 > -rw-r--r-- 3 hadoop supergroup 80508 2013-05-30 12:46 /hbase/.logs/ > 172.25.98.74,60020,1369903540894/172.25.98.74 > %2C60020%2C1369903540894.1369917971811 > -rw-r--r-- 3 hadoop supergroup 69368 2013-05-30 13:46 /hbase/.logs/ > 172.25.98.74,60020,1369903540894/172.25.98.74 > %2C60020%2C1369903540894.1369921571842 > -rw-r--r-- 3 hadoop supergroup 0 2013-05-30 14:46 /hbase/.logs/ > 172.25.98.74,60020,1369903540894/172.25.98.74 > %2C60020%2C1369903540894.1369925171871 > > So I saw the log ending with 871 is of 0 length - so I've decided to delete > it from ZK and restart RS 74, so its in-memory list will get refreshed from > ZK. > > After that: > > - I lost the logs for /hbase/.logs/172.25.98.74,60020,1369903540894 - > Meaning I lost data > > This is HDFS afterwards: > > [email protected]:~$ hadoop fs -ls /hbase/.logs/ > > Found 3 items > drwxr-xr-x - hadoop supergroup 0 2013-06-02 05:02 /hbase/.logs/ > 172.25.98.74,60020,1370149320299 > drwxr-xr-x - hadoop supergroup 0 2013-05-30 12:46 /hbase/.logs/ > 172.25.98.75,60020,1369903540900 > drwxr-xr-x - hadoop supergroup 0 2013-05-30 14:46 /hbase/.logs/ > 172.25.98.76,60020,1369903540898 > [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74 > ,60020,1370149320299 > Found 1 items > -rw-r--r-- 3 hadoop supergroup 0 2013-06-02 05:02 /hbase/.logs/ > 172.25.98.74,60020,1370149320299/172.25.98.74 > %2C60020%2C1370149320299.1370149321515 > > > - RS 74 is now stuck at same exception, but on the new WAL which just > opened but it empty (since no data has been pushed yet): > > 2013-06-02 05:19:18,475 WARN > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited > too long for this file, considering dumping > 2013-06-02 05:19:18,475 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable > to open a reader, sleeping 1000 times 10 > 2013-06-02 05:19:28,475 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening > log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515 at 0 > 2013-06-02 05:19:28,477 WARN > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got: > java.io.EOFException > at java.io.DataInputStream.readFully(DataInputStream.java:180) > at java.io.DataInputStream.readFully(DataInputStream.java:152) > at > org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475) > at > org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470) > at > > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55) > at > > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177) > at > org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742) > at > > org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67) > at > > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563) > at > > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369) > 2013-06-02 05:19:28,477 WARN > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited > too long for this file, considering dumping > 2013-06-02 05:19:28,477 DEBUG > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable > to open a reader, sleeping 1000 times 10 > > > So, after this long explanation - is this a bug? >
