I use 0.94.7. Is 0.94.8 production ready?

So in summary I have two issues:
1. Clocks are out of sync
2. I need to upgrade to 0.94.8 to avoid seeing these WARN messages?
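For context, HBASE-7122 (cited in the reply quoted below) is about emitting a proper warning, instead of an EOFException, when a log file with no entries is opened on an idle cluster. A minimal sketch of that idea, using a local file as a stand-in for the HDFS WAL; the function name and messages here are illustrative, not HBase's actual code:

```python
import os

def open_wal_for_replication(path):
    """Return a reader handle for a WAL, or None for a zero-length WAL
    (idle cluster). Simplified local-filesystem sketch of the behavior
    HBASE-7122 aims for; not the actual HBase implementation."""
    if os.path.getsize(path) == 0:
        # Without the fix, SequenceFile.Reader.init() calls readFully()
        # on the missing header and throws java.io.EOFException here.
        print("WARN: WAL %s has no entries (idle cluster?), skipping" % path)
        return None
    return open(path, "rb")
```

So with the fix the zero-length file produces a WARN and is skipped, rather than surfacing as a stack trace on every retry.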
On Jun 2, 2013, at 5:46 PM, Ted Yu <[email protected]> wrote:

> What is the HBase version you're using?
>
> In another thread, I mentioned this:
>
> There was a recently integrated JIRA (0.94.8):
> HBASE-7122 Proper warning message when opening a log file with no entries
> (idle cluster)
>
> Does the HBase you're using contain HBASE-7122?
>
> Cheers
>
> On Sat, Jun 1, 2013 at 10:20 PM, Asaf Mesika <[email protected]> wrote:
>
>> Hi,
>>
>> I have a weird error in a cluster I'm checking Replication with.
>>
>> I have two clusters set up, each in its own DC (on different continents). Each has 1 master and 3 RS.
>>
>> I've done all the required setup, started replication and pushed some data into the master. I had an issue where the slave (peer) cluster went dead (all RS failed contacting the master), so replication couldn't work. This happened right before the weekend, so it was out for 3 days.
>>
>> Now that I'm back in the office I got the slave cluster back up (just the RS), and I got a nasty exception in one of the RS of the master cluster:
>>
>> 2013-06-02 04:40:45,903 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 0 rs from peer cluster # c
>> 2013-06-02 04:40:45,903 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: c has 0 region servers
>> 2013-06-02 04:40:46,903 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since we are unable to replicate, sleeping 1000 times 10
>> 2013-06-02 04:40:57,019 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 0 rs from peer cluster # c
>> 2013-06-02 04:40:57,019 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave cluster looks down: c has 0 region servers
>> 2013-06-02 04:40:58,019 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since we are unable to replicate, sleeping 1000 times 10
>> 2013-06-02 04:41:08,134 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting 1 rs from peer cluster # c
>> 2013-06-02 04:41:08,134 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Choosing peer a72-246-95-86,60020,1370147274693
>> 2013-06-02 04:41:08,672 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicating 1
>> 2013-06-02 04:41:08,971 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for position 49882 in hdfs://172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> 2013-06-02 04:41:08,973 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Replicated in total: 1
>> 2013-06-02 04:41:08,973 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at 49882
>> 2013-06-02 04:41:08,977 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:4 and size: 0
>> 2013-06-02 04:41:08,977 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for position 69368 in hdfs://172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> 2013-06-02 04:41:08,978 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: Removing 0 logs in the list: []
>> 2013-06-02 04:41:08,978 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing to replicate, sleeping 1000 times 1
>> 2013-06-02 04:41:09,978 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at 69368
>> 2013-06-02 04:41:09,980 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: currentNbOperations:0 and seenEntries:0 and size: 0
>> 2013-06-02 04:41:09,980 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
>> 2013-06-02 04:41:09,981 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
>> java.io.EOFException
>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
>> 2013-06-02 04:41:09,981 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 1
>> 2013-06-02 04:41:10,981 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
>>
>> OK, so I took a look in HDFS at these logs, and here's what I had in store for RS 74:
>>
>> [email protected]:~$ hadoop fs -ls /hbase/.logs
>> Found 3 items
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.74,60020,1369903540894
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/172.25.98.75,60020,1369903540900
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.76,60020,1369903540898
>> [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1369903540894
>> Found 5 items
>> -rw-r--r--   3 hadoop supergroup       1380 2013-05-30 08:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369903571677
>> -rw-r--r--   3 hadoop supergroup      59110 2013-05-30 09:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369907171734
>> -rw-r--r--   3 hadoop supergroup      80508 2013-05-30 12:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369917971811
>> -rw-r--r--   3 hadoop supergroup      69368 2013-05-30 13:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> -rw-r--r--   3 hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369925171871
>>
>> So I saw that the log ending with 871 has 0 length, so I decided to delete it from ZK and restart RS 74, so that its in-memory list would get refreshed from ZK.
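The zero-length log found by eyeballing the `hadoop fs -ls` output above can also be found mechanically. A sketch, assuming a local directory stands in for the HDFS `.logs` tree (on a real cluster this check would go through `hadoop fs -ls` or the HDFS client API instead of `os.walk`):

```python
import os

def find_empty_wals(logs_dir):
    """Return sorted paths of zero-length files under logs_dir, e.g. the
    ...1369925171871 WAL from the listing above. Local-filesystem sketch;
    not how HBase itself tracks WALs."""
    empty = []
    for root, _dirs, files in os.walk(logs_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```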
>>
>> After that:
>>
>> - I lost the logs for /hbase/.logs/172.25.98.74,60020,1369903540894 - meaning I lost data
>>
>> This is HDFS afterwards:
>>
>> [email protected]:~$ hadoop fs -ls /hbase/.logs/
>> Found 3 items
>> drwxr-xr-x   - hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/172.25.98.74,60020,1370149320299
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/172.25.98.75,60020,1369903540900
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.76,60020,1369903540898
>> [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1370149320299
>> Found 1 items
>> -rw-r--r--   3 hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/172.25.98.74,60020,1370149320299/172.25.98.74%2C60020%2C1370149320299.1370149321515
>>
>> - RS 74 is now stuck on the same exception, but on the new WAL, which was just opened and is empty (since no data has been pushed yet):
>>
>> 2013-06-02 05:19:18,475 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>> 2013-06-02 05:19:18,475 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>> 2013-06-02 05:19:28,475 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515 at 0
>> 2013-06-02 05:19:28,477 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
>> java.io.EOFException
>>         at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>         at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>         at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>>         at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>         at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
>>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
>> 2013-06-02 05:19:28,477 WARN org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited too long for this file, considering dumping
>> 2013-06-02 05:19:28,477 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable to open a reader, sleeping 1000 times 10
>>
>> So, after this long explanation - is this a bug?
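The "sleeping 1000 times N" / "Waited too long for this file, considering dumping" lines in the logs above can be read as a retry loop with a capped backoff multiplier. A simplified model of that loop, reconstructed from the log format alone (this is not HBase's actual ReplicationSource code; names and thresholds are assumptions):

```python
def open_with_retries(try_open, sleep_ms=1000, max_multiplier=10,
                      sleep=lambda ms: None):
    """Keep trying to open a WAL reader. On each failure log
    'sleeping <sleep_ms> times <multiplier>' and back off, capping the
    multiplier at max_multiplier and warning once the cap is reached.
    Simplified model of the behavior shown in the logs above."""
    multiplier = 1
    while True:
        reader = try_open()
        if reader is not None:
            return reader
        if multiplier >= max_multiplier:
            print("WARN: Waited too long for this file, considering dumping")
        print("DEBUG: Unable to open a reader, sleeping %d times %d"
              % (sleep_ms, multiplier))
        sleep(sleep_ms * multiplier)
        multiplier = min(multiplier + 1, max_multiplier)
```

Under this model, a WAL that never gains a header (like the empty files above) pins the loop at the maximum multiplier, producing the repeating WARN/DEBUG pair seen every ten seconds.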
