bq. Is 0.94.8 production ready? I think so. Lars released 0.94.8 Friday evening.
On Sun, Jun 2, 2013 at 12:26 PM, Asaf Mesika <[email protected]> wrote: > I use 0.94.7. > Is 0.94.8 production ready? > > So in summary I have two issues: > 1. Clocks are out of sync > 2. I need to upgrade to 0.94.8 to avoid seeing this WARN messages? > > On Jun 2, 2013, at 5:46 PM, Ted Yu <[email protected]> wrote: > > > What is the HBase version you're using ? > > > > In another thread, I mentioned this: > > > > There was a recently integrated JIRA (0.94.8): > > HBASE-7122 Proper warning message when opening a log file with no entries > > (idle cluster) > > > > Does the HBase you're using contain HBASE-7122 ? > > > > Cheers > > > > On Sat, Jun 1, 2013 at 10:20 PM, Asaf Mesika <[email protected]> > wrote: > > > >> Hi, > >> > >> I have a weird error in a cluster I'm checking Replication with. > >> > >> I have two clusters set up, each on its own DC (different continents). > Each > >> has 1 master, and 3 RS. > >> > >> I've done all required setup, started replication and pushed in some > data > >> into the master. I had an issue where the slave (peer) cluster went dead > >> (all RS failed contacting the master), thus replication couldn't work. > This > >> happened right before the weekend, so it was out for 3 days. > >> > >> Now I'm back in the office - got slave cluster back up (just the RS), > and I > >> got some nasty exception in one of the RS of the master cluster: > >> > >> 2013-06-02 04:40:45,903 INFO > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Getting > >> 0 rs from peer cluster # c > >> 2013-06-02 04:40:45,903 INFO > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Slave > >> cluster looks down: c has 0 region servers > >> 2013-06-02 04:40:46,903 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Since > >> we are unable to replicate, sleeping 1000 times 10 > >> 2013-06-02 04:40:57,019 INFO > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Getting > >> 0 rs from peer cluster # c > >> 2013-06-02 04:40:57,019 INFO > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Slave > >> cluster looks down: c has 0 region servers > >> 2013-06-02 04:40:58,019 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Since > >> we are unable to replicate, sleeping 1000 times 10 > >> 2013-06-02 04:41:08,134 INFO > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Getting > >> 1 rs from peer cluster # c > >> 2013-06-02 04:41:08,134 INFO > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > >> Choosing peer a72-246-95-86,60020,1370147274693 > >> 2013-06-02 04:41:08,672 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > >> Replicating 1 > >> 2013-06-02 04:41:08,971 INFO > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: > >> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 > for > >> position > >> 49882 in hdfs:// > >> > >> > 172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842 > >> 2013-06-02 04:41:08,973 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > >> Replicated in total: 1 > >> 2013-06-02 04:41:08,973 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Opening > >> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 > at > >> 49882 > >> 2013-06-02 04:41:08,977 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > >> currentNbOperations:0 and seenEntries:4 and size: 0 > >> 2013-06-02 04:41:08,977 INFO > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: > >> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 > for > >> position > >> 69368 in hdfs:// > >> > >> > 172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842 > >> 2013-06-02 04:41:08,978 INFO > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager: > >> Removing 0 logs in the list: [] > >> 2013-06-02 04:41:08,978 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Nothing > >> to replicate, sleeping 1000 times 1 > >> 2013-06-02 04:41:09,978 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Opening > >> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 > at > >> 69368 > >> 2013-06-02 04:41:09,980 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > >> currentNbOperations:0 and seenEntries:0 and size: 0 > >> 2013-06-02 04:41:09,980 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Opening > >> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 > at 0 > >> 2013-06-02 04:41:09,981 WARN > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c > Got: > >> java.io.EOFException > >> at java.io.DataInputStream.readFully(DataInputStream.java:180) > >> at java.io.DataInputStream.readFully(DataInputStream.java:152) > >> at > >> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) > >> at > >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486) > >> at > >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475) > >> at > >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470) > >> at > >> > >> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55) > >> at > >> > >> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177) > >> at > >> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742) > >> at > >> > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67) > >> at > >> > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563) > >> at > >> > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369) > >> 2013-06-02 04:41:09,981 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Unable > >> to open a reader, sleeping 1000 times 1 > >> 2013-06-02 04:41:10,981 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Opening > >> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 > at 0 > >> > >> > >> Ok, so I took a look in HDFS for this logs, and here's what I had in > store > >> for RS 74: > >> > >> [email protected]:~$ hadoop fs -ls /hbase/.logs > >> Found 3 items > >> drwxr-xr-x - hadoop supergroup 0 2013-05-30 14:46 > /hbase/.logs/ > >> 172.25.98.74,60020,1369903540894 > >> drwxr-xr-x - hadoop supergroup 0 2013-05-30 12:46 > /hbase/.logs/ > >> 172.25.98.75,60020,1369903540900 > >> drwxr-xr-x - hadoop supergroup 0 2013-05-30 14:46 > /hbase/.logs/ > >> 172.25.98.76,60020,1369903540898 > >> [email protected]:~$ hadoop fs -ls /hbase/.logs//hbase/.logs/ > >> 172.25.98.74 > >> ,60020,1369903540894 > >> [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74 > >> ,60020,1369903540894 > >> Found 5 items > >> -rw-r--r-- 3 hadoop supergroup 1380 2013-05-30 08:46 > /hbase/.logs/ > >> 172.25.98.74,60020,1369903540894/172.25.98.74 > >> %2C60020%2C1369903540894.1369903571677 > >> -rw-r--r-- 3 hadoop supergroup 59110 2013-05-30 09:46 > /hbase/.logs/ > >> 172.25.98.74,60020,1369903540894/172.25.98.74 > >> %2C60020%2C1369903540894.1369907171734 > >> -rw-r--r-- 3 hadoop supergroup 80508 2013-05-30 12:46 > /hbase/.logs/ > >> 172.25.98.74,60020,1369903540894/172.25.98.74 > >> %2C60020%2C1369903540894.1369917971811 > >> -rw-r--r-- 3 hadoop supergroup 69368 2013-05-30 13:46 > /hbase/.logs/ > >> 172.25.98.74,60020,1369903540894/172.25.98.74 > >> %2C60020%2C1369903540894.1369921571842 > >> -rw-r--r-- 3 hadoop supergroup 0 2013-05-30 14:46 > /hbase/.logs/ > >> 172.25.98.74,60020,1369903540894/172.25.98.74 > >> %2C60020%2C1369903540894.1369925171871 > >> > >> So I saw the log ending with 871 is of 0 length - so I've decided to > delete > >> it from ZK and restart RS 74, so its in-memory list will get refreshed > from > >> ZK. > >> > >> After that: > >> > >> - I lost the logs for /hbase/.logs/172.25.98.74,60020,1369903540894 - > >> Meaning I lost data > >> > >> This is HDFS afterwards: > >> > >> [email protected]:~$ hadoop fs -ls /hbase/.logs/ > >> > >> Found 3 items > >> drwxr-xr-x - hadoop supergroup 0 2013-06-02 05:02 > /hbase/.logs/ > >> 172.25.98.74,60020,1370149320299 > >> drwxr-xr-x - hadoop supergroup 0 2013-05-30 12:46 > /hbase/.logs/ > >> 172.25.98.75,60020,1369903540900 > >> drwxr-xr-x - hadoop supergroup 0 2013-05-30 14:46 > /hbase/.logs/ > >> 172.25.98.76,60020,1369903540898 > >> [email protected]:~$ hadoop fs -ls /hbase/.logs/172.25.98.74 > >> ,60020,1370149320299 > >> Found 1 items > >> -rw-r--r-- 3 hadoop supergroup 0 2013-06-02 05:02 > /hbase/.logs/ > >> 172.25.98.74,60020,1370149320299/172.25.98.74 > >> %2C60020%2C1370149320299.1370149321515 > >> > >> > >> - RS 74 is now stuck at same exception, but on the new WAL which just > >> opened but it empty (since no data has been pushed yet): > >> > >> 2013-06-02 05:19:18,475 WARN > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Waited > >> too long for this file, considering dumping > >> 2013-06-02 05:19:18,475 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Unable > >> to open a reader, sleeping 1000 times 10 > >> 2013-06-02 05:19:28,475 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Opening > >> log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515 > at 0 > >> 2013-06-02 05:19:28,477 WARN > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c > Got: > >> java.io.EOFException > >> at java.io.DataInputStream.readFully(DataInputStream.java:180) > >> at java.io.DataInputStream.readFully(DataInputStream.java:152) > >> at > >> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508) > >> at > >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486) > >> at > >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475) > >> at > >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470) > >> at > >> > >> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55) > >> at > >> > >> > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177) > >> at > >> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742) > >> at > >> > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67) > >> at > >> > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563) > >> at > >> > >> > org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369) > >> 2013-06-02 05:19:28,477 WARN > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Waited > >> too long for this file, considering dumping > >> 2013-06-02 05:19:28,477 DEBUG > >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: > Unable > >> to open a reader, sleeping 1000 times 10 > >> > >> > >> So, after this long explanation - is this a bug? > >> > >
