Dear all,

We have run into a problem with HBase since a recent network update. About 3-4 hours after cluster startup, some of the RegionServers try to read data from a block that has already been deleted.
If we restart the cluster, the problem goes away, and no data is missing. A detailed description of the problem can be found at http://search-hadoop.com/m/ZpgJ623GoyU1/.META.+inconsistency&subj=The+META+data+inconsistency+issue

I have just found something suspicious in the network configuration of our cluster: some nodes have a different broadcast address and netmask from the others. For example, as shown below, hadoopsh11092 uses Bcast 10.255.255.255 and Mask 255.0.0.0, while hadoopsh11103 uses Bcast 10.0.2.255 and Mask 255.255.255.0.

hadoopsh11092
eth0      Link encap:Ethernet  HWaddr 00:A0:D1:EE:C1:7C
          inet addr:10.0.2.19  Bcast:10.255.255.255  Mask:255.0.0.0
          inet6 addr: fe80::2a0:d1ff:feee:c17c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1864321949 errors:0 dropped:1465 overruns:0 frame:0
          TX packets:1867202791 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1811900116811 (1.6 TiB)  TX bytes:1879509303203 (1.7 TiB)
          Memory:face0000-fad00000

hadoopsh11103
eth0      Link encap:Ethernet  HWaddr 00:A0:D1:EE:AE:C4
          inet addr:10.0.2.30  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::2a0:d1ff:feee:aec4/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1726779928 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1716762766 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1804202744690 (1.6 TiB)  TX bytes:1824085255121 (1.6 TiB)
          Memory:face0000-fad00000

Even with these settings, the cluster starts up successfully and works fine at first; the problem only appears after 3-4 hours. I can also SSH to the different machines by host name without any trouble.

I understand that ZooKeeper uses some kind of broadcast in its communication. Should our settings work as they are, or could they be the root cause of our problem?

Thanks in advance.

Best wishes,
Stanley Xu
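For reference, the difference between the two interface settings quoted above can be checked with a short Python sketch using the standard `ipaddress` module. The `nodes` dictionary and the `describe` helper are only illustrative names; the addresses and masks are the ones from the ifconfig output. It only shows which network and broadcast address each node derives from its own mask, and it does not say anything about whether this is the cause of the HBase problem.

```python
import ipaddress

# Address and netmask per host, copied from the ifconfig output in this post.
nodes = {
    "hadoopsh11092": ("10.0.2.19", "255.0.0.0"),
    "hadoopsh11103": ("10.0.2.30", "255.255.255.0"),
}


def describe(addr, mask):
    """Return (network, broadcast address) implied by an address and netmask."""
    iface = ipaddress.ip_interface(f"{addr}/{mask}")
    return iface.network, iface.network.broadcast_address


networks = {}
broadcasts = {}
for host, (addr, mask) in nodes.items():
    net, bcast = describe(addr, mask)
    networks[host] = net
    broadcasts[host] = bcast
    print(f"{host}: network={net} broadcast={bcast}")

# Each node still considers the other one to be on its local network,
# which is consistent with unicast traffic such as SSH working.
for host, net in networks.items():
    for other, (addr, _) in nodes.items():
        if other != host:
            on_link = ipaddress.ip_address(addr) in net
            print(f"{host} sees {other} ({addr}) as on-link: {on_link}")

# The two nodes disagree on the broadcast address, though.
same_broadcast = len(set(broadcasts.values())) == 1
print(f"same broadcast address on both nodes: {same_broadcast}")
```

The sketch reproduces the Bcast values reported by ifconfig, so the two nodes agree that they can reach each other directly but disagree about the size of the subnet and about which address is the broadcast address.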
