Another difference is that, during the network upgrade, we gave each cluster node two network cards: one for 192.168.11.* and another for 10.0.2.*. We also found that ip_forward is turned off on some of the machines (5 of 37).
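For reference, ip_forward is the kernel setting that controls whether a machine routes packets between its interfaces. A minimal sketch for checking it on a node, assuming Linux and the standard /proc path (the function name ip_forward_enabled is only illustrative):

```python
from pathlib import Path

# Standard Linux location of the IPv4 forwarding flag.
IP_FORWARD_PATH = "/proc/sys/net/ipv4/ip_forward"


def ip_forward_enabled(path=IP_FORWARD_PATH):
    """Return True if the file at `path` contains "1" (forwarding on)."""
    return Path(path).read_text().strip() == "1"
```

Calling ip_forward_enabled() on each node and collecting the results would show which machines differ from the rest.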
I know almost nothing about networking, so this might be a stupid question: if ip_forward is not turned on, will that also affect HBase communication? Thanks.

On Sat, May 14, 2011 at 7:39 PM, Stanley Xu <[email protected]> wrote:
> Dear all,
>
> We have met problem with hbase these days after a network update.
> Basically, the behavior is that after 3-4 hours of the cluster startup. Some
> of the RegionServer try to find the data from a deleted block.
>
> And if we restarted the cluster, the problem just went away, and the data
> is not missing.
>
> The detail description of the problem could be found at
>
> http://search-hadoop.com/m/ZpgJ623GoyU1/.META.+inconsistency&subj=The+META+data+inconsistency+issue
>
> I just found some doubt issues in the network configuration of our cluster.
> I found some of the cluster node has different broadcast address and Mask
> comparing to other nodes, for example, as the following, the hadoopsh11092
> use Bcast for 10.255.255.255 and Mask 255.0.0.0, and hadoopsh11103 use Bcast
> for 10.0.2.255 and Mask 255.255.255.0
>
> hadoopsh11092
> eth0      Link encap:Ethernet  HWaddr 00:A0:D1:EE:C1:7C
>           inet addr:10.0.2.19  Bcast:10.255.255.255  Mask:255.0.0.0
>           inet6 addr: fe80::2a0:d1ff:feee:c17c/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1864321949 errors:0 dropped:1465 overruns:0 frame:0
>           TX packets:1867202791 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:1811900116811 (1.6 TiB)  TX bytes:1879509303203 (1.7 TiB)
>           Memory:face0000-fad00000
>
>
> hadoopsh11103
> eth0      Link encap:Ethernet  HWaddr 00:A0:D1:EE:AE:C4
>           inet addr:10.0.2.30  Bcast:10.0.2.255  Mask:255.255.255.0
>           inet6 addr: fe80::2a0:d1ff:feee:aec4/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1726779928 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1716762766 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:1804202744690 (1.6 TiB)  TX bytes:1824085255121 (1.6 TiB)
>           Memory:face0000-fad00000
>
> But with these settings, we could have the cluster startup successfully and
> the cluster works pretty fine after startup, the problem comes after 3-4
> hours. And I could connect to different machine by SSH with their hosts name
> correctly.
>
> I knew that Zookeeper has some kind of broadcast during communication. I am
> wondering if our settings should work, or it should be the root cause of our
> problem?
>
> Thanks in advance.
>
> Best wishes,
> Stanley Xu
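
To make the mask difference in the quoted ifconfig output concrete, here is a small sketch using Python's standard ipaddress module. It only derives the network and broadcast address implied by each address/mask pair; the nodes dictionary and the describe helper are illustrative names, and the sketch says nothing about whether the mismatch causes the HBase problem.

```python
import ipaddress

# Address/mask pairs copied from the quoted ifconfig output.
nodes = {
    "hadoopsh11092": ("10.0.2.19", "255.0.0.0"),
    "hadoopsh11103": ("10.0.2.30", "255.255.255.0"),
}


def describe(addr, mask):
    """Return the (network, broadcast address) implied by addr and mask."""
    iface = ipaddress.ip_interface(f"{addr}/{mask}")
    return iface.network, iface.network.broadcast_address


for host, (addr, mask) in nodes.items():
    network, bcast = describe(addr, mask)
    print(f"{host}: network={network} broadcast={bcast}")
# hadoopsh11092: network=10.0.0.0/8 broadcast=10.255.255.255
# hadoopsh11103: network=10.0.2.0/24 broadcast=10.0.2.255
```

The derived broadcast addresses match the Bcast values each node reports, so each node is internally consistent; the two nodes simply disagree on how large the local network is (/8 versus /24).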
