Getting DNS right can be hard if you're not used to it, and distributed systems depend on it. Making sure that machines don't report themselves to others as being at address 127.0.0.1 is of the utmost importance: a node that advertises the loopback address sends its peers back to themselves instead of to it.
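As a rough sketch of how one could check for this before starting the daemons (this is not part of HBase), the Python snippet below scans the text of a hosts file and lists any name other than the usual localhost aliases that sits on a loopback line. The function name `loopback_hostnames`, the `LOCAL_NAMES` set and the 10.0.0.5 address are made up for illustration.

```python
import ipaddress

# Names that are expected to appear on a loopback line (assumed list).
LOCAL_NAMES = {"localhost", "localhost.localdomain",
               "ip6-localhost", "ip6-loopback"}


def loopback_hostnames(hosts_text):
    """Return non-local names that the hosts text maps to a loopback address."""
    offenders = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # first field is not a valid IP address, skip the line
        if is_loopback:
            offenders.extend(n for n in names if n not in LOCAL_NAMES)
    return offenders


# The hostname shares the 127.0.0.1 line, as in the broken setup.
broken = "127.0.0.1 master localhost localhost.localdomain\n"
# The hostname is on its own line with a routable (hypothetical) address.
fixed = "127.0.0.1 localhost localhost.localdomain\n10.0.0.5 master\n"

print(loopback_hostnames(broken))  # ['master']
print(loopback_hostnames(fixed))   # []
```

To run it against a real machine, pass in the contents of /etc/hosts on each node; an empty list means no hostname is sitting on a loopback line. It only looks at the hosts file, so a hostname that resolves to loopback through some other source would not be caught.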
J-D

On Fri, Nov 12, 2010 at 1:47 AM, Hari Sreekumar wrote:
> Solved the issue now. I had to remove the hostname from the 127.0.0.1 line
> in /etc/hosts on all nodes.
>
> e.g, lets say the node has IP a.b.c.d and name master
> So if initially the /etc/hosts was like this:
> 127.0.0.1 master localhost localhost.localdomain
> ...
> ...
> ...
>
> Now the new /etc/hosts file looks like this:
>
> 127.0.0.1 localhost localhost.localdomain
> ..
>
> ..
> ..
> a.b.c.d master
>
> Doesn't look like a clean solution, but it works for now..
>
> thanks,
> hari
>
> On Fri, Nov 12, 2010 at 1:01 PM, Hari Sreekumar wrote:
>
>> The problem seems to be that the regionservers on the other 2 nodes are not
>> getting connected to the master. The master never sees these other 2 nodes.
>> What could be the reason?
>>
>>
>> On Fri, Nov 12, 2010 at 12:58 PM, Hari Sreekumar wrote:
>>
>>> Also when I stop using *stop-hbase.sh*, the regionservers on my other 2
>>> nodes don't get stopped. I have to separately execute *hbase-daemons.sh
>>> stop regionservers* to stop RS on the other 2 nodes.
>>>
>>>
>>> On Fri, Nov 12, 2010 at 12:57 PM, Hari Sreekumar wrote:
>>>
>>>> Yes, I found this in the regionserver log:
>>>>
>>>> 2010-11-12 18:13:29,094 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
>>>> java.net.ConnectException: Connection refused
>>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>>>>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:844)
>>>>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:716)
>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>>     at $Proxy0.getProtocolVersion(Unknown Source)
>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:489)
>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:465)
>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:512)
>>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:423)
>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1299)
>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1317)
>>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:434)
>>>>     at java.lang.Thread.run(Thread.java:662)
>>>>
>>>> But master is running fine, and I am able to ping the nodes from each
>>>> other.
>>>>
>>>> I found this in the master's log:
>>>> 2010-11-12 12:47:38,908 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 4.0
>>>> 2010-11-12 12:48:25,685 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>}
>>>> 2010-11-12 12:48:25,694 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>>> 2010-11-12 12:48:26,672 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 127.0.0.1:60020, regionname: .META.,,1, startKey: <>}
>>>> 2010-11-12 12:48:26,682 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 2 row(s) of meta region {server: 127.0.0.1:60020, regionname: .META.,,1, startKey: <>} complete
>>>> 2010-11-12 12:48:26,682 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>> 2010-11-12 12:48:38,908 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 4.0
>>>> 2010-11-12 12:49:25,686 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>}
>>>> 2010-11-12 12:49:25,694 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>>> 2010-11-12 12:49:26,672 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 127.0.0.1:60020, regionname: .META.,,1, startKey: <>}
>>>> 2010-11-12 12:49:26,682 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 2 row(s) of meta region {server: 127.0.0.1:60020, regionname: .META.,,1, startKey: <>} complete
>>>> 2010-11-12 12:49:26,682 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>> 2010-11-12 12:49:38,908 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 4.0
>>>> 2010-11-12 12:50:25,685 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>}
>>>> 2010-11-12 12:50:25,694 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>>>
>>>> This is missing something, because I remember earlier when everything was
>>>> working fine I used to get 3 regionservers instead of 1.
>>>> A sample from the earlier log when everything was working fine:
>>>> 2010-11-01 16:00:13,103 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>}
>>>> 2010-11-01 16:00:13,111 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>>> 2010-11-01 16:00:15,702 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 192.168.0.158:35240 got version 47 expected version 3
>>>> 2010-11-01 16:00:20,440 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server 192.168.0.99:60020, location region name .META.,,1
>>>> 2010-11-01 16:00:59,677 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.0.99:60020, regionname: .META.,,1, startKey: <>}
>>>> 2010-11-01 16:00:59,688 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 192.168.0.99:60020, regionname: .META.,,1, startKey: <>} complete
>>>> 2010-11-01 16:00:59,688 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>> 2010-11-01 16:01:11,858 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.0
>>>> 2010-11-01 16:01:13,103 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>}
>>>> 2010-11-01 16:01:13,111 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>>> 2010-11-01 16:01:59,677 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.0.99:60020, regionname: .META.,,1, startKey: <>}
>>>> 2010-11-01 16:01:59,688 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 192.168.0.99:60020, regionname: .META.,,1, startKey: <>} complete
>>>> 2010-11-01 16:01:59,688 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>>> 2010-11-01 16:02:11,859 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.0
>>>> 2010-11-01 16:02:13,104 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>}
>>>> 2010-11-01 16:02:13,117 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>>> 2010-11-01 16:02:55,254 INFO org.apache.hadoop.hbase.regionserver.HLog: HLog configuration: blocksize=67108864, rollsize=63753420, enabled=true, flushlogentries=100, optionallogflushinternal=10000ms
>>>> 2010-11-01 16:02:55,262 INFO org.apache.hadoop.hbase.regionserver.HLog: New hlog /hbase/Webevent/913558333/.logs/hlog.dat.1288607575254
>>>> 2010-11-01 16:02:55,263 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating region Webevent,,1288607575222, encoded=913558333
>>>>
>>>>
>>>> On Thu, Nov 11, 2010 at 10:58 PM, Jean-Daniel Cryans wrote:
>>>>
>>>>> Did you take a look at that region server's log at the time your MR
>>>>> job was running? See any obvious exceptions? Was the machine swapping
>>>>> at that time?
>>>>>
>>>>> J-D
>>>>>
>>>>> On Thu, Nov 11, 2010 at 12:43 AM, Hari Sreekumar wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am getting a lot of these RetriesExhaustedExceptions when I run my m/r
>>>>>> job. This happens with the 116 server only. What could be the issue? I have
>>>>>> checked that RS is running on that server, and 192.168.1.116:60030 is also
>>>>>> working fine..
>>>>>>
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 192.168.1.116:60020 for region Webevent,de6c33d0-4e17-47e5-af8a-f88f0af32235_1273198490000_a53c83e4-7a80-418c-bc99-f2f955bda9b2,1289462602425, row 'e8f3e3c3-606e-4d1b-a84f-94c5421d153f_1273198296000_23717002-51e3-48e8-9fa4-7618e9728b93', but failed after 10 attempts.
>>>>>> Exceptions:
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>> org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>>>>
>>>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1045)
>>>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$3.doCall(HConnectionManager.java:1230)
>>>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1152)
>>>>>>     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1238)
>>>>>>     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:666)
>>>>>>     at org.apache.hadoop.hbase.client.HTable.put(HTable.java:510)
>>>>>>     at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:94)
>>>>>>     at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:55)
>>>>>>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:498)
>>>>>>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>>>>     at BulkUpload$BulkUploadMapper.map(Unknown Source)
>>>>>>     at BulkUpload$BulkUploadMapper.map(Unknown Source)
>>>>>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>>>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>>>
>>>>>> thanks,
>>>>>> hari
