Solved the issue now. I had to remove the hostname from the 127.0.0.1 line in /etc/hosts on all nodes.
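The bad pattern is easy to check for mechanically. Below is a minimal sketch, assuming only the Python standard library; the helper `loopback_hostnames` and its parsing are made up for illustration and are not part of HBase or Hadoop:

```python
# Sketch: flag real hostnames that share the 127.0.0.1 line in an
# /etc/hosts-style file. Hypothetical helper, for illustration only.
def loopback_hostnames(hosts_text):
    flagged = []
    for line in hosts_text.splitlines():
        line = line.split("#")[0].strip()   # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        if ip == "127.0.0.1":
            # localhost aliases belong here; a node's real name does not.
            flagged += [n for n in names
                        if n not in ("localhost", "localhost.localdomain")]
    return flagged

broken = "127.0.0.1 master localhost localhost.localdomain"
fixed = "127.0.0.1 localhost localhost.localdomain\n192.168.0.134 master"
print(loopback_hostnames(broken))  # ['master']
print(loopback_hostnames(fixed))   # []
```

Anything this returns besides the localhost aliases is a hostname that will resolve to the loopback address on that node, which appears to be exactly what bit the cluster in this thread.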
e.g., let's say the node has IP a.b.c.d and name "master". So if initially
/etc/hosts was like this:

127.0.0.1 master localhost localhost.localdomain
...

then the new /etc/hosts file looks like this:

127.0.0.1 localhost localhost.localdomain
...
a.b.c.d master

Doesn't look like a clean solution, but it works for now.

thanks,
hari

On Fri, Nov 12, 2010 at 1:01 PM, Hari Sreekumar <[email protected]> wrote:

> The problem seems to be that the regionservers on the other 2 nodes are
> not getting connected to the master. The master never sees these other 2
> nodes. What could be the reason?
>
> On Fri, Nov 12, 2010 at 12:58 PM, Hari Sreekumar <[email protected]> wrote:
>
>> Also, when I stop HBase using *stop-hbase.sh*, the regionservers on my
>> other 2 nodes don't get stopped. I have to separately execute
>> *hbase-daemons.sh stop regionserver* to stop the RS on the other 2 nodes.
>>
>> On Fri, Nov 12, 2010 at 12:57 PM, Hari Sreekumar <[email protected]> wrote:
>>
>>> Yes, I found this in the regionserver log:
>>>
>>> 2010-11-12 18:13:29,094 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying.
>>> Error was:
>>> java.net.ConnectException: Connection refused
>>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
>>>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:844)
>>>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:716)
>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
>>>     at $Proxy0.getProtocolVersion(Unknown Source)
>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:489)
>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:465)
>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.getProxy(HBaseRPC.java:512)
>>>     at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:423)
>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:1299)
>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1317)
>>>     at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:434)
>>>     at java.lang.Thread.run(Thread.java:662)
>>>
>>> But master is running fine, and I am able to ping the nodes from each other.
>>>
>>> I found this in the master's log:
>>>
>>> 2010-11-12 12:47:38,908 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 4.0
>>> 2010-11-12 12:48:25,685 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>}
>>> 2010-11-12 12:48:25,694 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>> 2010-11-12 12:48:26,672 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 127.0.0.1:60020, regionname: .META.,,1, startKey: <>}
>>> 2010-11-12 12:48:26,682 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 2 row(s) of meta region {server: 127.0.0.1:60020, regionname: .META.,,1, startKey: <>} complete
>>> 2010-11-12 12:48:26,682 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>> 2010-11-12 12:48:38,908 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 4.0
>>> 2010-11-12 12:49:25,686 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>}
>>> 2010-11-12 12:49:25,694 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>> 2010-11-12 12:49:26,672 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 127.0.0.1:60020, regionname: .META.,,1, startKey: <>}
>>> 2010-11-12 12:49:26,682 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 2 row(s) of meta region {server: 127.0.0.1:60020, regionname: .META.,,1, startKey: <>} complete
>>> 2010-11-12 12:49:26,682 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>> 2010-11-12 12:49:38,908 INFO org.apache.hadoop.hbase.master.ServerManager: 1 region servers, 0 dead, average load 4.0
>>> 2010-11-12 12:50:25,685 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>}
>>> 2010-11-12 12:50:25,694 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 127.0.0.1:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>>
>>> This is missing something, because I remember that earlier, when everything was working fine, I used to get 3 regionservers instead of 1.
>>> A sample from the earlier log, when everything was working fine:
>>>
>>> 2010-11-01 16:00:13,103 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>}
>>> 2010-11-01 16:00:13,111 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>> 2010-11-01 16:00:15,702 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 192.168.0.158:35240 got version 47 expected version 3
>>> 2010-11-01 16:00:20,440 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server 192.168.0.99:60020, location region name .META.,,1
>>> 2010-11-01 16:00:59,677 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.0.99:60020, regionname: .META.,,1, startKey: <>}
>>> 2010-11-01 16:00:59,688 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 192.168.0.99:60020, regionname: .META.,,1, startKey: <>} complete
>>> 2010-11-01 16:00:59,688 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>> 2010-11-01 16:01:11,858 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.0
>>> 2010-11-01 16:01:13,103 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>}
>>> 2010-11-01 16:01:13,111 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>> 2010-11-01 16:01:59,677 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 192.168.0.99:60020, regionname: .META.,,1, startKey: <>}
>>> 2010-11-01 16:01:59,688 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scan of 1 row(s) of meta region {server: 192.168.0.99:60020, regionname: .META.,,1, startKey: <>} complete
>>> 2010-11-01 16:01:59,688 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
>>> 2010-11-01 16:02:11,859 INFO org.apache.hadoop.hbase.master.ServerManager: 3 region servers, 0 dead, average load 1.0
>>> 2010-11-01 16:02:13,104 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>}
>>> 2010-11-01 16:02:13,117 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 192.168.0.134:60020, regionname: -ROOT-,,0, startKey: <>} complete
>>> 2010-11-01 16:02:55,254 INFO org.apache.hadoop.hbase.regionserver.HLog: HLog configuration: blocksize=67108864, rollsize=63753420, enabled=true, flushlogentries=100, optionallogflushinternal=10000ms
>>> 2010-11-01 16:02:55,262 INFO org.apache.hadoop.hbase.regionserver.HLog: New hlog /hbase/Webevent/913558333/.logs/hlog.dat.1288607575254
>>> 2010-11-01 16:02:55,263 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Creating region Webevent,,1288607575222, encoded=913558333
>>>
>>> On Thu, Nov 11, 2010 at 10:58 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>
>>>> Did you take a look at that region server's log at the time your MR
>>>> job was running? See any obvious exceptions? Was the machine swapping
>>>> at that time?
>>>>
>>>> J-D
>>>>
>>>> On Thu, Nov 11, 2010 at 12:43 AM, Hari Sreekumar <[email protected]> wrote:
>>>> > Hi,
>>>> >
>>>> > I am getting a lot of these RetriesExhaustedExceptions when I run my
>>>> > m/r job. This happens with the 116 server only. What could be the issue?
>>>> > I have checked that RS is running on that server, and 192.168.1.116:60030
>>>> > is also working fine.
>>>> >
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server 192.168.1.116:60020 for region Webevent,de6c33d0-4e17-47e5-af8a-f88f0af32235_1273198490000_a53c83e4-7a80-418c-bc99-f2f955bda9b2,1289462602425, row 'e8f3e3c3-606e-4d1b-a84f-94c5421d153f_1273198296000_23717002-51e3-48e8-9fa4-7618e9728b93', but failed after 10 attempts.
>>>> > Exceptions:
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed setting up proxy to /192.168.1.116:60020 after attempts=1
>>>> >
>>>> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:1045)
>>>> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$3.doCall(HConnectionManager.java:1230)
>>>> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1152)
>>>> >     at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1238)
>>>> >     at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:666)
>>>> >     at org.apache.hadoop.hbase.client.HTable.put(HTable.java:510)
>>>> >     at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:94)
>>>> >     at org.apache.hadoop.hbase.mapreduce.TableOutputFormat$TableRecordWriter.write(TableOutputFormat.java:55)
>>>> >     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:498)
>>>> >     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>> >     at BulkUpload$BulkUploadMapper.map(Unknown Source)
>>>> >     at BulkUpload$BulkUploadMapper.map(Unknown Source)
>>>> >     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>> >     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>> >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>> >     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>> >
>>>> > thanks,
>>>> > hari
>>>> >
>>>>
>>>
>>
>
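The "Connection refused" and "Failed setting up proxy" retries quoted above ultimately reduce to a plain TCP question: does the address the client resolved for its peer actually have a listener? A minimal probe sketch, assuming only the Python standard library (`can_connect` is a hypothetical helper, not an HBase API; the hostname and port in the comments come from this thread):

```python
# Sketch: probe a host:port the way the HBase RPC client ultimately does,
# to separate "nothing is listening there" (connection refused) from
# "the hostname resolves to the wrong place" (e.g. the loopback address).
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Illustrative checks mirroring this thread's setup:
# print(socket.gethostbyname("master"))    # should NOT be 127.0.0.1
# print(can_connect("192.168.1.116", 60020))
```

If resolving the master's hostname on a worker node returns 127.0.0.1, the probe will "succeed" against the wrong machine, and the /etc/hosts fix at the top of this thread applies.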
