Filed https://issues.apache.org/jira/browse/HBASE-4109
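
For what it's worth, here is a minimal sketch of the quick fix discussed in the quoted thread below, assuming the only problem is the single trailing period that comes back with the PTR record. HostNames and stripTrailingDot are hypothetical names used for illustration, not existing HBase code; the server names are the ones from the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of HBase): normalizes a host name taken from a PTR lookup. */
final class HostNames {
  private HostNames() {}

  /** Drops one trailing '.' (the DNS root label) if present; otherwise returns the input as-is. */
  static String stripTrailingDot(String host) {
    if (host != null && host.endsWith(".")) {
      return host.substring(0, host.length() - 1);
    }
    return host;
  }

  public static void main(String[] args) {
    // Key as the master caches it in the thread: no trailing period.
    Map<String, String> onlineServers = new HashMap<>();
    onlineServers.put("server-2.domain.net,60020,1310684522383", "info");

    // Name as it shows up in the ZooKeeper znode: PTR form, with trailing period.
    String host = "server-2.domain.net.";
    String suffix = ",60020,1310684522383";

    // Lookup with the raw name misses; lookup with the normalized name hits.
    System.out.println(onlineServers.containsKey(host + suffix));                   // false
    System.out.println(onlineServers.containsKey(stripTrailingDot(host) + suffix)); // true
  }
}
```

Applying something like this to the value returned by DNS.getDefaultHost before it is used as machineName would keep the znode name and the onlineServers key in the same form. Whether the normalization belongs there or inside reverseDns is a separate question for the issue.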

On Fri, Jul 15, 2011 at 1:53 PM, Stack <[email protected]> wrote:
> Good on you lads. Can we get a fix in for 0.90.4?
> St.Ack
>
> On Fri, Jul 15, 2011 at 1:02 PM, Shrijeet Paliwal
> <[email protected]> wrote:
> > So the problem is if you are using an interface anything other than
> > 'default' (literally that keyword) DNS.java 's getDefaultHost will return a
> > string which will
> > have a trailing period at the end. Now to me it seems javadoc of reverseDns
> > in DNS.java (see below) is conflicting with what that function is actually
> > doing.
> > It is returning a PTR record while claims it returns a hostname. The PTR
> > record always has period at the end , RFC:
> > http://irbs.net/bog-4.9.5/bog47.html
> >
> > /**
> >  * Returns the hostname associated with the specified IP address by the
> >  * provided nameserver.
> >  *
> >  * @param hostIp
> >  *          The address to reverse lookup
> >  * @param ns
> >  *          The host name of a reachable DNS server
> >  * * @return The host name associated with the provided IP*
> >  * @throws NamingException
> >  *           If a NamingException is encountered
> >  */
> > public static String reverseDns(InetAddress hostIp, String ns)
> >     throws NamingException {
> >   //
> >   // Builds the reverse IP lookup form
> >   // This is formed by reversing the IP numbers and appending in-addr.arpa
> >   //
> >   String[] parts = hostIp.getHostAddress().split("\\.");
> >   String reverseIP = parts[3] + "." + parts[2] + "." + parts[1] + "."
> >       + parts[0] + ".in-addr.arpa";
> >
> >   System.out.println("reverse ip is :" + reverseIP);
> >
> >   DirContext ictx = new InitialDirContext();
> >   Attributes attribute =
> >       ictx.getAttributes("dns://" // Use "dns:///" if the default
> >           + ((ns == null) ? "" : ns) +
> >           // nameserver is to be used
> >           "/" + reverseIP, new String[] { "PTR" });
> >   ictx.close();
> >
> >   * return attribute.get("PTR").get().toString();*
> > }
> >
> >
> > Related issue (I havent gone through it completely but glancing hints it is
> > related).
> > https://issues.apache.org/jira/browse/HBASE-2599 . Thanks Karthick for
> > pointing this out.
> >
> > A quicky is to recognize that default host has a trailing period and drop it
> > when we call it here:
> > String machineName = DNS.getDefaultHost(conf.get(
> >     "hbase.regionserver.dns.interface", "default"), conf.get(
> >     "hbase.regionserver.dns.nameserver", "default"));
> >
> > I will open an issue shortly. Thoughts?
> >
> > -Shrijeet
> > On Fri, Jul 15, 2011 at 10:25 AM, Stack <[email protected]> wrote:
> >
> >> Thanks for digging in Shrijeet. We don't do this name matching well
> >> in 0.90.x Sorry for pain caused. on your observation below about
> >> RegionServerTracker, if you figure an improvement, that'd be great.
> >>
> >> Thanks,
> >> St.Ack
> >>
> >> On Thu, Jul 14, 2011 at 9:07 PM, Shrijeet Paliwal
> >> <[email protected]> wrote:
> >> > I have narrowed it down to following :
> >> >
> >> > // Server to handle client requests
> >> > String machineName = DNS.getDefaultHost(conf.get(
> >> >     "hbase.regionserver.dns.interface", "default"), conf.get(
> >> >     "hbase.regionserver.dns.nameserver", "default"));
> >> >
> >> > I am not using the default interface for RS. I have changed it to 'eth1'.
> >> > The machineName is getting set as 'server-2.rfiserve.net.'
> >> > Notice the extra period in the end.
> >> >
> >> > Because of above there is an inconsistency in the way zookeeper recorded the
> >> > regionserver address and way ServerManager had it in its cached list of
> >> > onlineservers.
> >> > You will notice the extra dot in zookeeper entry but not in the ServerManager
> >> > list.
> >> >
> >> > [zk: localhost:2181(CONNECTED) 3] ls /hbase/rs
> >> > [server-2.domain.net.,60020,1310684522383,server-1.domain.net.,60020,1310680203359]
> >> >
> >> >
> >> > In ServerManager we do following :
> >> >
> >> > void recordNewServer(HServerInfo info, boolean useInfoLoad,
> >> >     HRegionInterface hri) {
> >> >   HServerLoad load = useInfoLoad? info.getLoad(): new HServerLoad();
> >> >   String serverName = info.getServerName();
> >> >   LOG.info("Registering server=" + serverName + ", regionCount=" +
> >> >     load.getLoad() + ", userLoad=" + useInfoLoad);
> >> >   info.setLoad(load);
> >> >   // TODO: Why did we update the RS location ourself? Shouldn't RS do this?
> >> >   // masterStatus.getZooKeeper().updateRSLocationGetWatch(info, watcher);
> >> >   // -- If I understand the question, the RS does not update the location
> >> >   // because could be disagreement over locations because of DNS issues; only
> >> >   // master does DNS now -- St.Ack 20100929.
> >> >   this.onlineServers.put(serverName, info);
> >> >   ......
> >> >
> >> > In RegionServerTracker after node deletion but pre server expiration a map
> >> > lookup happens, it will lookup for server-2.domain.net.,60020,1310684522383
> >> > (with an extra period) but actual key in map is
> >> > server-2.domain.net,60020,1310684522383
> >> > (without the extra period)
> >> >
> >> >
> >> > @Override
> >> > public void nodeDeleted(String path) {
> >> >   if(path.startsWith(watcher.rsZNode)) {
> >> >     String serverName = ZKUtil.getNodeName(path);
> >> >     LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
> >> >         serverName + "]");
> >> >     HServerInfo hsi = serverManager.getServerInfo(serverName);
> >> >     if(hsi == null) {
> >> >       LOG.info("No HServerInfo found for " + serverName);
> >> >       return;
> >> >     }
> >> >     serverManager.expireServer(hsi);
> >> >   }
> >> > }
> >> >
> >> > The lookup will fail and expiration will never happen. I will get back when
> >> > I have more details on why the DNS is being returned as such.
> >> > An interesting question is - is it ok to not expire the region server when
> >> > we already deleted the entry of the RS from zookeeper.
> >> >
> >> > On Thu, Jul 14, 2011 at 4:32 PM, Shrijeet Paliwal
> >> > <[email protected]>wrote:
> >> >
> >> >> Hi Everyone,
> >> >>
> >> >> Hbase Version: 0.90.3
> >> >> Hadoop Version: cdh3u0
> >> >> 2 region servers, zookeeper quorum managed by hbase.
> >> >>
> >> >> I was doing some tests and it seemed regions are not getting reassigned by
> >> >> master if RS is brought down.
> >> >> Here are the steps:
> >> >>
> >> >> 0. Cluster in a steady state. Pick a random key: k1 belonging to a RS: rs1
> >> >> and perform a get from shell. Result comes back fine.
> >> >> 1. Bring down rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
> >> >> /usr/lib/hbase-0.20/conf/ stop regionserver]
> >> >> 2. Wait few second and do a get from shell for k1 again. k1 is still being
> >> >> located at rs1 and RetriesExhaustedException occurs.
> >> >> 3. Wait few minutes and do a get from shell for k1 again. k1 is still being
> >> >> located at rs1 and RetriesExhaustedException occurs.
> >> >> 4. Bring up rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
> >> >> /usr/lib/hbase-0.20/conf/ start regionserver]
> >> >> 5. A get from shell brings back the result just fine.
> >> >>
> >> >> My hope at step (3) was a reassignment of regions and get should have
> >> >> succeeded. 0.90.2 has introduced process to do things more gracefully which
> >> >> is great,
> >> >> but that (graceful shutdown) is not always possible.
> >> >> I have pastebin-ed the relevant logs. Can anyone help me understand the
> >> >> scenario?
> >> >>
> >> >> Hbase Shell after RS brought down
> >> >> http://pastebin.com/8bvk5RFV
> >> >>
> >> >> RS log around time it was brought down
> >> >> http://pastebin.com/sgVRVCCj
> >> >>
> >> >> Zkdump after RS brought down
> >> >> http://pastebin.com/meyqCVJ0
> >> >>
> >> >> Hmaster log around time RS was brought down
> >> >> http://pastebin.com/jBGKuy74
> >> >>
> >> >> hbck after RS brought down
> >> >> http://pastebin.com/bxvyTTF5
> >> >>
> >> >> hbck after RS brought up
> >> >> http://pastebin.com/FPxvT9qW
> >> >>
> >> >
> >> >
> >
