The problem is that if you use any interface other than 'default'
(literally that keyword), getDefaultHost in DNS.java returns a string with
a trailing period. To me, the javadoc of reverseDns in DNS.java (see below)
conflicts with what that function actually does: it claims to return a
hostname, but it returns the PTR record value. The PTR record always has a
period at the end; see:
http://irbs.net/bog-4.9.5/bog47.html
/**
 * Returns the hostname associated with the specified IP address by the
 * provided nameserver.
 *
 * @param hostIp
 *          The address to reverse lookup
 * @param ns
 *          The host name of a reachable DNS server
 * @return The host name associated with the provided IP
 * @throws NamingException
 *           If a NamingException is encountered
 */
public static String reverseDns(InetAddress hostIp, String ns)
    throws NamingException {
  //
  // Builds the reverse IP lookup form
  // This is formed by reversing the IP numbers and appending in-addr.arpa
  //
  String[] parts = hostIp.getHostAddress().split("\\.");
  String reverseIP = parts[3] + "." + parts[2] + "." + parts[1] + "."
      + parts[0] + ".in-addr.arpa";
  System.out.println("reverse ip is :" + reverseIP);
  DirContext ictx = new InitialDirContext();
  Attributes attribute =
      ictx.getAttributes("dns://"          // Use "dns:///" if the default
          + ((ns == null) ? "" : ns) +     // nameserver is to be used
          "/" + reverseIP, new String[] { "PTR" });
  ictx.close();
  return attribute.get("PTR").get().toString();
}
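To make the "reverse IP lookup form" concrete without touching the network, here is a minimal sketch of just the name-building step. The class name ReverseLookupName and the helper toReverseName are hypothetical and not part of DNS.java; the sketch assumes a dotted IPv4 string as input.

```java
class ReverseLookupName {
    // Hypothetical helper (not in DNS.java): builds the in-addr.arpa name
    // that a PTR query would be sent for. Assumes a dotted IPv4 address.
    static String toReverseName(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected a dotted IPv4 address: " + ipv4);
        }
        // Reverse the four octets and append the in-addr.arpa suffix.
        return parts[3] + "." + parts[2] + "." + parts[1] + "." + parts[0]
            + ".in-addr.arpa";
    }

    public static void main(String[] args) {
        System.out.println(toReverseName("10.1.2.3"));
    }
}
```

The value stored under that name is the PTR record, which is where the fully qualified form with the trailing period (for example server-2.domain.net.) comes from.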
Related issue (I haven't gone through it completely, but at a glance it
looks related):
https://issues.apache.org/jira/browse/HBASE-2599 . Thanks, Karthick, for
pointing this out.
A quick fix is to recognize that the default host has a trailing period and
drop it where we call it here:
String machineName = DNS.getDefaultHost(conf.get(
"hbase.regionserver.dns.interface", "default"), conf.get(
"hbase.regionserver.dns.nameserver", "default"));
I will open an issue shortly. Thoughts?
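A minimal sketch of that quick fix, assuming a hypothetical helper stripTrailingDot (it does not exist in DNS.java or HBase; the name and the class HostNameNormalizer are made up for illustration). The map stands in for ServerManager's online-servers map and uses the key format from the earlier message in this thread.

```java
import java.util.HashMap;
import java.util.Map;

class HostNameNormalizer {
    // Hypothetical helper: drops a single trailing period, if present,
    // so a PTR-style name matches the plain hostname form.
    static String stripTrailingDot(String host) {
        if (host != null && host.endsWith(".")) {
            return host.substring(0, host.length() - 1);
        }
        return host;
    }

    public static void main(String[] args) {
        // What the reverse lookup hands back on a non-default interface.
        String fromPtr = "server-2.domain.net.";
        String machineName = stripTrailingDot(fromPtr);

        // Stand-in for the master's map, keyed without the trailing period.
        Map<String, String> onlineServers = new HashMap<>();
        onlineServers.put("server-2.domain.net,60020,1310684522383", "info");

        String suffix = ",60020,1310684522383";
        System.out.println(onlineServers.containsKey(machineName + suffix));
        System.out.println(onlineServers.containsKey(fromPtr + suffix));
    }
}
```

With the period dropped the lookup key matches the map key; with the raw PTR value it does not, which is the mismatch described in the quoted message below.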
-Shrijeet
On Fri, Jul 15, 2011 at 10:25 AM, Stack <[email protected]> wrote:
> Thanks for digging in, Shrijeet. We don't do this name matching well
> in 0.90.x. Sorry for the pain caused. On your observation below about
> RegionServerTracker, if you figure out an improvement, that'd be great.
>
> Thanks,
> St.Ack
>
> On Thu, Jul 14, 2011 at 9:07 PM, Shrijeet Paliwal
> <[email protected]> wrote:
> > I have narrowed it down to following :
> >
> > // Server to handle client requests
> > String machineName = DNS.getDefaultHost(conf.get(
> > "hbase.regionserver.dns.interface", "default"), conf.get(
> > "hbase.regionserver.dns.nameserver", "default"));
> >
> > I am not using the default interface for the RS. I have changed it to
> > 'eth1'. The machineName is getting set to 'server-2.rfiserve.net.'
> > Notice the extra period at the end.
> >
> > Because of the above, there is an inconsistency between the way zookeeper
> > recorded the regionserver address and the way ServerManager had it in its
> > cached list of onlineservers.
> > You will notice the extra dot in the zookeeper entry but not in the
> > ServerManager list.
> >
> > [zk: localhost:2181(CONNECTED) 3] ls /hbase/rs
> > [server-2.domain.net.,60020,1310684522383,server-1.domain.net.,60020,1310680203359]
> >
> >
> > In ServerManager we do following :
> >
> > void recordNewServer(HServerInfo info, boolean useInfoLoad,
> > HRegionInterface hri) {
> > HServerLoad load = useInfoLoad? info.getLoad(): new HServerLoad();
> > String serverName = info.getServerName();
> > LOG.info("Registering server=" + serverName + ", regionCount=" +
> > load.getLoad() + ", userLoad=" + useInfoLoad);
> > info.setLoad(load);
> > // TODO: Why did we update the RS location ourself? Shouldn't RS do this?
> > // masterStatus.getZooKeeper().updateRSLocationGetWatch(info, watcher);
> > // -- If I understand the question, the RS does not update the location
> > // because could be disagreement over locations because of DNS issues;
> > // only master does DNS now -- St.Ack 20100929.
> > this.onlineServers.put(serverName, info);
> > ......
> >
> > In RegionServerTracker, after node deletion but before server expiration,
> > a map lookup happens. It looks up server-2.domain.net.,60020,1310684522383
> > (with an extra period), but the actual key in the map is
> > server-2.domain.net,60020,1310684522383 (without the extra period).
> >
> >
> > @Override
> > public void nodeDeleted(String path) {
> > if(path.startsWith(watcher.rsZNode)) {
> > String serverName = ZKUtil.getNodeName(path);
> > LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
> > serverName + "]");
> > HServerInfo hsi = serverManager.getServerInfo(serverName);
> > if(hsi == null) {
> > LOG.info("No HServerInfo found for " + serverName);
> > return;
> > }
> > serverManager.expireServer(hsi);
> > }
> > }
> >
> > The lookup will fail and expiration will never happen. I will get back
> > when I have more details on why the DNS name is being returned as such.
> > An interesting question: is it OK to not expire the region server when
> > we have already deleted the RS's entry from zookeeper?
> >
> > On Thu, Jul 14, 2011 at 4:32 PM, Shrijeet Paliwal
> > <[email protected]> wrote:
> >
> >> Hi Everyone,
> >>
> >> Hbase Version: 0.90.3
> >> Hadoop Version: cdh3u0
> >> 2 region servers, zookeeper quorum managed by hbase.
> >>
> >> I was doing some tests, and it seems regions are not getting reassigned
> >> by the master if an RS is brought down.
> >> Here are the steps:
> >>
> >> 0. Cluster in a steady state. Pick a random key, k1, belonging to an RS,
> >> rs1, and perform a get from shell. Result comes back fine.
> >> 1. Bring down rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
> >> /usr/lib/hbase-0.20/conf/ stop regionserver]
> >> 2. Wait a few seconds and do a get from shell for k1 again. k1 is still
> >> being located at rs1 and RetriesExhaustedException occurs.
> >> 3. Wait a few minutes and do a get from shell for k1 again. k1 is still
> >> being located at rs1 and RetriesExhaustedException occurs.
> >> 4. Bring up rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
> >> /usr/lib/hbase-0.20/conf/ start regionserver]
> >> 5. A get from shell brings back the result just fine.
> >>
> >> My hope at step (3) was that the regions would be reassigned and the get
> >> would succeed. 0.90.2 introduced a process to do things more gracefully,
> >> which is great, but that (graceful shutdown) is not always possible.
> >> I have pastebin-ed the relevant logs. Can anyone help me understand the
> >> scenario?
> >>
> >> Hbase Shell after RS brought down
> >> http://pastebin.com/8bvk5RFV
> >>
> >> RS log around time it was brought down
> >> http://pastebin.com/sgVRVCCj
> >>
> >> Zkdump after RS brought down
> >> http://pastebin.com/meyqCVJ0
> >>
> >> Hmaster log around time RS was brought down
> >> http://pastebin.com/jBGKuy74
> >>
> >> hbck after RS brought down
> >> http://pastebin.com/bxvyTTF5
> >>
> >> hbck after RS brought up
> >> http://pastebin.com/FPxvT9qW
> >>
> >
>