Filed https://issues.apache.org/jira/browse/HBASE-4109
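
For reference, here is a minimal sketch of the quick fix discussed below, assuming the only difference between the two names is a single trailing period. HostNames and stripTrailingDot are hypothetical names used for illustration, not existing HBase code, and the fix that lands on the issue may take a different approach. The map is only a stand-in for the master's onlineServers map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of HBase.
final class HostNames {
  private HostNames() {}

  /** Drops a single trailing period, as found on a fully qualified PTR name. */
  static String stripTrailingDot(String host) {
    if (host != null && host.endsWith(".")) {
      return host.substring(0, host.length() - 1);
    }
    return host;
  }

  public static void main(String[] args) {
    // Stand-in for the master's map of online servers, keyed without the period.
    Map<String, String> onlineServers = new HashMap<String, String>();
    onlineServers.put("server-2.domain.net,60020,1310684522383", "info");

    String fromDns = "server-2.domain.net.";  // name as returned by the reverse lookup
    String suffix = ",60020,1310684522383";   // port and start code from the thread

    // The raw name misses the key (prints false).
    System.out.println(onlineServers.containsKey(fromDns + suffix));
    // The normalized name finds it (prints true).
    System.out.println(onlineServers.containsKey(stripTrailingDot(fromDns) + suffix));
  }
}
```

Normalizing once, at the point where the regionserver computes its machine name, would keep the zookeeper entry and the master's map keyed on the same string.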

On Fri, Jul 15, 2011 at 1:53 PM, Stack <[email protected]> wrote:

> Good on you lads.  Can we get a fix in for 0.90.4?
> St.Ack
>
> On Fri, Jul 15, 2011 at 1:02 PM, Shrijeet Paliwal
> <[email protected]> wrote:
> > So the problem is that if you use any interface other than 'default'
> > (literally that keyword), getDefaultHost in DNS.java returns a string
> > with a trailing period. To me it seems the javadoc of reverseDns in
> > DNS.java (see below) conflicts with what that function actually does:
> > it returns a PTR record while claiming to return a hostname. The PTR
> > record always has a period at the end; see
> > http://irbs.net/bog-4.9.5/bog47.html
> >
> >  /**
> >   * Returns the hostname associated with the specified IP address by the
> >   * provided nameserver.
> >   *
> >   * @param hostIp
> >   *            The address to reverse lookup
> >   * @param ns
> >   *            The host name of a reachable DNS server
> > *   * @return The host name associated with the provided IP*
> >   * @throws NamingException
> >   *             If a NamingException is encountered
> >   */
> >  public static String reverseDns(InetAddress hostIp, String ns)
> >    throws NamingException {
> >    //
> >    // Builds the reverse IP lookup form
> >    // This is formed by reversing the IP numbers and appending in-addr.arpa
> >    //
> >    String[] parts = hostIp.getHostAddress().split("\\.");
> >    String reverseIP = parts[3] + "." + parts[2] + "." + parts[1] + "."
> >      + parts[0] + ".in-addr.arpa";
> >
> >    System.out.println("reverse ip is :" + reverseIP);
> >
> >    DirContext ictx = new InitialDirContext();
> >    // Use "dns:///" if the default nameserver is to be used
> >    Attributes attribute =
> >      ictx.getAttributes("dns://" + ((ns == null) ? "" : ns) +
> >                         "/" + reverseIP, new String[] { "PTR" });
> >    ictx.close();
> >
> > *    return attribute.get("PTR").get().toString();*
> >  }
> >
> >
> > Related issue (I haven't gone through it completely, but at a glance it
> > looks related):
> > https://issues.apache.org/jira/browse/HBASE-2599 . Thanks Karthick for
> > pointing this out.
> >
> > A quick fix is to recognize that the default host has a trailing period
> > and drop it when we call it here:
> >  String machineName = DNS.getDefaultHost(conf.get(
> >        "hbase.regionserver.dns.interface", "default"), conf.get(
> >        "hbase.regionserver.dns.nameserver", "default"));
> >
> > I will open an issue shortly.  Thoughts?
> >
> > -Shrijeet
> > On Fri, Jul 15, 2011 at 10:25 AM, Stack <[email protected]> wrote:
> >
> >> Thanks for digging in, Shrijeet.  We don't do this name matching well
> >> in 0.90.x.  Sorry for the pain caused.  On your observation below about
> >> RegionServerTracker, if you figure out an improvement, that'd be great.
> >>
> >> Thanks,
> >> St.Ack
> >>
> >> On Thu, Jul 14, 2011 at 9:07 PM, Shrijeet Paliwal
> >> <[email protected]> wrote:
> >> > I have narrowed it down to the following:
> >> >
> >> >  // Server to handle client requests
> >> >    String machineName = DNS.getDefaultHost(conf.get(
> >> >        "hbase.regionserver.dns.interface", "default"), conf.get(
> >> >        "hbase.regionserver.dns.nameserver", "default"));
> >> >
> >> > I am not using the default interface for the RS; I have changed it to
> >> > 'eth1'. The machineName is getting set as 'server-2.rfiserve.net.'
> >> > Notice the extra period at the end.
> >> >
> >> > Because of the above, there is an inconsistency between the way zookeeper
> >> > recorded the regionserver address and the way ServerManager has it in its
> >> > cached list of onlineServers. You will notice the extra dot in the
> >> > zookeeper entry but not in the ServerManager list.
> >> >
> >> > [zk: localhost:2181(CONNECTED) 3] ls /hbase/rs
> >> > [server-2.domain.net.,60020,1310684522383,server-1.domain.net.,60020,1310680203359]
> >> >
> >> >
> >> > In ServerManager we do the following:
> >> >
> >> > void recordNewServer(HServerInfo info, boolean useInfoLoad,
> >> >      HRegionInterface hri) {
> >> >    HServerLoad load = useInfoLoad? info.getLoad(): new HServerLoad();
> >> >    String serverName = info.getServerName();
> >> >    LOG.info("Registering server=" + serverName + ", regionCount=" +
> >> >      load.getLoad() + ", userLoad=" + useInfoLoad);
> >> >    info.setLoad(load);
> >> >    // TODO: Why did we update the RS location ourself?  Shouldn't RS do this?
> >> >    // masterStatus.getZooKeeper().updateRSLocationGetWatch(info, watcher);
> >> >    // -- If I understand the question, the RS does not update the location
> >> >    // because could be disagreement over locations because of DNS issues; only
> >> >    // master does DNS now -- St.Ack 20100929.
> >> >    this.onlineServers.put(serverName, info);
> >> > ......
> >> >
> >> > In RegionServerTracker, after the node deletion but before server
> >> > expiration, a map lookup happens. It looks up
> >> > server-2.domain.net.,60020,1310684522383 (with an extra period), but the
> >> > actual key in the map is
> >> > server-2.domain.net,60020,1310684522383 (without the extra period).
> >> >
> >> >
> >> >  @Override
> >> >  public void nodeDeleted(String path) {
> >> >    if(path.startsWith(watcher.rsZNode)) {
> >> >      String serverName = ZKUtil.getNodeName(path);
> >> >      LOG.info("RegionServer ephemeral node deleted, processing expiration [" +
> >> >          serverName + "]");
> >> >      HServerInfo hsi = serverManager.getServerInfo(serverName);
> >> >      if(hsi == null) {
> >> >        LOG.info("No HServerInfo found for " + serverName);
> >> >        return;
> >> >      }
> >> >      serverManager.expireServer(hsi);
> >> >    }
> >> >  }
> >> >
> >> > The lookup will fail and expiration will never happen. I will get back
> >> > when I have more details on why DNS is returning the name this way.
> >> > An interesting question: is it OK to not expire the region server when
> >> > we have already deleted the RS's entry from zookeeper?
> >> >
> >> > On Thu, Jul 14, 2011 at 4:32 PM, Shrijeet Paliwal
> >> > <[email protected]> wrote:
> >> >
> >> >> Hi Everyone,
> >> >>
> >> >> Hbase Version: 0.90.3
> >> >> Hadoop Version: cdh3u0
> >> >> 2 region servers, zookeeper quorum managed by hbase.
> >> >>
> >> >> I was doing some tests and it seems regions are not getting reassigned
> >> >> by the master if a RS is brought down.
> >> >> Here are the steps:
> >> >>
> >> >> 0. Cluster in a steady state. Pick a random key, k1, belonging to a RS,
> >> >> rs1, and perform a get from the shell. The result comes back fine.
> >> >> 1. Bring down rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
> >> >> /usr/lib/hbase-0.20/conf/ stop regionserver]
> >> >> 2. Wait a few seconds and do a get from the shell for k1 again. k1 is
> >> >> still being located at rs1 and a RetriesExhaustedException occurs.
> >> >> 3. Wait a few minutes and do a get from the shell for k1 again. k1 is
> >> >> still being located at rs1 and a RetriesExhaustedException occurs.
> >> >> 4. Bring up rs1 using [/usr/lib/hbase-0.20/bin/hbase-daemon.sh --config
> >> >> /usr/lib/hbase-0.20/conf/ start regionserver]
> >> >> 5. A get from the shell brings back the result just fine.
> >> >>
> >> >> My hope at step (3) was that the regions would be reassigned and the get
> >> >> would succeed. 0.90.2 introduced a process to do things more gracefully,
> >> >> which is great, but that (graceful shutdown) is not always possible.
> >> >> I have pastebin-ed the relevant logs. Can anyone help me understand the
> >> >> scenario?
> >> >>
> >> >> Hbase Shell after RS brought down
> >> >> http://pastebin.com/8bvk5RFV
> >> >>
> >> >> RS log around time it was brought down
> >> >> http://pastebin.com/sgVRVCCj
> >> >>
> >> >> Zkdump after RS brought down
> >> >> http://pastebin.com/meyqCVJ0
> >> >>
> >> >> Hmaster log around time RS was brought down
> >> >> http://pastebin.com/jBGKuy74
> >> >>
> >> >> hbck after RS brought down
> >> >> http://pastebin.com/bxvyTTF5
> >> >>
> >> >> hbck after RS brought up
> >> >> http://pastebin.com/FPxvT9qW
> >> >>
> >> >
> >>
> >
>
