Nope, $HOSTNAME is *.service.consul. But I actually got it working now. The problem was that the sequenceiq Ambari setup installed a Consul Docker instance with a nameserver that answered with the wrong hostname. When I removed the nameserver from /etc/resolv.conf and instead added the correct *.service.consul hostnames to /etc/hosts, everything worked! :-)
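For anyone hitting the same thing, here is a minimal sketch of how the mismatch could be checked from Python. It is not what I actually ran. The function names are made up for illustration, and the IPs and hostnames are simply the ones from my two containers. The idea is to compare the name each IP is expected to have with the name a reverse lookup returns.

```python
import socket

# Expected IP -> hostname pairs, taken from the /etc/hosts files in this thread.
EXPECTED = {
    "172.17.0.89": "amb1.service.consul",
    "172.17.0.90": "amb2.service.consul",
}


def system_reverse_lookup(ip):
    """Reverse-resolve ip through the OS resolver; return None if nothing is found."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # socket.herror and socket.gaierror are subclasses of OSError
        return None


def reverse_mismatches(expected, reverse_lookup):
    """Return {ip: (expected_name, actual_name)} for every IP whose
    reverse lookup does not give the expected hostname."""
    mismatches = {}
    for ip, name in expected.items():
        actual = reverse_lookup(ip)
        if actual != name:
            mismatches[ip] = (name, actual)
    return mismatches
```

Calling `reverse_mismatches(EXPECTED, system_reverse_lookup)` on each container should return an empty dict once /etc/hosts is the only source of names. Before the fix I would expect it to report the *.node.dc1.consul name for the other machine's IP, matching what ping showed.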
Thanks for your help, much appreciated!

On Tue, Nov 24, 2015 at 1:21 PM, Samir Ahmic <[email protected]> wrote:

> Your hosts file looks fine. If I understand correctly, the value of the
> $HOSTNAME env variable is *.node.dc1.consul? Try changing the servers'
> hostname to *.service.consul.
> Also try to disable resolution by the DNS server: comment out all lines in
> /etc/resolv.conf.
>
> Regards
> Samir
>
> On Tue, Nov 24, 2015 at 12:29 PM, Kristoffer Sjögren <[email protected]> wrote:
>
>> Only one network interface on all machines. The ping is interesting,
>> both machines respond with *.node.dc1.consul but internally
>> *.service.consul.
>>
>> amb1.service.consul /etc/hosts
>> 172.17.0.89 amb1.service.consul amb1
>> 127.0.0.1 localhost
>> ::1 localhost ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>>
>> amb2.service.consul /etc/hosts
>> 172.17.0.90 amb2.service.consul amb2
>> 127.0.0.1 localhost
>> ::1 localhost ip6-localhost ip6-loopback
>> fe00::0 ip6-localnet
>> ff00::0 ip6-mcastprefix
>> ff02::1 ip6-allnodes
>> ff02::2 ip6-allrouters
>>
>>
>> ping amb1 from amb1.service.consul
>>
>> PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
>> 64 bytes from amb1.service.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.059 ms
>>
>> ping amb2 from amb1.service.consul
>>
>> PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
>> 64 bytes from amb2.node.dc1.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.069 ms
>>
>> ping amb1 from amb2.service.consul
>>
>> PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
>> 64 bytes from amb1.node.dc1.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.070 ms
>>
>> ping amb2 from amb2.service.consul
>>
>> PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
>> 64 bytes from amb2.service.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.054 ms
>>
>> On Tue, Nov 24, 2015 at 11:58 AM, Samir Ahmic <[email protected]> wrote:
>> > As I can see from the logs, you also have an issue with connecting to zk.
>> > The configuration points to the correct server, but server resolution
>> > produces wrong values. Do you have multiple network interfaces on the
>> > servers? What does ping $HOSTNAME return? What do you have in the
>> > /etc/hosts file? Do you have some local nameserver running on the servers?
>> >
>> > Regards
>> > Samir
>> > On Nov 24, 2015 11:21 AM, "Kristoffer Sjögren" <[email protected]> wrote:
>> >
>> >> The logs on the region server [1] are also quite interesting.
>> >>
>> >> Before I restarted the cluster, the region server complained that
>> >> amb2.node.dc1.consul hijacked the regions from
>> >> amb2.service.consul.
>> >>
>> >> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
>> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
>> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> >> to RS_ZK_REGION_OPENING failed, the server that tried to transition
>> >> was amb2.node.dc1.consul,16020,1448353564099 not the expected
>> >> amb2.service.consul,16020,1448353564099
>> >> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
>> >> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
>> >> to OPENING for region=1588230740
>> >> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
>> >> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
>> >> encodedName=1588230740
>> >> 2015-11-24 08:26:45,100 INFO [RS_OPEN_META-amb2:16020-0]
>> >> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
>> >> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
>> >> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
>> >> version 0
>> >> 2015-11-24 08:26:45,101 WARN [RS_OPEN_META-amb2:16020-0]
>> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
>> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> >> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
>> >> transition was amb2.node.dc1.consul,16020,1448353564099 not the
>> >> expected amb2.service.consul,16020,1448353564099
>> >>
>> >>
>> >> After I edited resolv.conf and restarted the cluster, it still complains
>> >> about amb2.node.dc1.consul trying to transition the regions instead of
>> >> amb2.service.consul.
>> >>
>> >> 2015-11-24 09:32:26,334 WARN [RS_OPEN_META-amb2:16020-0]
>> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
>> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> >> to RS_ZK_REGION_OPENING failed, the server that tried to transition
>> >> was amb2.node.dc1.consul,16020,1448357534179 not the expected
>> >> amb2.service.consul,16020,1448357534179
>> >> 2015-11-24 09:32:26,335 WARN [RS_OPEN_META-amb2:16020-0]
>> >> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
>> >> to OPENING for region=1588230740
>> >> 2015-11-24 09:32:26,335 WARN [RS_OPEN_META-amb2:16020-0]
>> >> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
>> >> encodedName=1588230740
>> >> 2015-11-24 09:32:26,335 INFO [RS_OPEN_META-amb2:16020-0]
>> >> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
>> >> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
>> >> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
>> >> version 2
>> >> 2015-11-24 09:32:26,336 WARN [RS_OPEN_META-amb2:16020-0]
>> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
>> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> >> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
>> >> transition was amb2.node.dc1.consul,16020,1448357534179 not the
>> >> expected amb2.service.consul,16020,1448357534179
>> >>
>> >>
>> >> [1] http://pastebin.com/z93p8Mdu
>> >>
>> >> On Tue, Nov 24, 2015 at 10:48 AM, Kristoffer Sjögren <[email protected]> wrote:
>> >> > I removed the node.dc1.consul from resolv.conf and restarted the
>> >> > cluster but it still shows up on the master UI.
>> >> >
>> >> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
>> >> > amb2.service.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
>> >> >
>> >> > The logs report [1] that the meta region fails to assign to
>> >> > node.dc1.consul and then tries to assign it to amb2.service.consul and
>> >> > gets stuck in PENDING_OPEN again.
>> >> >
>> >> > ---
>> >> > 1588230740  hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24
>> >> > 09:32:26 UTC 2015 (450s ago),
>> >> > server=amb2.service.consul,16020,1448357534179  450511
>> >> > ---
>> >> >
>> >> > Before I restarted the cluster, the master log [2] complained about
>> >> > not being able to connect to amb2.node.dc1.consul/172.17.0.85:16020.
>> >> >
>> >> > I'm not sure, but somehow it feels as if amb2.node.dc1.consul shadows
>> >> > the real host amb2.service.consul.
>> >> >
>> >> > I was looking into the source code and found the configuration
>> >> > 'hbase.regionserver.hostname' - could that be of help here to remove
>> >> > the node.dc1 host?
>> >> >
>> >> > [1] http://pastebin.com/uZKqK9BJ
>> >> > [2] http://pastebin.com/s10E2rtA
>> >> >
>> >> > On Tue, Nov 24, 2015 at 10:23 AM, Samir Ahmic <[email protected]> wrote:
>> >> >> Hi Kristoffer,
>> >> >> It looks like you have some issue with name resolution. Try to remove
>> >> >> the incorrect value from resolv.conf (node.dc1.consul) and then restart
>> >> >> the hbase cluster.
>> >> >> Regarding the issue with the region in transition, check the master log
>> >> >> for "hbase:meta,,1.1588230740". There should be an exception explaining
>> >> >> why hbase:meta cannot be transitioned from PENDING_OPEN to OPEN state.
>> >> >> If the hbase:meta table is unavailable, the master cannot finish
>> >> >> initialization.
>> >> >>
>> >> >> Regards
>> >> >> Samir
>> >> >>
>> >> >> On Tue, Nov 24, 2015 at 10:11 AM, Kristoffer Sjögren <[email protected]> wrote:
>> >> >>
>> >> >>> Sorry, I should mention that this is HBase 1.1.2.
>> >> >>>
>> >> >>> ZooKeeper only reports one region server.
>> >> >>>
>> >> >>> $ ls /hbase-unsecure/rs
>> >> >>> [amb2.service.consul,16020,1448353564099]
>> >> >>>
>> >> >>>
>> >> >>> On Tue, Nov 24, 2015 at 9:55 AM, Kristoffer Sjögren <[email protected]> wrote:
>> >> >>> > Hi
>> >> >>> >
>> >> >>> > I'm trying to install a HBase cluster with 1 master
>> >> >>> > (amb1.service.consul) and 1 region server (amb2.service.consul) using
>> >> >>> > Ambari on docker containers provided by sequenceiq [1] using a custom
>> >> >>> > blueprint [2].
>> >> >>> >
>> >> >>> > Every component installs correctly except for HBase, which gets stuck
>> >> >>> > with regions in transition:
>> >> >>> >
>> >> >>> > ---
>> >> >>> > hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24 08:26:45
>> >> >>> > UTC 2015 (1098s ago), server=amb2.service.consul,16020,1448353564099
>> >> >>> > ---
>> >> >>> >
>> >> >>> > And for some reason 2 region servers (instead of 1) are discovered by
>> >> >>> > the master with the exact same timestamp but with different hostnames.
>> >> >>> > I'm not sure if this is the reason why the regions get stuck.
>> >> >>> >
>> >> >>> > ----
>> >> >>> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
>> >> >>> > amb2.service.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
>> >> >>> > ----
>> >> >>> >
>> >> >>> > The only place I can find "amb2.node.dc1.consul" on the ambari
>> >> >>> > agent/server hosts is in /etc/resolv.conf which looks like this.
>> >> >>> >
>> >> >>> > ----
>> >> >>> > nameserver 172.17.0.82
>> >> >>> > search service.consul node.dc1.consul
>> >> >>> > ----
>> >> >>> >
>> >> >>> > Is there some way that I can manually tell the master to disregard the
>> >> >>> > "phantom" host amb2.node.dc1.consul?
>> >> >>> >
>> >> >>> > Any help or tips appreciated.
>> >> >>> >
>> >> >>> > Cheers,
>> >> >>> > -Kristoffer
>> >> >>> >
>> >> >>> >
>> >> >>> > [1] https://github.com/sequenceiq/docker-ambari
>> >> >>> > [2] https://gist.githubusercontent.com/krisskross/901ed8223c1ed1db80e3/raw/869327be9ad15e6a9f099a7591323244cd245357/ambari-hdp2.3
