There is only one network interface on each machine. The ping results are interesting: each machine resolves itself as *.service.consul, but resolves the other machine as *.node.dc1.consul.
amb1.service.consul /etc/hosts

172.17.0.89 amb1.service.consul amb1
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

amb2.service.consul /etc/hosts

172.17.0.90 amb2.service.consul amb2
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

ping amb1 from amb1.service.consul

PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
64 bytes from amb1.service.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.059 ms

ping amb2 from amb1.service.consul

PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
64 bytes from amb2.node.dc1.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.069 ms

ping amb1 from amb2.service.consul

PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
64 bytes from amb1.node.dc1.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.070 ms

ping amb2 from amb2.service.consul

PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
64 bytes from amb2.service.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.054 ms

On Tue, Nov 24, 2015 at 11:58 AM, Samir Ahmic <[email protected]> wrote:
> As I can see from the logs, you also have an issue with connecting to ZK.
> The configuration points to the correct server, but name resolution
> produces wrong values. Do you have multiple network interfaces on the
> servers? What does ping $HOSTNAME return? What do you have in the
> /etc/hosts file? Do you have a local nameserver running on the servers?
>
> Regards
> Samir
> On Nov 24, 2015 11:21 AM, "Kristoffer Sjögren" <[email protected]> wrote:
>
>> The logs on the region server [1] are also quite interesting.
>>
>> Before I restarted the cluster, the region server complained that
>> amb2.node.dc1.consul hijacked the regions from amb2.service.consul.
>>
>> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
>> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
>> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> to RS_ZK_REGION_OPENING failed, the server that tried to transition
>> was amb2.node.dc1.consul,16020,1448353564099 not the expected
>> amb2.service.consul,16020,1448353564099
>> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
>> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
>> to OPENING for region=1588230740
>> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
>> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
>> encodedName=1588230740
>> 2015-11-24 08:26:45,100 INFO [RS_OPEN_META-amb2:16020-0]
>> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
>> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
>> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
>> version 0
>> 2015-11-24 08:26:45,101 WARN [RS_OPEN_META-amb2:16020-0]
>> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
>> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
>> transition was amb2.node.dc1.consul,16020,1448353564099 not the
>> expected amb2.service.consul,16020,1448353564099
>>
>>
>> After editing resolv.conf and restarting the cluster, it still complains
>> about amb2.node.dc1.consul trying to transition the regions instead of
>> amb2.service.consul.
>>
>> 2015-11-24 09:32:26,334 WARN [RS_OPEN_META-amb2:16020-0]
>> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
>> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> to RS_ZK_REGION_OPENING failed, the server that tried to transition
>> was amb2.node.dc1.consul,16020,1448357534179 not the expected
>> amb2.service.consul,16020,1448357534179
>> 2015-11-24 09:32:26,335 WARN [RS_OPEN_META-amb2:16020-0]
>> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
>> to OPENING for region=1588230740
>> 2015-11-24 09:32:26,335 WARN [RS_OPEN_META-amb2:16020-0]
>> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
>> encodedName=1588230740
>> 2015-11-24 09:32:26,335 INFO [RS_OPEN_META-amb2:16020-0]
>> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
>> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
>> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
>> version 2
>> 2015-11-24 09:32:26,336 WARN [RS_OPEN_META-amb2:16020-0]
>> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
>> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
>> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
>> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
>> transition was amb2.node.dc1.consul,16020,1448357534179 not the
>> expected amb2.service.consul,16020,1448357534179
>>
>>
>> [1] http://pastebin.com/z93p8Mdu
>>
>> On Tue, Nov 24, 2015 at 10:48 AM, Kristoffer Sjögren <[email protected]>
>> wrote:
>> > I removed node.dc1.consul from resolv.conf and restarted the
>> > cluster, but it still shows up in the master UI.
>> >
>> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
>> > amb2.service.consul,16020,1448353564099   Tue Nov 24 08:26:04 UTC 2015  0  0
>> >
>> > The logs report [1] that the meta region fails to assign to
>> > amb2.node.dc1.consul, and then it tries to assign to amb2.service.consul
>> > and gets stuck in PENDING_OPEN again.
>> >
>> > ---
>> > 1588230740 hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24
>> > 09:32:26 UTC 2015 (450s ago),
>> > server=amb2.service.consul,16020,1448357534179 450511
>> > ---
>> >
>> > Before I restarted the cluster, the master log [2] complained about
>> > not being able to connect to amb2.node.dc1.consul/172.17.0.85:16020.
>> >
>> > I'm not sure, but somehow it feels as if amb2.node.dc1.consul shadows
>> > the real host amb2.service.consul.
>> >
>> > I was looking into the source code and found the configuration
>> > 'hbase.regionserver.hostname' - could that be of help here to remove
>> > the node.dc1 host?
>> >
>> > [1] http://pastebin.com/uZKqK9BJ
>> > [2] http://pastebin.com/s10E2rtA
>> >
>> > On Tue, Nov 24, 2015 at 10:23 AM, Samir Ahmic <[email protected]>
>> wrote:
>> >> Hi Kristoffer,
>> >> It looks like you have some issue with name resolution. Try to remove
>> >> the incorrect value (node.dc1.consul) from resolv.conf and then restart
>> >> the HBase cluster.
>> >> Regarding the issue with the region in transition, check the master log
>> >> for "hbase:meta,,1.1588230740";
>> >> there should be an exception explaining why hbase:meta cannot transition
>> >> from PENDING_OPEN to OPEN state. If the hbase:meta table is unavailable,
>> >> the master cannot finish initialization.
>> >>
>> >> Regards
>> >> Samir
>> >>
>> >> On Tue, Nov 24, 2015 at 10:11 AM, Kristoffer Sjögren <[email protected]>
>> >> wrote:
>> >>
>> >>> Sorry, I should mention that this is HBase 1.1.2.
>> >>>
>> >>> Zookeeper only reports one region server.
>> >>>
>> >>> $ ls /hbase-unsecure/rs
>> >>> [amb2.service.consul,16020,1448353564099]
>> >>>
>> >>>
>> >>> On Tue, Nov 24, 2015 at 9:55 AM, Kristoffer Sjögren <[email protected]>
>> >>> wrote:
>> >>> > Hi
>> >>> >
>> >>> > I'm trying to install an HBase cluster with 1 master
>> >>> > (amb1.service.consul) and 1 region server (amb2.service.consul) using
>> >>> > Ambari on docker containers provided by sequenceiq [1] using a custom
>> >>> > blueprint [2].
>> >>> >
>> >>> > Every component installs correctly except for HBase, which gets stuck
>> >>> > with regions in transition:
>> >>> >
>> >>> > ---
>> >>> > hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24 08:26:45
>> >>> > UTC 2015 (1098s ago), server=amb2.service.consul,16020,1448353564099
>> >>> > ---
>> >>> >
>> >>> > And for some reason 2 region servers (instead of 1) are discovered by
>> >>> > the master, with the exact same timestamp but with different
>> >>> > hostnames. I'm not sure if this is the reason why the regions get
>> >>> > stuck.
>> >>> >
>> >>> > ----
>> >>> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
>> >>> > amb2.service.consul,16020,1448353564099   Tue Nov 24 08:26:04 UTC 2015  0  0
>> >>> > ----
>> >>> >
>> >>> > The only place I can find "amb2.node.dc1.consul" on the ambari
>> >>> > agent/server hosts is in /etc/resolv.conf, which looks like this:
>> >>> >
>> >>> > ----
>> >>> > nameserver 172.17.0.82
>> >>> > search service.consul node.dc1.consul
>> >>> > ----
>> >>> >
>> >>> > Is there some way that I can manually tell the master to disregard
>> >>> > the "phantom" host amb2.node.dc1.consul?
>> >>> >
>> >>> > Any help or tips appreciated.
>> >>> >
>> >>> > Cheers,
>> >>> > -Kristoffer
>> >>> >
>> >>> >
>> >>> > [1] https://github.com/sequenceiq/docker-ambari
>> >>> > [2] https://gist.githubusercontent.com/krisskross/901ed8223c1ed1db80e3/raw/869327be9ad15e6a9f099a7591323244cd245357/ambari-hdp2.3
>> >>>
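[Editor's note: the `hbase.regionserver.hostname` property raised in this thread does exist in HBase 1.1.x (added by HBASE-12954) and lets a region server register with a fixed hostname instead of whatever reverse DNS returns. A minimal hbase-site.xml sketch, assuming it is set on the amb2 region server from this thread; this is an illustration, not a confirmed fix for the reported issue:]

```xml
<!-- hbase-site.xml on the region server host (amb2) -->
<!-- Pins the name this region server registers with the master and ZK,
     so reverse-DNS results like amb2.node.dc1.consul are not used. -->
<property>
  <name>hbase.regionserver.hostname</name>
  <value>amb2.service.consul</value>
</property>
```

[After adding the property, the region server would need a restart; the registered name should then match the `amb2.service.consul,16020,...` entry that ZooKeeper already shows under /hbase-unsecure/rs.]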
