Re: Too many connections from / - max is 60

Josh Elser Wed, 03 Jun 2020 08:15:03 -0700

The RegionServer hosting hbase:meta will certainly have "load" placed onto it, commensurate to the size of your cluster and the number of clients you're running. However, this shouldn't be increasing the number of connections to ZK from a RegionServer.

The RegionServer hosting system.catalog would be unique WRT other Phoenix-table Regions. I don't recall off the top of my head if there is anything specific in the RegionServer code that runs alongside system.catalog (the MetaDataEndpoint protocol) that reaches out to ZooKeeper.

If you're using HDP 2.6.3, I wouldn't be surprised if you're running into known and fixed issues where ZooKeeper connections are not cleaned up. That's multiple-years-old code.

netstat and tcpdump aren't really going to tell you anything you don't already know. From a thread dump or a heap dump, you'll be able to see the number of ZooKeeper connections from a RegionServer. The 4LW commands from ZK will be able to tell you which clients (i.e. RegionServers) have the most connections. These numbers should match (X connections from a RS to a ZK, and X connections in the Java RS process). The focus would need to be on what opens a new connection and what is not properly closing that connection (in every case).
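
As a rough illustration of that comparison, here is a minimal Python sketch. It sends the `cons` four-letter command to one ZooKeeper server, tallies connections per client IP, and separately counts ZooKeeper client SendThreads in a saved thread dump (e.g. jstack output). The function names are made up for this example, and the `cons` line format assumed by the regex (` /ip:port[ops](queued=...,...)`) is modeled on ZK 3.4 output, so verify it against what your servers actually print. Newer ZooKeeper releases may also require four-letter commands to be whitelisted.

```python
import re
import socket
from collections import Counter

# Assumed shape of a `cons` line (ZK 3.4-style), e.g.
#  /10.74.10.228:60012[1](queued=0,recved=1,sent=0)
_CLIENT_RE = re.compile(r"/(\d{1,3}(?:\.\d{1,3}){3}):\d+\[")


def four_letter_word(host, port=2181, cmd="cons", timeout=5.0):
    """Send a four-letter command to one ZooKeeper server and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the socket after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def connections_per_client(cons_output):
    """Tally open connections per client IPv4 address in `cons` output."""
    counts = Counter()
    for line in cons_output.splitlines():
        match = _CLIENT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def zk_client_threads(thread_dump):
    """Count thread-header lines in a thread dump that name a ZK SendThread."""
    return sum(
        1
        for line in thread_dump.splitlines()
        if line.startswith('"') and "-SendThread(" in line
    )
```

For example, `connections_per_client(four_letter_word("zk-host"))` (with `zk-host` standing in for one of your ZooKeeper servers) gives the per-client tally for that server only, and it will include the short-lived connection made by the script itself. Each server reports only its own connections, so sum across the ensemble before comparing against the SendThread count from a RegionServer's thread dump. The `-SendThread(` name pattern is the one visible in the RegionServer log excerpts in this thread.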


On 6/3/20 4:57 AM, anil gupta wrote:

Thanks for sharing insights. Moving hbase mailing list to cc.
Sorry, forgot to mention that we are using Phoenix 4.7 (HDP 2.6.3). This cluster is mostly queried via Phoenix, apart from a few pure NoSQL cases that use the raw HBase APIs.
I looked further into the ZK logs and found that only 6 of 15 RS are constantly running into max connection problems (no other IPs/hosts of our client apps were found). One of those RS is getting 3-4x the connection errors compared to the others; this RS is hosting hbase:meta <http://ip-10-74-10-228.us-west-2.compute.internal:16030/region.jsp?name=1588230740>, regions of Phoenix secondary indexes, and regions of Phoenix and HBase tables. I also looked into the other 5 RS that are getting max connection errors; nothing really stands out to me, since all of them are hosting regions of Phoenix secondary indexes and regions of Phoenix and HBase tables.
I also tried running netstat and tcpdump on the ZK host to find anomalies, but couldn't find anything beyond the analysis above. I also ran hbck, and it reported that things are fine. I am still unable to pinpoint the exact problem (maybe something with Phoenix secondary indexes?). Any other pointers to further debug the problem would be appreciated.
Lastly, I constantly see the following ZK connection loss logs on the above-mentioned 6 RS:

2020-06-03 06:40:30,859 WARN [RpcServer.FifoWFPBQ.default.handler=123,queue=3,port=16020-SendThread(ip-10-74-0-120.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x0 for server ip-10-74-0-120.us-west-2.compute.internal/10.74.0.120:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2020-06-03 06:40:30,861 INFO [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181. Will not attempt to authenticate using SASL (unknown error)
2020-06-03 06:40:30,861 INFO [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.74.10.228:60012, server: ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181
2020-06-03 06:40:30,861 WARN [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x0 for server ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)

Thanks!
On Tue, Jun 2, 2020 at 6:57 AM Josh Elser <els...@apache.org> wrote:
    HBase (daemons) try to use a single connection for themselves. A RS also
    does not need to mutate state in ZK to handle things like gets and puts.

    Phoenix is probably the thing you need to look at more closely
    (especially if you're using an old version of Phoenix that matches the
    old HBase 1.1 version). Internally, Phoenix acts like an HBase client
    which results in a new ZK connection. There have certainly been bugs
    like that in the past (speaking generally, not specifically).

    On 6/1/20 5:59 PM, anil gupta wrote:
     > Hi Folks,
     >
     > We are running into HBase problems due to hitting the limit of ZK
     > connections. This cluster is running HBase 1.1.x and ZK 3.4.6.x on the
     > I3en EC2 instance type in AWS. Almost all our RegionServers are listed
     > in the ZK logs with "Too many connections from /<IP> - max is 60".
     > 2020-06-01 21:42:08,375 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /<ip> - max is 60
     >
     > On average, each RegionServer has ~250 regions. We are also running
     > Phoenix on this cluster. Most of the queries are short range scans, but
     > sometimes we are doing full table scans too.
     >
     > It seems like one simple fix is to increase the maxClientCnxns
     > property in zoo.cfg to 300, 500, 700, etc. I will probably do that. But I
     > am just curious to know in what scenarios these connections are
     > created/used (Scans/Puts/Deletes, or during other RegionServer operations)?
     > Are these also created by HBase clients/apps (my guess is NO)? How can I
     > calculate the optimal value of maxClientCnxns for my cluster/usage?
     >



--
Thanks & Regards,
Anil Gupta
