You were right from the beginning. It is a problem with the Phoenix secondary index! I tried the 4LW ZK commands after enabling them, but they didn't really provide much extra information. Then I took a heap and thread dump of an RS that was throwing a lot of max connection errors. Most of the RPC handlers were busy with:

*RpcServer.FifoWFPBQ.default.handler=129,queue=9,port=16020*
*org.apache.phoenix.hbase.index.exception.SingleIndexWriteFailureException @ 0x63d837320*
*org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException @ 0x41f9a0270*
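For reference, this is roughly how I then inspected the index metadata (sketching from memory, so the exact column list may not match what I ran; these are standard SYSTEM.CATALOG columns in Phoenix 4.x):

```sql
-- List the tables and indexes Phoenix knows about in the DE schema,
-- with their type and index state codes.
SELECT TABLE_SCHEM, TABLE_NAME, TABLE_TYPE, INDEX_STATE, DATA_TABLE_NAME
FROM SYSTEM.CATALOG
WHERE TABLE_SCHEM = 'DE'
  AND TABLE_TYPE IS NOT NULL;
```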
*Failed 499 actions: Table 'DE:TABLE_IDX_NEW' was not found, got: DE:TABLE_IDX.: 499 times*

After finding the above error, I queried the SYSTEM.CATALOG table. DE:TABLE has two secondary indexes but only one secondary index table in HBase (DE:TABLE_IDX); the other one, DE:TABLE_IDX_NEW, is missing from HBase. I am not really sure how this could happen. DE:TABLE_IDX_NEW is listed with index_state='i' in the catalog table. Can you tell me what this means? (Incomplete?)

Now I am trying to delete the primary table to get rid of the index, and then we can recreate the primary table since it is a small table, but I am unable to do so via Phoenix. Can you please tell me how I can *delete this table? (A restart of the cluster? Or doing an upsert in the catalog?)*

*Last one: if table_type='u' then it's a user-defined table, and if table_type='i' then it's an index table?*

*Thanks a lot for your help!*
*~Anil*

On Wed, Jun 3, 2020 at 8:14 AM Josh Elser <els...@apache.org> wrote:

> The RegionServer hosting hbase:meta will certainly have "load" placed
> onto it, commensurate to the size of your cluster and the number of
> clients you're running. However, this shouldn't be increasing the number
> of connections to ZK from a RegionServer.
>
> The RegionServer hosting system.catalog would be unique WRT other
> Phoenix-table Regions. I don't recall off the top of my head if there
> is anything specific in the RegionServer code that runs alongside
> system.catalog (the MetaDataEndpoint protocol) that reaches out to
> ZooKeeper.
>
> If you're using HDP 2.6.3, I wouldn't be surprised if you're running
> into known and fixed issues where ZooKeeper connections are not cleaned
> up. That's multiple-years-old code.
>
> netstat and tcpdump aren't really going to tell you anything you don't
> already know. From a thread dump or a heap dump, you'll be able to see the
> number of ZooKeeper connections from a RegionServer. The 4LW commands
> from ZK will be able to tell you which clients (i.e.
> RegionServers) have the most connections. These numbers should match (X
> connections from a RS to a ZK, and X connections in the Java RS process).
> The focus would need to be on what opens a new connection and what is not
> properly closing that connection (in every case).
>
> On 6/3/20 4:57 AM, anil gupta wrote:
> > Thanks for sharing insights. Moving the hbase mailing list to cc.
> > Sorry, I forgot to mention that we are using Phoenix 4.7 (HDP 2.6.3). This
> > cluster is mostly queried via Phoenix, apart from a few pure NoSQL use
> > cases that use raw HBase APIs.
> >
> > I looked further into the ZK logs and found that only 6/15 RS are
> > constantly running into max connection problems (no other IPs/hosts of
> > our client apps were found). One of those RS is getting 3-4x the
> > connection errors compared to the others; this RS is hosting hbase:meta
> > <http://ip-10-74-10-228.us-west-2.compute.internal:16030/region.jsp?name=1588230740>,
> > regions of Phoenix secondary indexes, and regions of Phoenix and HBase
> > tables. I also looked into the other 5 RS that are getting max connection
> > errors; nothing really stands out to me, since all of them are also
> > hosting regions of Phoenix secondary indexes and regions of Phoenix and
> > HBase tables.
> >
> > I also ran netstat and tcpdump on the ZK host to look for anomalies, but
> > couldn't find anything beyond the analysis above. I also ran hbck and it
> > reported that things are fine. I am still unable to pinpoint the exact
> > problem (maybe something with Phoenix secondary indexes?). Any other
> > pointers to further debug the problem would be appreciated.
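To make the matching Josh describes concrete, here is a sketch of counting the two sides. The thread-dump and `cons` excerpts below are fabricated samples just to show the parsing; on a live cluster you would pipe in `jstack <rs-pid>` and `echo cons | nc <zk-host> 2181` instead of the sample variables:

```shell
# Sample lines as they might appear in a RegionServer thread dump; each live
# ZooKeeper client connection owns one SendThread.
dump='"main-SendThread(ip-10-74-0-120:2181)" daemon prio=5
"RpcServer.FifoWFPBQ.default.handler=123,queue=3,port=16020-SendThread(ip-10-74-0-120:2181)" daemon prio=5
"RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182:2181)" daemon prio=5'

# Count ZK connections held inside the RS JVM
# (live version: jstack <rs-pid> | grep -c SendThread)
echo "$dump" | grep -c 'SendThread'   # -> 3

# Sample of the ZooKeeper 4LW `cons` output
# (live version: echo cons | nc <zk-host> 2181)
cons='/10.74.10.228:60012[1](queued=0,recved=52,sent=52)
/10.74.10.228:60044[1](queued=0,recved=10,sent=10)
/10.74.0.120:41000[1](queued=0,recved=3,sent=3)'

# Count connections per client IP, busiest first; these per-IP counts should
# line up with the SendThread counts taken on the matching RegionServers.
echo "$cons" | awk -F'[/:]' '{print $2}' | sort | uniq -c | sort -rn
```

If the two numbers diverge for a given RS, something is leaking half-open connections on one side; if they agree and keep growing, something in the RS process keeps opening new sessions without closing old ones.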
> >
> > Lastly, I constantly see the following ZK connection-loss logs on the
> > above-mentioned 6 RS:
> > /2020-06-03 06:40:30,859 WARN [RpcServer.FifoWFPBQ.default.handler=123,queue=3,port=16020-SendThread(ip-10-74-0-120.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x0 for server ip-10-74-0-120.us-west-2.compute.internal/10.74.0.120:2181, unexpected error, closing socket connection and attempting reconnect
> > java.io.IOException: Connection reset by peer
> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
> > 2020-06-03 06:40:30,861 INFO [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181. Will not attempt to authenticate using SASL (unknown error)
> > 2020-06-03 06:40:30,861 INFO [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.74.10.228:60012, server: ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181
> > 2020-06-03 06:40:30,861 WARN [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x0 for server ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181, unexpected error, closing socket connection and attempting reconnect
> > java.io.IOException: Connection reset by peer
> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)/
> >
> > Thanks!
> >
> > On Tue, Jun 2, 2020 at 6:57 AM Josh Elser <els...@apache.org> wrote:
> >
> > HBase (daemons) try to use a single connection for themselves. A RS also
> > does not need to mutate state in ZK to handle things like gets and puts.
> >
> > Phoenix is probably the thing you need to look at more closely
> > (especially if you're using an old version of Phoenix that matches the
> > old HBase 1.1 version). Internally, Phoenix acts like an HBase client,
> > which results in a new ZK connection.
> > There have certainly been bugs like that in the past (speaking
> > generally, not specifically).
> >
> > On 6/1/20 5:59 PM, anil gupta wrote:
> > > Hi Folks,
> > >
> > > We are running into HBase problems due to hitting the limit of ZK
> > > connections. This cluster is running HBase 1.1.x and ZK 3.4.6.x on i3en
> > > EC2 instances in AWS. Almost all our RegionServers are listed in the ZK
> > > logs with "Too many connections from /<IP> - max is 60":
> > > 2020-06-01 21:42:08,375 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /<ip> - max is 60
> > >
> > > On average, each RegionServer has ~250 regions. We are also running
> > > Phoenix on this cluster. Most of the queries are short range scans, but
> > > sometimes we do full table scans too.
> > >
> > > It seems like one simple fix is to increase the maxClientCnxns property
> > > in zoo.cfg to 300, 500, 700, etc. I will probably do that. But I am
> > > just curious to know: in what scenarios are these connections
> > > created/used (scans/puts/deletes, or other RegionServer operations)?
> > > Are they also created by HBase clients/apps (my guess is no)? How can I
> > > calculate the optimal value of maxClientCnxns for my cluster/usage?
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>
--
Thanks & Regards,
Anil Gupta
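P.S. For anyone searching the archives: the knob discussed above lives in zoo.cfg, something like the fragment below (the value 300 is just the first step mentioned in the thread, not a recommendation):

```
# zoo.cfg -- per-client-IP connection cap (0 disables the limit).
# Raising it masks connection leaks rather than fixing them, so treat
# this as a mitigation while the real source of connections is found.
maxClientCnxns=300
```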