Sorry for the late reply. Yes, we plan to upgrade to HBase 2 and Phoenix 5 by the end of this year, but upgrading the entire platform is a bigger effort. I was able to resolve the problem by:
1. Creating the secondary index table in HBase if it's missing.
2. Disabling and dropping the indexes that were not in index_state='a'.
3. Rebuilding the dropped indexes if they are still needed.
A rough sketch of steps 2 and 3 is below.
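For reference, this was roughly the per-index pattern (a sketch only -- the schema, table, index and column names here are placeholders, not our real ones, and the missing physical index table from step 1 was created in the hbase shell first so that the Phoenix DROP could go through):

-- step 2: take the broken index offline and drop its Phoenix metadata
ALTER INDEX MY_IDX ON MY_SCHEMA.MY_TABLE DISABLE;
DROP INDEX IF EXISTS MY_IDX ON MY_SCHEMA.MY_TABLE;
-- step 3: recreate the index only if it is still needed; Phoenix repopulates it
-- from the data table as part of CREATE INDEX
CREATE INDEX MY_IDX ON MY_SCHEMA.MY_TABLE (INDEXED_COL) INCLUDE (COVERED_COL);

On a big table the CREATE INDEX could be run ASYNC and built with the IndexTool MR job instead, but our tables were small enough to build inline.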
To find all the indexes that were not in the active state (especially focusing on index_state='b' and 'i'), I queried SYSTEM.CATALOG with the following SQL:

SELECT TABLE_NAME, TABLE_SCHEM, TABLE_TYPE, DATA_TABLE_NAME, INDEX_STATE, LINK_TYPE, INDEX_TYPE, IS_NAMESPACE_MAPPED, COLUMN_NAME FROM SYSTEM."CATALOG" WHERE COLUMN_NAME IS NULL AND TABLE_TYPE = 'i' AND LINK_TYPE IS NULL AND INDEX_STATE != 'a' AND TABLE_NAME IS NOT NULL

We still don't know how we got into this situation with secondary indexes, but we are happy that the cluster is not freaking out anymore. Thanks for all the pointers!

-Anil

On Tue, Jun 9, 2020 at 12:09 AM Sukumar Maddineni <smaddin...@salesforce.com> wrote:

> Hi Anil,
>
> I think if you create that missing HBase table (index table) with dummy metadata (using the hbase shell), then a Phoenix drop index and recreate index should work.
>
> The Phoenix 4.7 indexing code might have issues related to cross-RS RPC calls, which can cause zk connection leaks. I would recommend upgrading to 4.14.3 if possible, which has a lot of indexing improvements related to consistency and also performance.
>
> --
> Sukumar
>
> On Mon, Jun 8, 2020, 10:15 PM anil gupta <anilgupt...@gmail.com> wrote:
>
>> You were right from the beginning. It is a problem with Phoenix secondary indexes!
>> I tried the 4LW zk commands after enabling them; they didn't really provide much extra information.
>> Then I took a heap and thread dump of a RS that was throwing a lot of max connection errors. Most of the RPC handlers were busy with:
>>
>> RpcServer.FifoWFPBQ.default.handler=129,queue=9,port=16020
>> org.apache.phoenix.hbase.index.exception.SingleIndexWriteFailureException @ 0x63d837320
>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException @ 0x41f9a0270
>> Failed 499 actions: Table 'DE:TABLE_IDX_NEW' was not found, got: DE:TABLE_IDX.: 499 times,
>>
>> After finding the above error, I queried the SYSTEM.CATALOG table. DE:TABLE has two secondary indexes but only one secondary index table in hbase (DE:TABLE_IDX). The other table, DE:TABLE_IDX_NEW, is missing from hbase. I am not really sure how this would happen.
>> DE:TABLE_IDX_NEW is listed with index_state='i' in the catalog table. Can you tell me what this means? (incomplete?)
>> Now I am trying to delete the primary table to get rid of the index, and then we can recreate the primary table since it is a small table, but I am unable to do so via Phoenix. Can you please tell me how I can delete this table? (restart of cluster? or doing an upsert in catalog?)
>>
>> Last one: if table_type='u' then it's a user-defined table, and if table_type='i' then it's an index table?
>>
>> Thanks a lot for your help!
>>
>> ~Anil
>>
>> On Wed, Jun 3, 2020 at 8:14 AM Josh Elser <els...@apache.org> wrote:
>>
>>> The RegionServer hosting hbase:meta will certainly have "load" placed onto it, commensurate with the size of your cluster and the number of clients you're running. However, this shouldn't be increasing the number of connections to ZK from a RegionServer.
>>>
>>> The RegionServer hosting system.catalog would be unique WRT other Phoenix-table Regions. I don't recall off the top of my head if there is anything specific in the RegionServer code that runs alongside system.catalog (the MetaDataEndpoint protocol) that reaches out to ZooKeeper.
>>>
>>> If you're using HDP 2.6.3, I wouldn't be surprised if you're running into known and fixed issues where ZooKeeper connections are not cleaned up.
>>> That's multiple-years-old code.
>>>
>>> netstat and tcpdump aren't really going to tell you anything you don't already know. From a thread dump or a heap dump, you'll be able to see the number of ZooKeeper connections from a RegionServer. The 4LW commands from ZK will be able to tell you which clients (i.e. RegionServers) have the most connections. These numbers should match (X connections from a RS to a ZK, and X connections in the Java RS process). The focus would need to be on what opens a new connection and what is not properly closing that connection (in every case).
>>>
>>> On 6/3/20 4:57 AM, anil gupta wrote:
>>> > Thanks for sharing insights. Moving the hbase mailing list to cc.
>>> > Sorry, forgot to mention that we are using Phoenix 4.7 (HDP 2.6.3). This cluster is mostly being queried via Phoenix, apart from a few pure NoSQL cases that use raw HBase APIs.
>>> >
>>> > I looked further into the zk logs and found that only 6/15 RS are constantly running into max connection problems (no other IPs/hosts of our client apps were found). One of those RS is getting 3-4x the connection errors compared to the others; this RS is hosting hbase:meta <http://ip-10-74-10-228.us-west-2.compute.internal:16030/region.jsp?name=1588230740>, regions of Phoenix secondary indexes, and regions of Phoenix and HBase tables. I also looked into the other 5 RS that are getting max connection errors; nothing really stands out to me since all of them are hosting regions of Phoenix secondary indexes and regions of Phoenix and HBase tables.
>>> >
>>> > I also tried running netstat and tcpdump on the zk host to find an anomaly, but couldn't find anything apart from the above-mentioned analysis. Also ran hbck and it reported that things are fine. I am still unable to pinpoint the exact problem (maybe something with Phoenix secondary indexes?). Any other pointers to further debug the problem will be appreciated.
>>> >
>>> > Lastly, I constantly see the following zk connection loss logs in the above-mentioned 6 RS:
>>> > 2020-06-03 06:40:30,859 WARN [RpcServer.FifoWFPBQ.default.handler=123,queue=3,port=16020-SendThread(ip-10-74-0-120.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x0 for server ip-10-74-0-120.us-west-2.compute.internal/10.74.0.120:2181, unexpected error, closing socket connection and attempting reconnect
>>> > java.io.IOException: Connection reset by peer
>>> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> >         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
>>> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
>>> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
>>> > 2020-06-03 06:40:30,861 INFO [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181.
>>> > Will not attempt to authenticate using SASL (unknown error)
>>> > 2020-06-03 06:40:30,861 INFO [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.74.10.228:60012, server: ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181
>>> > 2020-06-03 06:40:30,861 WARN [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x0 for server ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181, unexpected error, closing socket connection and attempting reconnect
>>> > java.io.IOException: Connection reset by peer
>>> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> >         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
>>> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
>>> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
>>> >
>>> > Thanks!
>>> >
>>> > On Tue, Jun 2, 2020 at 6:57 AM Josh Elser <els...@apache.org> wrote:
>>> >
>>> > HBase (daemons) try to use a single connection for themselves. A RS also does not need to mutate state in ZK to handle things like gets and puts.
>>> >
>>> > Phoenix is probably the thing you need to look at more closely (especially if you're using an old version of Phoenix that matches the old HBase 1.1 version). Internally, Phoenix acts like an HBase client, which results in a new ZK connection. There have certainly been bugs like that in the past (speaking generally, not specifically).
>>> >
>>> > On 6/1/20 5:59 PM, anil gupta wrote:
>>> > > Hi Folks,
>>> > >
>>> > > We are running into HBase problems due to hitting the limit of ZK connections. This cluster is running HBase 1.1.x and ZK 3.4.6.x on the I3en EC2 instance type in AWS. Almost all our Region Servers are listed in the zk logs with "Too many connections from /<IP> - max is 60".
>>> > > 2020-06-01 21:42:08,375 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /<ip> - max is 60
>>> > >
>>> > > On average, each RegionServer has ~250 regions. We are also running Phoenix on this cluster. Most of the queries are short range scans but sometimes we are doing full table scans too.
>>> > >
>>> > > It seems like one of the simple fixes is to increase the maxClientCnxns property in zoo.cfg to 300, 500, 700, etc. I will probably do that. But I am just curious to know in what scenarios these connections are created/used (Scans/Puts/Deletes or during other RegionServer operations)? Are these also created by hbase clients/apps (my guess is NO)? How can I calculate an optimal value of maxClientCnxns for my cluster/usage?
>>> > >
>>> >
>>> >
>>> > --
>>> > Thanks & Regards,
>>> > Anil Gupta
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>

--
Thanks & Regards,
Anil Gupta