Sorry for the late reply.
Yes, we plan to upgrade to HBase 2 and Phoenix 5 by the end of this year, but
upgrading the entire platform is a bigger effort.
I was able to resolve the problem by (a rough sketch of the commands is below):
1. Creating the secondary index table in HBase if it's missing.
2. Disabling and dropping the indexes that were not in index_state='a'.
3. Rebuilding the dropped indexes if they are needed.
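
For reference, here is roughly what those steps looked like (the table and
index names are the anonymized ones from this thread; the column family and
the indexed/included columns are just placeholders, so treat this as a sketch
rather than the exact commands I ran):

# 1. hbase shell: recreate the missing physical index table
#    ('0' is Phoenix's default column family name)
create 'DE:TABLE_IDX_NEW', '0'

-- 2. sqlline/Phoenix: disable and drop the broken index
ALTER INDEX TABLE_IDX_NEW ON DE."TABLE" DISABLE;
DROP INDEX TABLE_IDX_NEW ON DE."TABLE";

-- 3. only if the index is still needed, recreate it (CREATE INDEX also
--    repopulates it from the data table)
CREATE INDEX TABLE_IDX_NEW ON DE."TABLE" (SOME_COL) INCLUDE (OTHER_COL);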

I used the following SQL to find all the indexes that were not in the
active state (focusing especially on index_state='b' and 'i') by querying
SYSTEM.CATALOG:

SELECT TABLE_NAME, TABLE_SCHEM, TABLE_TYPE, DATA_TABLE_NAME, INDEX_STATE,
       LINK_TYPE, INDEX_TYPE, IS_NAMESPACE_MAPPED, COLUMN_NAME
FROM SYSTEM."CATALOG"
WHERE COLUMN_NAME IS NULL AND TABLE_TYPE = 'i' AND LINK_TYPE IS NULL
  AND INDEX_STATE != 'a' AND TABLE_NAME IS NOT NULL;

We still don't know how we got into this situation with the secondary indexes,
but we are happy that the cluster is not freaking out anymore.
Thanks for all the pointers!

-Anil


On Tue, Jun 9, 2020 at 12:09 AM Sukumar Maddineni <smaddin...@salesforce.com>
wrote:

> Hi Anil,
>
> I think if you create that missing HBase table (index table) with dummy
> metadata (using the hbase shell), then a Phoenix drop index and recreate index
> should work.
>
> Phoenix 4.7 indexing code might have issues related to cross-RPC calls
> between RSs, which can cause ZK connection leaks. I would recommend upgrading
> to 4.14.3 if possible, which has a lot of indexing improvements related to
> consistency and also performance.
>
> --
> Sukumar
>
> On Mon, Jun 8, 2020, 10:15 PM anil gupta <anilgupt...@gmail.com> wrote:
>
>> You were right from the beginning. It is a problem with Phoenix secondary
>> indexes!
>> I tried the 4LW ZK commands after enabling them; they didn't really provide me
>> much extra information.
>> Then I took a heap and thread dump of an RS that was throwing a lot of max
>> connection errors. Most of the RPC handlers were busy with:
>> RpcServer.FifoWFPBQ.default.handler=129,queue=9,port=16020
>> org.apache.phoenix.hbase.index.exception.SingleIndexWriteFailureException @ 0x63d837320
>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException @ 0x41f9a0270
>>
>> Failed 499 actions: Table 'DE:TABLE_IDX_NEW' was not found, got:
>> DE:TABLE_IDX.: 499 times,
>>
>> After finding the above error, I queried the SYSTEM.CATALOG table. DE:TABLE
>> has two secondary indexes but only one secondary index table in
>> HBase (DE:TABLE_IDX). The other table, DE:TABLE_IDX_NEW, is missing from
>> HBase. I am not really sure how this could happen.
>> DE:TABLE_IDX_NEW is listed with index_state='i' in the catalog table. Can you
>> tell me what this means? (incomplete?)
>> Now I am trying to delete the primary table to get rid of the index and then
>> recreate the primary table, since this is a small table, but I am
>> unable to do so via Phoenix. Can you please tell me how I can
>> delete this table? (restart of the cluster? or doing an upsert in the catalog?)
>>
>> Last one: if table_type='u' then it's a user-defined table, and if
>> table_type='i' then it's an index table?
>>
>> Thanks a lot for your help!
>>
>> ~Anil
>>
>> On Wed, Jun 3, 2020 at 8:14 AM Josh Elser <els...@apache.org> wrote:
>>
>>> The RegionServer hosting hbase:meta will certainly have "load" placed
>>> onto it, commensurate to the size of your cluster and the number of
>>> clients you're running. However, this shouldn't be increasing the number
>>> of ZK connections from a RegionServer.
>>>
>>> The RegionServer hosting system.catalog would be unique WRT other
>>> Phoenix-table Regions. I don't recall off of the top of my head if there
>>> is anything specific in the RegionServer code that runs alongside
>>> system.catalog (the MetaDataEndpoint protocol) that reaches out to
>>> ZooKeeper.
>>>
>>> If you're using HDP 2.6.3, I wouldn't be surprised if you're running
>>> into known and fixed issues where ZooKeeper connections are not cleaned
>>> up. That's multiple-years old code.
>>>
>>> netstat and tcpdump aren't really going to tell you anything you don't
>>> already know. From a thread dump or a heap dump, you'll be able to see the
>>> number of ZooKeeper connections from a RegionServer. The 4LW commands
>>> from ZK will be able to tell you which clients (i.e. RegionServers) have
>>> the most connections. These numbers should match (X connections from a
>>> RS to a ZK, and X connections in the Java RS process). The focus would
>>> need to be on what opens a new connection and what is not properly
>>> closing that connection (in every case).
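>>>
>>> (For example, and assuming the 4LW commands are enabled on the ZK hosts,
>>> the per-client connection list can be pulled with the stock "cons" and
>>> "stat" commands, e.g.:
>>>
>>>   echo cons | nc <zk-host> 2181   # one line per open client connection
>>>   echo stat | nc <zk-host> 2181   # summary, including connected clients
>>>
>>> Extracting the client IPs from the "cons" output and running them through
>>> sort | uniq -c gives a quick per-RegionServer count.)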
>>>
>>> On 6/3/20 4:57 AM, anil gupta wrote:
>>> > Thanks for sharing insights. Moving the hbase mailing list to cc.
>>> > Sorry, I forgot to mention that we are using Phoenix 4.7 (HDP 2.6.3). This
>>> > cluster is mostly queried via Phoenix, apart from a few pure NoSQL
>>> > cases that use raw HBase APIs.
>>> >
>>> > I looked further into the ZK logs and found that only 6/15 RS are
>>> > constantly running into max connection problems (no other IPs/hosts of
>>> > our client apps were found). One of those RS is getting 3-4x the
>>> > connection errors compared to the others; this RS is hosting hbase:meta
>>> > <http://ip-10-74-10-228.us-west-2.compute.internal:16030/region.jsp?name=1588230740>,
>>> > regions of Phoenix secondary indexes, and regions of Phoenix and HBase
>>> > tables. I also looked into the other 5 RS that are getting max connection
>>> > errors; nothing really stands out to me since all of them are hosting
>>> > regions of Phoenix secondary indexes and regions of Phoenix and HBase
>>> > tables.
>>> >
>>> > I also tried running netstat and tcpdump on the ZK host to look for
>>> > anomalies but couldn't find anything apart from the above-mentioned
>>> > analysis. I also ran hbck and it reported that things are fine. I am
>>> > still unable to pinpoint the exact problem (maybe something with Phoenix
>>> > secondary indexes?). Any other pointers to further debug the problem
>>> > will be appreciated.
>>> >
>>> > Lastly, I constantly see the following ZK connection loss logs on the
>>> > above-mentioned 6 RS:
>>> > 2020-06-03 06:40:30,859 WARN [RpcServer.FifoWFPBQ.default.handler=123,queue=3,port=16020-SendThread(ip-10-74-0-120.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x0 for server ip-10-74-0-120.us-west-2.compute.internal/10.74.0.120:2181, unexpected error, closing socket connection and attempting reconnect
>>> > java.io.IOException: Connection reset by peer
>>> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> >         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
>>> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
>>> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
>>> > 2020-06-03 06:40:30,861 INFO [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Opening socket connection to server ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181. Will not attempt to authenticate using SASL (unknown error)
>>> > 2020-06-03 06:40:30,861 INFO [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.74.10.228:60012, server: ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181
>>> > 2020-06-03 06:40:30,861 WARN [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)] zookeeper.ClientCnxn: Session 0x0 for server ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181, unexpected error, closing socket connection and attempting reconnect
>>> > java.io.IOException: Connection reset by peer
>>> >         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> >         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> >         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> >         at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>>> >         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>>> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
>>> >         at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
>>> >         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
>>> >
>>> > Thanks!
>>> >
>>> > On Tue, Jun 2, 2020 at 6:57 AM Josh Elser <els...@apache.org> wrote:
>>> >
>>> >     HBase (daemons) try to use a single connection for themselves. A RS
>>> >     also does not need to mutate state in ZK to handle things like gets
>>> >     and puts.
>>> >
>>> >     Phoenix is probably the thing you need to look at more closely
>>> >     (especially if you're using an old version of Phoenix that matches
>>> >     the old HBase 1.1 version). Internally, Phoenix acts like an HBase
>>> >     client which results in a new ZK connection. There have certainly
>>> >     been bugs like that in the past (speaking generally, not
>>> >     specifically).
>>> >
>>> >     On 6/1/20 5:59 PM, anil gupta wrote:
>>> >      > Hi Folks,
>>> >      >
>>> >      > We are running into HBase problems due to hitting the limit of ZK
>>> >      > connections. This cluster is running HBase 1.1.x and ZK 3.4.6.x on
>>> >      > I3en EC2 instance types in AWS. Almost all our RegionServers are
>>> >      > listed in the ZK logs with "Too many connections from /<IP> - max is 60".
>>> >      > 2020-06-01 21:42:08,375 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /<ip> - max is 60
>>> >      >
>>> >      >   On average, each RegionServer has ~250 regions. We are also
>>> >      > running Phoenix on this cluster. Most of the queries are short
>>> >      > range scans, but sometimes we do full table scans too.
>>> >      >
>>> >      >    It seems like one of the simple fixes is to increase the
>>> >      > maxClientCnxns property in zoo.cfg to 300, 500, 700, etc. I will
>>> >      > probably do that. But I am just curious to know in what scenarios
>>> >      > these connections are created/used (Scans/Puts/Deletes or during
>>> >      > other RegionServer operations)? Are these also created by HBase
>>> >      > clients/apps (my guess is NO)? How can I calculate the optimal
>>> >      > value of maxClientCnxns for my cluster/usage?
>>> >      >
>>> >
>>> >
>>> >
>>> > --
>>> > Thanks & Regards,
>>> > Anil Gupta
>>>
>>
>>
>> --
>> Thanks & Regards,
>> Anil Gupta
>>
>

-- 
Thanks & Regards,
Anil Gupta
