You were right from the beginning. It is a problem with the Phoenix secondary
index!
I tried the 4LW ZK commands after enabling them, but they didn't really give
me much extra information.
Then I took a heap dump and a thread dump of an RS that was throwing a lot of
max-connection errors. Most of the RPC handlers were busy with:
RpcServer.FifoWFPBQ.default.handler=129,queue=9,port=16020
org.apache.phoenix.hbase.index.exception.SingleIndexWriteFailureException @ 0x63d837320
org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException @ 0x41f9a0270

Failed 499 actions: Table 'DE:TABLE_IDX_NEW' was not found, got: DE:TABLE_IDX.: 499 times,

After finding the above error, I queried the SYSTEM.CATALOG table. DE:TABLE
has two secondary indexes, but only one secondary index table exists in
HBase (DE:TABLE_IDX); the other one, DE:TABLE_IDX_NEW, is missing from
HBase. I am not really sure how this could happen. DE:TABLE_IDX_NEW is
listed with index_state='i' in the catalog table. Can you tell me what that
means? (Incomplete?)
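
For reference, this is roughly how I am pulling that out of the catalog (a
sketch of the query from memory; the column names are what I recall from the
Phoenix 4.7 SYSTEM.CATALOG schema, so please correct me if any of them are
off):

-- table/index "header" rows for DE:TABLE and its indexes
-- (COLUMN_NAME/COLUMN_FAMILY IS NULL filters out the per-column rows)
SELECT TABLE_SCHEM, TABLE_NAME, TABLE_TYPE, INDEX_STATE, DATA_TABLE_NAME
FROM SYSTEM.CATALOG
WHERE TABLE_SCHEM = 'DE'
  AND (TABLE_NAME = 'TABLE' OR DATA_TABLE_NAME = 'TABLE')
  AND COLUMN_NAME IS NULL
  AND COLUMN_FAMILY IS NULL;
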
Now I am trying to delete the primary table to get rid of the index; since
this is a small table, we can simply recreate it afterwards. However, I am
unable to drop it via Phoenix. Can you please tell me how I can delete this
table? (A restart of the cluster? An upsert in the catalog?)
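
For context, these are roughly the statements I have been trying via Phoenix
(a sketch; assuming namespace mapping is enabled, so DE:TABLE shows up as
DE.TABLE on the Phoenix side):

-- drop the orphaned index first, then the primary table
DROP INDEX IF EXISTS TABLE_IDX_NEW ON DE.TABLE;
DROP TABLE IF EXISTS DE.TABLE;

So far I have not been able to get this to go through.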

Last one: if table_type='u', does that mean it is a user-defined table, and
if table_type='i', an index table?
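
For context, I am just reading those values straight off the catalog,
roughly like this (again only a sketch):

-- one "header" row per table/view/index, grouped by TABLE_TYPE
SELECT TABLE_TYPE, COUNT(*) AS NUM_TABLES
FROM SYSTEM.CATALOG
WHERE COLUMN_NAME IS NULL AND COLUMN_FAMILY IS NULL
GROUP BY TABLE_TYPE;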

Thanks a lot for your help!

~Anil

On Wed, Jun 3, 2020 at 8:14 AM Josh Elser <els...@apache.org> wrote:

> The RegionServer hosting hbase:meta will certainly have "load" placed
> onto it, commensurate to the size of your cluster and the number of
> clients you're running. However, this shouldn't be increasing the amount
> of connections to ZK from a RegionServer.
>
> The RegionServer hosting system.catalog would be unique WRT other
> Phoenix-table Regions. I don't recall off of the top of my head if there
> is anything specific in the RegionServer code that runs alongside
> system.catalog (the MetaDataEndpoint protocol) that reaches out to
> ZooKeeper.
>
> If you're using HDP 2.6.3, I wouldn't be surprised if you're running
> into known and fixed issues where ZooKeeper connections are not cleaned
> up. That's multiple-years old code.
>
> netstat and tcpdump aren't really going to tell you anything you don't
> already know. From a thread dump or a heap dump, you'll be able to see the
> number of ZooKeeper connections from a RegionServer. The 4LW commands
> from ZK will be able to tell you which clients (i.e. RegionServers) have
> the most connections. These numbers should match (X connections from a
> RS to a ZK, and X connections in the Java RS process). The focus would
> need to be on what opens a new connection and what is not properly
> closing that connection (in every case).
>
> On 6/3/20 4:57 AM, anil gupta wrote:
> > Thanks for sharing insights. Moving hbase mailing list to cc.
> > Sorry, forgot to mention that we are using Phoenix 4.7 (HDP 2.6.3). This
> > cluster is mostly being queried via Phoenix, apart from a few pure NoSQL
> > cases that use raw HBase APIs.
> >
> > I looked further into the ZK logs and found that only 6/15 RS are
> > constantly running into max connection problems (no other IPs/hosts of
> > our client apps were found). One of those RS is getting 3-4x the
> > connection errors compared to the others; this RS is hosting hbase:meta
> > <
> http://ip-10-74-10-228.us-west-2.compute.internal:16030/region.jsp?name=1588230740>,
>
> > regions of Phoenix secondary indexes, and regions of Phoenix and HBase
> > tables. I also looked into the other 5 RS that are getting max connection
> > errors; nothing really stands out to me, since all of them are hosting
> > regions of Phoenix secondary indexes and regions of Phoenix and HBase
> > tables.
> >
> > I also tried to run netstat and tcpdump on the ZK host to look for
> > anomalies, but couldn't find anything beyond the analysis above. I also
> > ran hbck and it reported that things are fine. I am still unable to
> > pinpoint the exact problem (maybe something with Phoenix secondary
> > indexes?). Any other pointers to further debug the problem would be
> > appreciated.
> >
> > Lastly, I constantly see the following ZK connection loss logs on the
> > above-mentioned 6 RS:
> > 2020-06-03 06:40:30,859 WARN
> >
>  
> [RpcServer.FifoWFPBQ.default.handler=123,queue=3,port=16020-SendThread(ip-10-74-0-120.us-west-2.compute.internal:2181)]
> zookeeper.ClientCnxn: Session 0x0 for server
> ip-10-74-0-120.us-west-2.compute.internal/10.74.0.120:2181 <
> http://10.74.0.120:2181>, unexpected error, closing socket connection and
> attempting reconnect
> > java.io.IOException: Connection reset by peer
> >          at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >          at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >          at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >          at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> >          at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> >          at
> >
> org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
> >          at
> >
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> >          at
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
> > 2020-06-03 06:40:30,861 INFO
> >
>  
> [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)]
> zookeeper.ClientCnxn: Opening socket connection to server
> ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181 <
> http://10.74.9.182:2181>. Will not attempt to authenticate using SASL
> (unknown error)
> > 2020-06-03 06:40:30,861 INFO
> >
>  
> [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)]
> zookeeper.ClientCnxn: Socket connection established, initiating session,
> client: /10.74.10.228:60012 <http://10.74.10.228:60012>, server:
> ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181 <
> http://10.74.9.182:2181>
> > 2020-06-03 06:40:30,861 WARN
> >
>  
> [RpcServer.FifoWFPBQ.default.handler=137,queue=17,port=16020-SendThread(ip-10-74-9-182.us-west-2.compute.internal:2181)]
> zookeeper.ClientCnxn: Session 0x0 for server
> ip-10-74-9-182.us-west-2.compute.internal/10.74.9.182:2181 <
> http://10.74.9.182:2181>, unexpected error, closing socket connection and
> attempting reconnect
> > java.io.IOException: Connection reset by peer
> >          at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> >          at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> >          at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> >          at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> >          at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> >          at
> >
> org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:68)
> >          at
> >
> org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
> >          at
> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
> >
> > Thanks!
> >
> > On Tue, Jun 2, 2020 at 6:57 AM Josh Elser <els...@apache.org
> > <mailto:els...@apache.org>> wrote:
> >
> >     HBase (daemons) try to use a single connection for themselves. A RS
> >     also
> >     does not need to mutate state in ZK to handle things like gets and
> puts.
> >
> >     Phoenix is probably the thing you need to look at more closely
> >     (especially if you're using an old version of Phoenix that matches
> the
> >     old HBase 1.1 version). Internally, Phoenix acts like an HBase client
> >     which results in a new ZK connection. There have certainly been bugs
> >     like that in the past (speaking generally, not specifically).
> >
> >     On 6/1/20 5:59 PM, anil gupta wrote:
> >      > Hi Folks,
> >      >
> >      > We are running into HBase problems due to hitting the limit of ZK
> >      > connections. This cluster is running HBase 1.1.x and ZK 3.4.6.x
> >     on I3en ec2
> >      > instance type in AWS. Almost all our RegionServers are listed in
> >     zk logs
> >      > with "Too many connections from /<IP> - max is 60".
> >      > 2020-06-01 21:42:08,375 - WARN  [NIOServerCxn.Factory:
> >      > 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193
> >     <http://0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193>] - Too many
> >     connections from
> >      > /<ip> - max is 60
> >      >
> >      >   On average, each RegionServer has ~250 regions. We are also
> >     running
> >      > Phoenix on this cluster. Most of the queries are short range
> >     scans but
> >      > sometimes we are doing full table scans too.
> >      >
> >      >    It seems like one of the simple fixes is to increase the
> >      > maxClientCnxns property in zoo.cfg to 300, 500, 700, etc. I will
> >      > probably do that. But I am just curious to know in what scenarios
> >      > these connections are created/used (scans/puts/deletes, or during
> >      > other RegionServer operations)? Are these also created by HBase
> >      > clients/apps (my guess is no)? How can I calculate the optimal
> >      > value of maxClientCnxns for my cluster/usage?
> >      >
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>


-- 
Thanks & Regards,
Anil Gupta
