[ https://issues.apache.org/jira/browse/HIVE-21265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Istvan Fajth updated HIVE-21265:
--------------------------------
    Issue Type: Improvement  (was: Bug)

> Hive miss-uses HBase HConnection object and that puts high load on Zookeeper
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-21265
>                 URL: https://issues.apache.org/jira/browse/HIVE-21265
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: Istvan Fajth
>            Priority: Major
>
> When a Hive table is backed by an HBase table, the following access pattern
> shows up multiple times in Zookeeper, even for a simple query like
> "SELECT * FROM table":
> - A client connects to Zookeeper
> - It checks whether the /hbase ZNode exists
> - It reads /hbase/hbaseid
> - The client closes the connection.
>
> The number of these accesses depends on the amount of data; most likely it
> correlates with the number of HBase regions.
>
> One can see the same access pattern in ZK when running the following Java
> code:
> {code}
> import org.apache.hadoop.hbase.client.*;
>
> public class Test {
>   public static void main(String[] args) throws Exception {
>     // Each create/close pair shows up in ZK as one short-lived session.
>     Connection c = ConnectionFactory.createConnection();
>     c.close();
>   }
> }
> {code}
>
> The problem with this is that for large tables it creates an enormous number
> of sessions, and session creation is expensive in ZK. If the number of
> queries against such a table is high, the ZK transaction log is written
> heavily, and far more snapshots are created than otherwise, because of the
> number of createSession and closeSession transactions in Zookeeper. In this
> particular case about 24GB of data was written to the Zookeeper data
> directory, which nearly filled the device holding it. ~90% of the data
> written was createSession and closeSession transactions.
>
> I am not sure what logs I should provide, but reproducing the behaviour is
> easy enough.
> In Zookeeper, if one enables DEBUG level logging, the logs show what is being
> read by the sessions. These sessions live for 1-5 ms at most.
>
> I imagine the solution is to somehow share the connection object between the
> mappers if possible: use one connection, as suggested in the API
> documentation of ConnectionFactory, and request table/admin/any object from
> that one connection. Failing that, use at least only one connection object
> per map/reduce, and make it a longer-lived connection that stays open for the
> whole map/reduce lifetime.
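
To make the proposed direction concrete, here is a minimal sketch of the
"one long-lived connection per task" idea. It is not Hive or HBase code:
SimpleConnection, FakeConnection and SharedConnection are hypothetical
stand-ins invented for illustration, so that the example runs without an
HBase cluster. In a real fix the factory passed to SharedConnection would
presumably wrap ConnectionFactory.createConnection(), and each opened
connection would correspond to one Zookeeper session.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class SharedConnectionSketch {

    /** Hypothetical stand-in for an HBase Connection (assumption, not the real API). */
    interface SimpleConnection extends AutoCloseable {
        String read(String row);
        @Override void close(); // no checked exception, to keep the sketch short
    }

    /** Fake connection that counts how many times a "session" was opened. */
    static final class FakeConnection implements SimpleConnection {
        static final AtomicInteger OPENED = new AtomicInteger();
        private boolean closed;

        FakeConnection() { OPENED.incrementAndGet(); }

        @Override public String read(String row) {
            if (closed) throw new IllegalStateException("connection already closed");
            return "value-of-" + row;
        }

        @Override public void close() { closed = true; }
    }

    /** Lazily creates one connection and hands the same instance to every caller. */
    static final class SharedConnection implements AutoCloseable {
        private final Supplier<SimpleConnection> factory;
        private SimpleConnection connection;

        SharedConnection(Supplier<SimpleConnection> factory) { this.factory = factory; }

        synchronized SimpleConnection get() {
            if (connection == null) connection = factory.get();
            return connection;
        }

        @Override public synchronized void close() {
            if (connection != null) {
                connection.close();
                connection = null;
            }
        }
    }

    /** The pattern described in the report: open and close a connection per region. */
    public static int connectionPerRegion(int regions) {
        int before = FakeConnection.OPENED.get();
        for (int i = 0; i < regions; i++) {
            try (FakeConnection c = new FakeConnection()) {
                c.read("row-" + i);
            }
        }
        return FakeConnection.OPENED.get() - before;
    }

    /** The suggested pattern: one connection kept open for the whole task. */
    public static int sharedConnection(int regions) {
        int before = FakeConnection.OPENED.get();
        try (SharedConnection shared = new SharedConnection(FakeConnection::new)) {
            for (int i = 0; i < regions; i++) {
                shared.get().read("row-" + i);
            }
        }
        return FakeConnection.OPENED.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("per-region connections opened: " + connectionPerRegion(100));
        System.out.println("shared connections opened:     " + sharedConnection(100));
    }
}
```

The holder is closed once, when the task finishes, so the number of sessions
opened no longer grows with the number of regions the task touches. Whether
a connection can be shared across mappers, rather than only within one
map/reduce, depends on how Hive runs them and is left open here.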