I've tried to update hive to 0.12 but it did not help :( It seems, that range scans and filter pushdown [1,2] do not work. Can anybody help?
[1] https://issues.apache.org/jira/browse/HIVE-1643 [2] https://issues.apache.org/jira/browse/HIVE-2815 30.01.2014, в 17:31, Бородин Владимир <[email protected]> написал(а): > Hi all! > > I'm having a performance problem with quering data from hbase using hive. I > use CDH 4.5 (hbase-0.94.6, hive-0.10.0 and hadoop-yarn-2.0.0) on a cluster of > 10 hosts. Right now it stores 3 TB of data in hbase table which now consists > of 1000+ regions. One record in it looks like this: > > hbase(main):002:0> get 'users_history', > '00000000000000000010_18446742684488356353' > COLUMN CELL > > > cf:affected timestamp=1389221195263, value=1 > > > cf:date timestamp=1389221195263, > value=1389221195262 > > cf:hidden timestamp=1389221195263, value=0 > > > cf:ip timestamp=1389221195263, > value=95.47.182.98 > > cf:module timestamp=1389221195263, > value=wmi > > cf:operation timestamp=1389221195263, > value=create > > cf:state timestamp=1389221195263, > value=2060000003962075906->2060000003962075906 > > cf:target timestamp=1389221195263, > value=message > > 8 row(s) in 0.0200 seconds > > hbase(main):003:0> > > > I have created the appropriate table in hive like that: > > create external table hbase_users_history(key string, affected int, cf_date > string, hidden int, ip string, module string, operation string, state string, > target string) > stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' > with serdeproperties ("hbase.columns.mapping" = > ":key,cf:affected,cf:date,cf:hidden,cf:ip,cf:module,cf:operation,cf:state,cf:target") > tblproperties("hbase.table.name" = "users_history"); > > > If I run query getting the data by full key it creates one mapper and runs > very fast: > > hive> select * from hbase_users_history where > key='00000000000000000010_18446742684488356353'; > > Total MapReduce jobs = 1 > Launching Job 1 out of 1 > Number of reduce tasks is set to 0 since there's no reduce operator > Starting Job = job_1386711334339_0015, Tracking URL = > http://historydb07d.mail.yandex.net:8088/proxy/application_1386711334339_0015/ > Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1386711334339_0015 > Hadoop job information for Stage-1: number of mappers: 1; number of reducers: > 0 > 2014-01-14 09:23:00,392 Stage-1 map = 0%, reduce = 0% > 2014-01-14 09:23:06,752 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.23 > sec > MapReduce Total cumulative CPU time: 3 seconds 230 msec > Ended Job = job_1386711334339_0015 > MapReduce Jobs Launched: > Job 0: Map: 1 Cumulative CPU: 3.23 sec HDFS Read: 354 HDFS Write: 122 > SUCCESS > Total MapReduce CPU Time Spent: 3 seconds 230 msec > OK > 00000000000000000010_18446742684488356353 1 NULL 0 > 95.47.182.98 wmi create 2060000003962075906->2060000003962075906 > message > Time taken: 21.265 seconds > hive> > > > If I run query getting data by first part of the key it creates 1008 maps and > takes some king of 8-10 hours which seems to be very slow: > > hive> select * from hbase_users_history where key like > '00000000000000000010_%'; > <...> > 2014-01-10 18:40:55,485 Stage-1 map = 99%, reduce = 0%, Cumulative CPU > 732857.15 sec > 2014-01-10 18:40:56,519 Stage-1 map = 100%, reduce = 0%, Cumulative CPU > 732858.27 sec > MapReduce Total cumulative CPU time: 8 days 11 hours 34 minutes 18 seconds > 270 msec > Ended Job = job_1386711334339_0004 > MapReduce Jobs Launched: > Job 0: Map: 1008 Cumulative CPU: 732858.27 sec HDFS Read: 301742 HDFS > Write: 18395 SUCCESS > Total MapReduce CPU Time Spent: 8 days 11 hours 34 minutes 18 seconds 270 msec > OK > <actual data> > Time taken: 34505.449 seconds > hive> > The result is the same if I do "where key between ..". > > > Explain on this query looks like that: > > > explain select * from hbase_users_history where key like > '00000000000000000010_%'; > OK > ABSTRACT SYNTAX TREE: > (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME hbase_users_history))) > (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR > TOK_ALLCOLREF)) (TOK_WHERE (like (TOK_TABLE_OR_COL key) > '00000000000000000010_%')))) > > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 is a root stage > > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > hbase_users_history > TableScan > alias: hbase_users_history > Filter Operator > predicate: > expr: (key like '00000000000000000010_%') > type: boolean > Select Operator > expressions: > expr: key > type: string > expr: affected > type: int > expr: cf_date > type: int > expr: hidden > type: int > expr: ip > type: string > expr: module > type: string > expr: operation > type: string > expr: state > type: string > expr: target > type: string > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, > _col6, _col7, _col8 > File Output Operator > compressed: false > GlobalTableId: 0 > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > > Stage: Stage-0 > Fetch Operator > limit: -1 > > > Time taken: 3.209 seconds > hive> > > So the questions are: > 1. Why does hive start 1000+ mappers (one per every region) if it can find > out the region where the data is stored? > 2. Why mappers for all other regions take so much time to find out that there > are no such keys in this region? > > Thanks. > > -- > Vladimir > > > > -- Vladimir
