I've tried to update hive to 0.12 but it did not help :( It seems, that range 
scans and filter pushdown [1,2] do not work. Can anybody help?

[1] https://issues.apache.org/jira/browse/HIVE-1643 
[2] https://issues.apache.org/jira/browse/HIVE-2815

30.01.2014, в 17:31, Бородин Владимир <[email protected]> написал(а):

> Hi all!
> 
> I'm having a performance problem with quering data from hbase using hive. I 
> use CDH 4.5 (hbase-0.94.6, hive-0.10.0 and hadoop-yarn-2.0.0) on a cluster of 
> 10 hosts. Right now it stores 3 TB of data in hbase table which now consists 
> of 1000+ regions. One record in it looks like this:
> 
> hbase(main):002:0> get 'users_history', 
> '00000000000000000010_18446742684488356353'
> COLUMN                                       CELL                             
>                                                                               
>                    
>  cf:affected                                 timestamp=1389221195263, value=1 
>                                                                               
>                    
>  cf:date                                     timestamp=1389221195263, 
> value=1389221195262                                                           
>                            
>  cf:hidden                                   timestamp=1389221195263, value=0 
>                                                                               
>                    
>  cf:ip                                       timestamp=1389221195263, 
> value=95.47.182.98                                                            
>                           
>  cf:module                                   timestamp=1389221195263, 
> value=wmi                                                                     
>                            
>  cf:operation                                timestamp=1389221195263, 
> value=create                                                                  
>                           
>  cf:state                                    timestamp=1389221195263, 
> value=2060000003962075906->2060000003962075906                                
>                            
>  cf:target                                   timestamp=1389221195263, 
> value=message                                                                 
>                            
> 8 row(s) in 0.0200 seconds
> 
> hbase(main):003:0>
> 
> 
> I have created the appropriate table in hive like that:
> 
> create external table hbase_users_history(key string, affected int, cf_date 
> string, hidden int, ip string, module string, operation string, state string, 
> target string)
> stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> with serdeproperties ("hbase.columns.mapping" = 
> ":key,cf:affected,cf:date,cf:hidden,cf:ip,cf:module,cf:operation,cf:state,cf:target")
> tblproperties("hbase.table.name" = "users_history");
> 
> 
> If I run query getting the data by full key it creates one mapper and runs 
> very fast:
> 
> hive> select * from hbase_users_history where 
> key='00000000000000000010_18446742684488356353';                              
>                                                 
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_1386711334339_0015, Tracking URL = 
> http://historydb07d.mail.yandex.net:8088/proxy/application_1386711334339_0015/
> Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1386711334339_0015
> Hadoop job information for Stage-1: number of mappers: 1; number of reducers: > 0
> 2014-01-14 09:23:00,392 Stage-1 map = 0%,  reduce = 0%
> 2014-01-14 09:23:06,752 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.23 
> sec
> MapReduce Total cumulative CPU time: 3 seconds 230 msec
> Ended Job = job_1386711334339_0015
> MapReduce Jobs Launched: 
> Job 0: Map: 1   Cumulative CPU: 3.23 sec   HDFS Read: 354 HDFS Write: 122 
> SUCCESS
> Total MapReduce CPU Time Spent: 3 seconds 230 msec
> OK
> 00000000000000000010_18446742684488356353       1       NULL    0       
> 95.47.182.98    wmi     create  2060000003962075906->2060000003962075906      
>   message
> Time taken: 21.265 seconds
> hive>
> 
> 
> If I run query getting data by first part of the key it creates 1008 maps and 
> takes some king of 8-10 hours which seems to be very slow:
> 
> hive> select * from hbase_users_history where key like 
> '00000000000000000010_%';
> <...>
> 2014-01-10 18:40:55,485 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 
> 732857.15 sec
> 2014-01-10 18:40:56,519 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
> 732858.27 sec
> MapReduce Total cumulative CPU time: 8 days 11 hours 34 minutes 18 seconds 
> 270 msec
> Ended Job = job_1386711334339_0004
> MapReduce Jobs Launched: 
> Job 0: Map: 1008   Cumulative CPU: 732858.27 sec   HDFS Read: 301742 HDFS 
> Write: 18395 SUCCESS
> Total MapReduce CPU Time Spent: 8 days 11 hours 34 minutes 18 seconds 270 msec
> OK
> <actual data>
> Time taken: 34505.449 seconds
> hive>
> The result is the same if I do "where key between ..".
> 
> 
> Explain on this query looks like that:
> 
>     > explain select * from hbase_users_history where key like 
> '00000000000000000010_%';
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME hbase_users_history))) 
> (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
> TOK_ALLCOLREF)) (TOK_WHERE (like (TOK_TABLE_OR_COL key) 
> '00000000000000000010_%'))))
> 
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> 
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         hbase_users_history 
>           TableScan
>             alias: hbase_users_history
>             Filter Operator
>               predicate:
>                   expr: (key like '00000000000000000010_%')
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: key
>                       type: string
>                       expr: affected
>                       type: int
>                       expr: cf_date
>                       type: int
>                       expr: hidden
>                       type: int
>                       expr: ip
>                       type: string
>                       expr: module
>                       type: string
>                       expr: operation
>                       type: string
>                       expr: state
>                       type: string
>                       expr: target
>                       type: string
>                 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
> _col6, _col7, _col8
>                 File Output Operator
>                   compressed: false
>                   GlobalTableId: 0
>                   table:
>                       input format: org.apache.hadoop.mapred.TextInputFormat
>                       output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> 
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> 
> 
> Time taken: 3.209 seconds
> hive>
> 
> So the questions are:
> 1. Why does hive start 1000+ mappers (one per every region) if it can find 
> out the region where the data is stored?
> 2. Why mappers for all other regions take so much time to find out that there 
> are no such keys in this region?
> 
> Thanks.
> 
> --
> Vladimir
> 
> 
> 
> 


--
Vladimir




Reply via email to