Hi all!

I'm having a performance problem with quering data from hbase using hive. I use 
CDH 4.5 (hbase-0.94.6, hive-0.10.0 and hadoop-yarn-2.0.0) on a cluster of 10 
hosts. Right now it stores 3 TB of data in hbase table which now consists of 
1000+ regions. One record in it looks like this:

hbase(main):002:0> get 'users_history', 
'00000000000000000010_18446742684488356353'
COLUMN                                       CELL                               
                                                                                
               
 cf:affected                                 timestamp=1389221195263, value=1   
                                                                                
              
 cf:date                                     timestamp=1389221195263, 
value=1389221195262                                                             
                         
 cf:hidden                                   timestamp=1389221195263, value=0   
                                                                                
              
 cf:ip                                       timestamp=1389221195263, 
value=95.47.182.98                                                              
                        
 cf:module                                   timestamp=1389221195263, value=wmi 
                                                                                
              
 cf:operation                                timestamp=1389221195263, 
value=create                                                                    
                         
 cf:state                                    timestamp=1389221195263, 
value=2060000003962075906->2060000003962075906                                  
                         
 cf:target                                   timestamp=1389221195263, 
value=message                                                                   
                        
8 row(s) in 0.0200 seconds

hbase(main):003:0>


I have created the appropriate table in hive like that:

create external table hbase_users_history(key string, affected int, cf_date 
string, hidden int, ip string, module string, operation string, state string, 
target string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = 
":key,cf:affected,cf:date,cf:hidden,cf:ip,cf:module,cf:operation,cf:state,cf:target")
tblproperties("hbase.table.name" = "users_history");


If I run query getting the data by full key it creates one mapper and runs very 
fast:

hive> select * from hbase_users_history where 
key='00000000000000000010_18446742684488356353';                                
                                             
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1386711334339_0015, Tracking URL = 
http://historydb07d.mail.yandex.net:8088/proxy/application_1386711334339_0015/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1386711334339_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-01-14 09:23:00,392 Stage-1 map = 0%,  reduce = 0%
2014-01-14 09:23:06,752 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.23 
sec
MapReduce Total cumulative CPU time: 3 seconds 230 msec
Ended Job = job_1386711334339_0015
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 3.23 sec   HDFS Read: 354 HDFS Write: 122 
SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 230 msec
OK
00000000000000000010_18446742684488356353       1       NULL    0       
95.47.182.98    wmi     create  2060000003962075906->2060000003962075906        
message
Time taken: 21.265 seconds
hive>


If I run query getting data by first part of the key it creates 1008 maps and 
takes some king of 8-10 hours which seems to be very slow:

hive> select * from hbase_users_history where key like '00000000000000000010_%';
<...>
2014-01-10 18:40:55,485 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 
732857.15 sec
2014-01-10 18:40:56,519 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 
732858.27 sec
MapReduce Total cumulative CPU time: 8 days 11 hours 34 minutes 18 seconds 270 
msec
Ended Job = job_1386711334339_0004
MapReduce Jobs Launched: 
Job 0: Map: 1008   Cumulative CPU: 732858.27 sec   HDFS Read: 301742 HDFS 
Write: 18395 SUCCESS
Total MapReduce CPU Time Spent: 8 days 11 hours 34 minutes 18 seconds 270 msec
OK
<actual data>
Time taken: 34505.449 seconds
hive>
The result is the same if I do "where key between ...".


Explain on this query looks like that:

    > explain select * from hbase_users_history where key like 
'00000000000000000010_%';
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME hbase_users_history))) 
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
TOK_ALLCOLREF)) (TOK_WHERE (like (TOK_TABLE_OR_COL key) 
'00000000000000000010_%'))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        hbase_users_history 
          TableScan
            alias: hbase_users_history
            Filter Operator
              predicate:
                  expr: (key like '00000000000000000010_%')
                  type: boolean
              Select Operator
                expressions:
                      expr: key
                      type: string
                      expr: affected
                      type: int
                      expr: cf_date
                      type: int
                      expr: hidden
                      type: int
                      expr: ip
                      type: string
                      expr: module
                      type: string
                      expr: operation
                      type: string
                      expr: state
                      type: string
                      expr: target
                      type: string
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, 
_col6, _col7, _col8
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1


Time taken: 3.209 seconds
hive>

So the questions are:
1. Why does hive start 1000+ mappers (one per every region) if it can find out 
the region where the data is stored?
2. Why mappers for all other regions take so much time to find out that there 
are no such keys in this region?

Thanks.

--
Vladimir




Reply via email to