Hi all!
I'm having a performance problem with quering data from hbase using hive. I use
CDH 4.5 (hbase-0.94.6, hive-0.10.0 and hadoop-yarn-2.0.0) on a cluster of 10
hosts. Right now it stores 3 TB of data in hbase table which now consists of
1000+ regions. One record in it looks like this:
hbase(main):002:0> get 'users_history',
'00000000000000000010_18446742684488356353'
COLUMN CELL
cf:affected timestamp=1389221195263, value=1
cf:date timestamp=1389221195263,
value=1389221195262
cf:hidden timestamp=1389221195263, value=0
cf:ip timestamp=1389221195263,
value=95.47.182.98
cf:module timestamp=1389221195263, value=wmi
cf:operation timestamp=1389221195263,
value=create
cf:state timestamp=1389221195263,
value=2060000003962075906->2060000003962075906
cf:target timestamp=1389221195263,
value=message
8 row(s) in 0.0200 seconds
hbase(main):003:0>
I have created the appropriate table in hive like that:
create external table hbase_users_history(key string, affected int, cf_date
string, hidden int, ip string, module string, operation string, state string,
target string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" =
":key,cf:affected,cf:date,cf:hidden,cf:ip,cf:module,cf:operation,cf:state,cf:target")
tblproperties("hbase.table.name" = "users_history");
If I run query getting the data by full key it creates one mapper and runs very
fast:
hive> select * from hbase_users_history where
key='00000000000000000010_18446742684488356353';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1386711334339_0015, Tracking URL =
http://historydb07d.mail.yandex.net:8088/proxy/application_1386711334339_0015/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1386711334339_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-01-14 09:23:00,392 Stage-1 map = 0%, reduce = 0%
2014-01-14 09:23:06,752 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.23
sec
MapReduce Total cumulative CPU time: 3 seconds 230 msec
Ended Job = job_1386711334339_0015
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 3.23 sec HDFS Read: 354 HDFS Write: 122
SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 230 msec
OK
00000000000000000010_18446742684488356353 1 NULL 0
95.47.182.98 wmi create 2060000003962075906->2060000003962075906
message
Time taken: 21.265 seconds
hive>
If I run query getting data by first part of the key it creates 1008 maps and
takes some king of 8-10 hours which seems to be very slow:
hive> select * from hbase_users_history where key like '00000000000000000010_%';
<...>
2014-01-10 18:40:55,485 Stage-1 map = 99%, reduce = 0%, Cumulative CPU
732857.15 sec
2014-01-10 18:40:56,519 Stage-1 map = 100%, reduce = 0%, Cumulative CPU
732858.27 sec
MapReduce Total cumulative CPU time: 8 days 11 hours 34 minutes 18 seconds 270
msec
Ended Job = job_1386711334339_0004
MapReduce Jobs Launched:
Job 0: Map: 1008 Cumulative CPU: 732858.27 sec HDFS Read: 301742 HDFS
Write: 18395 SUCCESS
Total MapReduce CPU Time Spent: 8 days 11 hours 34 minutes 18 seconds 270 msec
OK
<actual data>
Time taken: 34505.449 seconds
hive>
The result is the same if I do "where key between ...".
Explain on this query looks like that:
> explain select * from hbase_users_history where key like
'00000000000000000010_%';
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME hbase_users_history)))
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR
TOK_ALLCOLREF)) (TOK_WHERE (like (TOK_TABLE_OR_COL key)
'00000000000000000010_%'))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
hbase_users_history
TableScan
alias: hbase_users_history
Filter Operator
predicate:
expr: (key like '00000000000000000010_%')
type: boolean
Select Operator
expressions:
expr: key
type: string
expr: affected
type: int
expr: cf_date
type: int
expr: hidden
type: int
expr: ip
type: string
expr: module
type: string
expr: operation
type: string
expr: state
type: string
expr: target
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6, _col7, _col8
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 3.209 seconds
hive>
So the questions are:
1. Why does hive start 1000+ mappers (one per every region) if it can find out
the region where the data is stored?
2. Why mappers for all other regions take so much time to find out that there
are no such keys in this region?
Thanks.
--
Vladimir