Hi,
I am researching a project to load data from MySQL into HBase.
The data is our user data. We want to analyze the users together with their
activity and tracking data. The data set is big: the record count reaches into
the billions, and the total size is several terabytes.
The purpose of this data in this system is purely analytical. The reason I am
considering HBase instead of plain HDFS is the daily churn of the data.
Every day, roughly 1 to 5% of users have some update/delete activity. Because
HDFS is a write-once system, I would have to regenerate the whole latest
snapshot just for a small change in the data, which is wasteful.
So I am thinking of storing the user data in HBase, while all the other
activity/tracking data (the fact data) stays in HDFS. I hope I can get good
performance joining datasets between HBase and HDFS. I tested the HBase scan
speed; it is not as good as HDFS, but maybe acceptable for my case. HBase here
is only a sink/source for MR jobs in Hadoop, so I want to give it a try.
This system has to support Hive, as most of our users prefer SQL. What I am
not sure about is the best way to lay the data out in HBase. As an example,
say I have data from a users table and from an addresses table, with a
one-to-many relationship between them: user (userId, name, ...) and address
(addressId, userId, street, state, ...). I think I have the following 2
options:
Option 1: store them as 2 tables in HBase. Using userId as the row key of the
Users table is obvious. For the Addresses table, I can use either addressId as
the row key, or a composite (userId + addressId). The benefit of this approach
is that it is straightforward, and both tables map easily to Hive external
tables. But I have the following doubts:
- The data in the 2 tables will be joined again and again. If I could store
them together during ETL, I could avoid that.
- Even if I choose (userId + addressId) as the row key (I can map this row key
into 2 columns in Hive), it won't really help me during the join, right? If I
run a query like (select * from addresses where userId = 'xxx'), HBase still
needs to search across all the regions, since the row key is (userId +
addressId), correct? Or can HBase make use of just the first part of the row
key being queried?
- In that case, there is not much HBase-side optimization to use. It will be a
pure data source for MR jobs, right?
Option 2: store them in one table. But since I don't know how many addresses a
user could have, even if I create an address column family in the user table,
how do I name the columns? I thought of using addressId as a prefix of the
column names in this CF, but from what I have found out so far, it is
impossible to map dynamic column names from HBase into Hive, especially when
the count of columns is unknown. I would prefer to store the data this way,
but I have to support Hive, and I don't know how to achieve that. Besides,
there are 7-8 user-related tables, and I am not sure storing all of them
together is a good idea.
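To make my question about the composite row key in option 1 concrete, here is
a plain-Java sketch of what I mean (no HBase dependency; the key layout, the
separator byte, and the sample values are all made up by me). A TreeMap stands
in for the sorted keyspace of an HBase table, and the stop-row computation
mirrors what I understand Scan#setRowPrefixFilter does internally:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.TreeMap;

// Sketch of a composite (userId + addressId) row key. Row keys in HBase sort
// lexicographically, so all addresses of one user are contiguous, and a scan
// bounded by [userId, nextPrefix) would only touch that slice of the keyspace.
public class CompositeKeySketch {
    // Hypothetical key layout: "<userId>\0<addressId>"
    static byte[] rowKey(String userId, String addressId) {
        return (userId + '\0' + addressId).getBytes(StandardCharsets.UTF_8);
    }

    // Stop row for a prefix scan: the prefix with its last byte incremented
    // (ignoring the 0xFF edge case for this sketch).
    static byte[] stopRowForPrefix(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        stop[stop.length - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        // TreeMap over byte[] stands in for an HBase table's sorted keyspace.
        TreeMap<byte[], String> table = new TreeMap<>(Arrays::compare);
        table.put(rowKey("u001", "a9"), "1 Main St");
        table.put(rowKey("u002", "a1"), "2 Oak Ave");
        table.put(rowKey("u001", "a2"), "3 Pine Rd");

        byte[] start = "u001".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        // Prints only u001's two addresses, in addressId order:
        // "3 Pine Rd" then "1 Main St".
        table.subMap(start, true, stop, false)
             .values()
             .forEach(System.out::println);
    }
}
```

(One caveat I am aware of: HBase compares row key bytes as unsigned, while
Arrays.compare is signed; for the ASCII values here the ordering is the same.)
My question is whether a scan bounded like this actually lets HBase skip the
regions outside the prefix range, or whether it still touches all of them.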
I want to know if anyone here can share the good and bad points of each
approach. Since I am still very new to HBase, I also have some additional
questions:
1) I know I should use short CF names and column names, so I am thinking of
just using index numbers as column names, like (1, 2, 3, ...). Is that fine?
2) I am thinking of storing the frequently queried columns in one CF and the
rest in another. How many CFs can I create? The HBase wiki says only 2 or 3
CFs are enough. Isn't that kind of small?
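To make question 1 concrete, this is the kind of name-shortening I have in
mind (the logical column names, the two-CF split "f"/"r", and the cell count
are all made-up examples). My understanding is that HBase stores the CF name
and qualifier alongside every single cell, which is why the short names would
add up:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mapping from readable column names to index-number qualifiers.
// "f" = frequently queried CF, "r" = the rest (the two-CF split in question 2).
public class ShortNames {
    static final Map<String, String> QUALIFIERS = new LinkedHashMap<>();
    static {
        QUALIFIERS.put("f:name",   "f:1");
        QUALIFIERS.put("f:email",  "f:2");
        QUALIFIERS.put("r:street", "r:1");
        QUALIFIERS.put("r:state",  "r:2");
    }

    // Bytes saved per cell on the stored column name alone
    // (ignoring row key, timestamp, and encoding/compression).
    static int bytesSaved(String logical, String shortName) {
        return logical.length() - shortName.length();
    }

    public static void main(String[] args) {
        long cells = 1_000_000_000L;  // assumed: one such cell per user row
        int saved = bytesSaved("f:email", "f:2");
        System.out.println("bytes saved per cell: " + saved);
        System.out.println("GB saved across all cells: "
                + (saved * cells / 1_000_000_000L));
    }
}
```

If index-number qualifiers like this are a bad idea for some reason (or if
block encoding/compression already makes the savings irrelevant), I would like
to know that too.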