Hi, 
I am researching a project to load data from MySQL into HBase.
The data is our user data. We want to analyze the user data with its activities 
and tracking. The user data is big. The record count reaches billion and whole 
data set size is about more than several Ts.
The purpose of this data in this system is pure analyzing. The reason I am 
thinking of using HBase instead of HDFS is because of daily change of the data. 
Every day, there could be about 1 to 5% of user having any update/delete 
activities. Because HDFS is a write-once system, I have to regenerate the whole 
latest snapshot just for small change of the data, it is kind of wasting.
So I am thinking storing the data into HBase. All other activities/tracking 
data (Fact data) will store in HDFS. I hope I can get good performance joining 
the data between HBase and HDFS for different datasets. I test the HBase scan 
speed, not as good as HDFS, but maybe acceptable for my case. HBase here is 
only a sink/source of MR jobs in Hadoop. So I want to give it a try.
This system has to support Hive, as most of users prefer SQL. When I want to 
store the data into HBase, I am not sure what is the best way to store them.
As sample example, if I have data from the users table, and also have the data 
from addresses table, it is one-to-many relationship between them. Let's say I 
have user with (userid, name, ...)and address with (addressId, userId, street, 
state ...). I think I have following 2 options:
Store them as 2 tables in HBase, using userid as row key of Users table is 
obvious. For address table, I can use either addressId as row key, or (userId + 
addressId) as row key. The benefits of this way is that it is a straight 
forward, and I have very easy mapping both tables into Hive external table. But 
I have following doubts:The data between 2 tables will be join again and again. 
If I can store them together during ETL, I can avoid thatEven if I choose 
(userId + addressId) as row key (I can map this row key into 2 columns in 
Hive), it won't really help me during the join, right? If I query like this 
(select * from addresses where userId = 'xxx'), the HBase still need to search 
across all the regions, as the row key is (userid + addressed), correct? Can 
HBase utilize that first part of the row key being queried?In this case, there 
is not much optimizing can use from HBase. It will be pure data source of MR 
job, right?Another way is to store them in one table. But since I don't know 
how many addresses a user could have, even if I create an address column family 
in user table, how do I create column name? I thought of using addressId as 
prefix of the column name in this CF, but from what I found out so far, it is 
impossible to map dynamic column names of HBase in Hive, especially when the 
count of columns is unknown. I would prefer to store this way, but I have to 
support Hive, and don't know how to archive that. Besides there are 7-8 user 
related tables, I am not sure store all of them together is a good idea.
I want to know if anyone here can share some good/bad of each way. Since I am 
still very new to HBase, I have also some additional questions:
1) I know I should use short CF names and column names, so I am thinking just 
use index number as column names, like (1, 2, 3 ...). Is that fine?2) I am 
thinking storing frequently queried columns into one CF, the rest in another. 
How many CFs I can create? From HBase wiki page, it said only 2 or 3 CFs are 
enough. Is that kind of too small?                                    

Reply via email to