Hi, Ted: I will take a look. Thanks Yong
> Date: Thu, 13 Feb 2014 20:39:18 -0800 > Subject: Re: some data modeling questions related to Hbase > From: [email protected] > To: [email protected] > > bq. This system has to support Hive, as most of users prefer SQL > > Have you considered Apache Phoenix ? > See http://incubator.apache.org/projects/phoenix.html > > Cheers > > > On Thu, Feb 13, 2014 at 6:49 PM, java8964 <[email protected]> wrote: > > > Hi, > > I am researching a project to load data from MySQL into HBase. > > The data is our user data. We want to analyze the user data with its > > activities and tracking. The user data is big. The record count reaches > > billion and whole data set size is about more than several Ts. > > The purpose of this data in this system is pure analyzing. The reason I am > > thinking of using HBase instead of HDFS is because of daily change of the > > data. Every day, there could be about 1 to 5% of user having any > > update/delete activities. Because HDFS is a write-once system, I have to > > regenerate the whole latest snapshot just for small change of the data, it > > is kind of wasting. > > So I am thinking storing the data into HBase. All other > > activities/tracking data (Fact data) will store in HDFS. I hope I can get > > good performance joining the data between HBase and HDFS for different > > datasets. I test the HBase scan speed, not as good as HDFS, but maybe > > acceptable for my case. HBase here is only a sink/source of MR jobs in > > Hadoop. So I want to give it a try. > > This system has to support Hive, as most of users prefer SQL. When I want > > to store the data into HBase, I am not sure what is the best way to store > > them. > > As sample example, if I have data from the users table, and also have the > > data from addresses table, it is one-to-many relationship between them. > > Let's say I have user with (userid, name, ...)and address with (addressId, > > userId, street, state ...). I think I have following 2 options: > > Store them as 2 tables in HBase, using userid as row key of Users table is > > obvious. For address table, I can use either addressId as row key, or > > (userId + addressId) as row key. The benefits of this way is that it is a > > straight forward, and I have very easy mapping both tables into Hive > > external table. But I have following doubts:The data between 2 tables will > > be join again and again. If I can store them together during ETL, I can > > avoid thatEven if I choose (userId + addressId) as row key (I can map this > > row key into 2 columns in Hive), it won't really help me during the join, > > right? If I query like this (select * from addresses where userId = 'xxx'), > > the HBase still need to search across all the regions, as the row key is > > (userid + addressed), correct? Can HBase utilize that first part of the row > > key being queried?In this case, there is not much optimizing can use from > > HBase. It will be pure data source of MR job, right?Another way is to store > > them in one table. But since I don't know how many addresses a user could > > have, even if I create an address column family in user table, how do I > > create column name? I thought of using addressId as prefix of the column > > name in this CF, but from what I found out so far, it is impossible to map > > dynamic column names of HBase in Hive, especially when the count of columns > > is unknown. I would prefer to store this way, but I have to support Hive, > > and don't know how to archive that. Besides there are 7-8 user related > > tables, I am not sure store all of them together is a good idea. > > I want to know if anyone here can share some good/bad of each way. Since I > > am still very new to HBase, I have also some additional questions: > > 1) I know I should use short CF names and column names, so I am thinking > > just use index number as column names, like (1, 2, 3 ...). Is that fine?2) > > I am thinking storing frequently queried columns into one CF, the rest in > > another. How many CFs I can create? From HBase wiki page, it said only 2 or > > 3 CFs are enough. Is that kind of too small? > >
