This post would give you some more idea about how Phoenix is used: http://search-hadoop.com/m/zS4qCsW1Ry1/james+phoenix+create+table&subj=Re+Rowkey+design+and+presplit+table
Cheers On Fri, Feb 14, 2014 at 6:36 AM, java8964 <[email protected]> wrote: > Hi, Ted: > I will take a look. > Thanks > Yong > > > Date: Thu, 13 Feb 2014 20:39:18 -0800 > > Subject: Re: some data modeling questions related to Hbase > > From: [email protected] > > To: [email protected] > > > > bq. This system has to support Hive, as most of users prefer SQL > > > > Have you considered Apache Phoenix ? > > See http://incubator.apache.org/projects/phoenix.html > > > > Cheers > > > > > > On Thu, Feb 13, 2014 at 6:49 PM, java8964 <[email protected]> wrote: > > > > > Hi, > > > I am researching a project to load data from MySQL into HBase. > > > The data is our user data. We want to analyze the user data with its > > > activities and tracking. The user data is big. The record count reaches > > > billion and whole data set size is about more than several Ts. > > > The purpose of this data in this system is pure analyzing. The reason > I am > > > thinking of using HBase instead of HDFS is because of daily change of > the > > > data. Every day, there could be about 1 to 5% of user having any > > > update/delete activities. Because HDFS is a write-once system, I have > to > > > regenerate the whole latest snapshot just for small change of the > data, it > > > is kind of wasting. > > > So I am thinking storing the data into HBase. All other > > > activities/tracking data (Fact data) will store in HDFS. I hope I can > get > > > good performance joining the data between HBase and HDFS for different > > > datasets. I test the HBase scan speed, not as good as HDFS, but maybe > > > acceptable for my case. HBase here is only a sink/source of MR jobs in > > > Hadoop. So I want to give it a try. > > > This system has to support Hive, as most of users prefer SQL. When I > want > > > to store the data into HBase, I am not sure what is the best way to > store > > > them. > > > As sample example, if I have data from the users table, and also have > the > > > data from addresses table, it is one-to-many relationship between them. > > > Let's say I have user with (userid, name, ...)and address with > (addressId, > > > userId, street, state ...). I think I have following 2 options: > > > Store them as 2 tables in HBase, using userid as row key of Users > table is > > > obvious. For address table, I can use either addressId as row key, or > > > (userId + addressId) as row key. The benefits of this way is that it > is a > > > straight forward, and I have very easy mapping both tables into Hive > > > external table. But I have following doubts:The data between 2 tables > will > > > be join again and again. If I can store them together during ETL, I can > > > avoid thatEven if I choose (userId + addressId) as row key (I can map > this > > > row key into 2 columns in Hive), it won't really help me during the > join, > > > right? If I query like this (select * from addresses where userId = > 'xxx'), > > > the HBase still need to search across all the regions, as the row key > is > > > (userid + addressed), correct? Can HBase utilize that first part of > the row > > > key being queried?In this case, there is not much optimizing can use > from > > > HBase. It will be pure data source of MR job, right?Another way is to > store > > > them in one table. But since I don't know how many addresses a user > could > > > have, even if I create an address column family in user table, how do I > > > create column name? I thought of using addressId as prefix of the > column > > > name in this CF, but from what I found out so far, it is impossible to > map > > > dynamic column names of HBase in Hive, especially when the count of > columns > > > is unknown. I would prefer to store this way, but I have to support > Hive, > > > and don't know how to archive that. Besides there are 7-8 user related > > > tables, I am not sure store all of them together is a good idea. > > > I want to know if anyone here can share some good/bad of each way. > Since I > > > am still very new to HBase, I have also some additional questions: > > > 1) I know I should use short CF names and column names, so I am > thinking > > > just use index number as column names, like (1, 2, 3 ...). Is that > fine?2) > > > I am thinking storing frequently queried columns into one CF, the rest > in > > > another. How many CFs I can create? From HBase wiki page, it said only > 2 or > > > 3 CFs are enough. Is that kind of too small? > > > > >
