Re: some data modeling questions related to Hbase

Ted Yu Fri, 14 Feb 2014 09:06:43 -0800

This post would give you some more idea about how Phoenix is used:

http://search-hadoop.com/m/zS4qCsW1Ry1/james+phoenix+create+table&subj=Re+Rowkey+design+and+presplit+table


Cheers


On Fri, Feb 14, 2014 at 6:36 AM, java8964 <[email protected]> wrote:

> Hi, Ted:
> I will take a look.
> Thanks
> Yong
>
> > Date: Thu, 13 Feb 2014 20:39:18 -0800
> > Subject: Re: some data modeling questions related to Hbase
> > From: [email protected]
> > To: [email protected]
> >
> > bq. This system has to support Hive, as most of users prefer SQL
> >
> > Have you considered Apache Phoenix ?
> > See http://incubator.apache.org/projects/phoenix.html
> >
> > Cheers
> >
> >
> > On Thu, Feb 13, 2014 at 6:49 PM, java8964 <[email protected]> wrote:
> >
> > > Hi,
> > > I am researching a project to load data from MySQL into HBase.
> > > The data is our user data. We want to analyze the user data with its
> > > activities and tracking. The user data is big. The record count reaches
> > > billion and whole data set size is about more than several Ts.
> > > The purpose of this data in this system is pure analyzing. The reason
> I am
> > > thinking of using HBase instead of HDFS is because of daily change of
> the
> > > data. Every day, there could be about 1 to 5% of user having any
> > > update/delete activities. Because HDFS is a write-once system, I have
> to
> > > regenerate the whole latest snapshot just for small change of the
> data, it
> > > is kind of wasting.
> > > So I am thinking storing the data into HBase. All other
> > > activities/tracking data (Fact data) will store in HDFS. I hope I can
> get
> > > good performance joining the data between HBase and HDFS for different
> > > datasets. I test the HBase scan speed, not as good as HDFS, but maybe
> > > acceptable for my case. HBase here is only a sink/source of MR jobs in
> > > Hadoop. So I want to give it a try.
> > > This system has to support Hive, as most of users prefer SQL. When I
> want
> > > to store the data into HBase, I am not sure what is the best way to
> store
> > > them.
> > > As sample example, if I have data from the users table, and also have
> the
> > > data from addresses table, it is one-to-many relationship between them.
> > > Let's say I have user with (userid, name, ...)and address with
> (addressId,
> > > userId, street, state ...). I think I have following 2 options:
> > > Store them as 2 tables in HBase, using userid as row key of Users
> table is
> > > obvious. For address table, I can use either addressId as row key, or
> > > (userId + addressId) as row key. The benefits of this way is that it
> is a
> > > straight forward, and I have very easy mapping both tables into Hive
> > > external table. But I have following doubts:The data between 2 tables
> will
> > > be join again and again. If I can store them together during ETL, I can
> > > avoid thatEven if I choose (userId + addressId) as row key (I can map
> this
> > > row key into 2 columns in Hive), it won't really help me during the
> join,
> > > right? If I query like this (select * from addresses where userId =
> 'xxx'),
> > > the HBase still need to search across all the regions, as the row key
> is
> > > (userid + addressed), correct? Can HBase utilize that first part of
> the row
> > > key being queried?In this case, there is not much optimizing can use
> from
> > > HBase. It will be pure data source of MR job, right?Another way is to
> store
> > > them in one table. But since I don't know how many addresses a user
> could
> > > have, even if I create an address column family in user table, how do I
> > > create column name? I thought of using addressId as prefix of the
> column
> > > name in this CF, but from what I found out so far, it is impossible to
> map
> > > dynamic column names of HBase in Hive, especially when the count of
> columns
> > > is unknown. I would prefer to store this way, but I have to support
> Hive,
> > > and don't know how to archive that. Besides there are 7-8 user related
> > > tables, I am not sure store all of them together is a good idea.
> > > I want to know if anyone here can share some good/bad of each way.
> Since I
> > > am still very new to HBase, I have also some additional questions:
> > > 1) I know I should use short CF names and column names, so I am
> thinking
> > > just use index number as column names, like (1, 2, 3 ...). Is that
> fine?2)
> > > I am thinking storing frequently queried columns into one CF, the rest
> in
> > > another. How many CFs I can create? From HBase wiki page, it said only
> 2 or
> > > 3 CFs are enough. Is that kind of too small?
> > >
>
>

Re: some data modeling questions related to Hbase

Reply via email to