Re: Advices for HTable schema

Jean-Marc Spaggiari Mon, 02 Jul 2012 16:49:01 -0700

Addresses will mainly be accessed independently, and sometime only,
with the other data.


I'm not sure either to prefer the "versions" option. So if I go with a
2nd table, does it mean it's better to have more rows than more
columns?

Here are the 2 options now. Both with a new table.

1) I store the key "personID" and a:a1 to a:an for the addresses.
2) I store the key "personID" + "address

In both I will have the same amount of data. In #1 total size will be
smaller since the key will be stored only once. In #1 I will have more
columns where in #2 I will have more rows.

Is there one better than the other one? Also, if I go with option 1,
why is it better to have a 2nd table instead of a 2nd column familly?

JM

2012/7/2, Amandeep Khurana <[email protected]>:
> Responses inline
>
>
> On Monday, July 2, 2012 at 12:53 PM, Jean-Marc Spaggiari wrote:
>
>> Hi Amandeep,
>>
>> Thanks for your prompt reply.
>>
>> I forgot to add that all the addresses are valid at the same time.
>> There is no orders int the addresses. They are all active addresses at
>> the same time. If one is not valid any more, it's removed. If there is
>> a new one, it's added to the list, not replacing any other. So it's
>> not "the last address", but I have to consider all the addresses when
>> I will process them.
>>
>> Regarding adding the address count in the first CF, don't ask me why I
>> have put in in 'a'. I have no clue why I did not tought about adding
>> it to 'b' directly. I agree that it's useless to have it in 'a'.
>>
>> The idea of the hash as a column name was just to have something to
>> put there. It's like the '1' in the second solution. A random number
>> will do the same thing.
>>
>> I'm accessing the data in 2 ways.
>> 1) I acces the person information to update them or retreive all of
>> them to display them
>> 2) I access only the address the compute some statistiques about it.
>> Which mean usually I read ALL the address for one person and not just
>> one address at a time.
>>
> So, that means that the addresses are accessed independently of the other
> information and you always access all the addresses together? Or does that
> mean that the addresses are accessed along with the other information to
> display or retrieve and they are also accessed separately for the stats
> calculation?
>
> You could consider the following ideas:
>
> 1. Store everything in 'a' and let all addresses go into the column
> 'a:address'. Increase the versions to N, where N is the max number of
> addresses you want to store for any user.
>
> OR
>
> 2. Store addresses in an entirely different table with the rowkey being
> user+address. The column qualifier and cell value could be just a simple 1
> for the sake of having something there. When you want to get all addresses
> for a user, you just scan from start key 'user' to end key 'user+1'.
>
> I'm not a fan of the first schema option that you outlined earlier because
> of the complexity involved in the client code. That approach works with
> relational databases where you have the ability to do transactions. In the
> HBase world, not so much.
>>
>> So basically, there all the 3 options almost the same thing. If I
>> store the number of addresses, I will have more work when I have to
>> add/remove one entry, same amount of work when I want to parse the
>> entries, and less work when I want to count the entries.
>>
>> Difficult choice. I don't find any schema better than the other one
>> because all of them have pros and cons. For now, my prefered one is #1
>> because it's sound more "natural" to store the number of columns, then
>> parse them by name, etc. but I think I need to think about it a little
>> be more before taking any decision...
>>
>> JM
>>
>> 2012/7/2, Amandeep Khurana <[email protected] (mailto:[email protected])>:
>> > Jean-Marc,
>> >
>> > These are great questions! Find my answers (and some questions for you)
>> > inline.
>> >
>> > -ak
>> >
>> >
>> > On Monday, July 2, 2012 at 12:04 PM, Jean-Marc Spaggiari wrote:
>> >
>> > > Hi,
>> > >
>> > > I have a question regarding the best way to design a table.
>> > >
>> > > Let's imagine I want to store all the people in the world on a
>> > > database.
>> > >
>> > > Everyone has a name, last name, phone number, lot of flags (sex, age,
>> > > etc.).
>> > >
>> > > Now, people can have one address, but they can also have 2, or 3, or
>> > > even more... But they will never have thousands of addresses. Let's
>> > > say, usually, they have between 1 and 10.
>> > >
>> >
>> >
>> > The point to think about here is - what will be your read access
>> > pattern?
>> > Will you always want the latest address? Or will you want all addresses
>> > every time? And then also defining the maximum number of addresses to
>> > be
>> > stored.
>> > >
>> > > My table is designes like that.
>> > >
>> > > create 'person', {NAME => 'a', VERSIONS => 1}, {NAME => 'b', VERSIONS
>> > > => 1, COMPRESSION => 'gz'}
>> > >
>> >
>> >
>> > You could easily bump up the versions to a number that limits the max
>> > number
>> > of addresses you will store.
>> > Having two separate column families is not the way to solve this problem
>> > in
>> > my opinion. Reason being - the concept of column families enables you
>> > to
>> > isolate data with different access patterns. If that's what you desire
>> > here,
>> > multiple families make sense. But again, this goes back to defining
>> > your
>> > read patterns. Are you going to access all the data together or are the
>> > addresses going to be accessed independently of the rest of the
>> > information.
>> > >
>> > > The 'a' CF will contain all the informations exepct the address.
>> > > The 'b' CF will contain only the address.
>> > >
>> > > I have few options to store the addresses.
>> > > I can:
>> > > - Store in CF 'a' a flag to tell how many addresses there is and
>> > > store
>> > > "add1" to "addx" in the 'b' CF will each cell containing the address.
>> > >
>> >
>> >
>> > This sort of becomes a case where you'll need to build a transaction
>> > like
>> > logic in your client code. When you want to store an additional
>> > address,
>> > you'll need to do the following:
>> > 1. read counter from 'a'. Let's say that is n.
>> > 2. store next address with CQ as add[n++]
>> > 3. store n++ as the counter
>> >
>> > That complicates the client code and is undesirable. Moreover, you are
>> > accessing both column families at the time of any access to the address
>> > info. It is probably better to store the counter in 'b' instead of 'a'
>> > in
>> > this approach but you still have the complication of the transaction
>> > like
>> > logic.
>> > > - Store in CF 'b' the addresses using an hash as the column
>> > > identifier.
>> >
>> >
>> > The hash doesn't buy you anything. How do you ensure that you are
>> > reading
>> > the latest address? Again, goes back to defining the read patterns.
>> > > - Store in CF 'b' the addresses as the column identifier and simply
>> > > put '1' in the cell, or a hash.
>> > >
>> >
>> > Same problem as the last approach.
>> > >
>> > > The first option give me very quick information about the number of
>> > > addresses, but if I need to add one address, I have to update the 2
>> > > CF. Same if I have to remove one.
>> > > The second option will allow me to add any address even without
>> > > checking if it's already there. I can remove one very quickly and add
>> > > one very quickly. If I want to know the number of addresses, I have
>> > > to
>> > > retreive all the columns in the CF and count them. However, I'm
>> > > storing almost the same information twice. One time with the address,
>> > > one time with the hash (CRC32).
>> > > The 3rd option has all the advantages of the second one but also,
>> > > it's
>> > > not storing the information twice. However, that might result in VERY
>> > > long column names. And I'm not sure it's good. Like, if I just want
>> > > to
>> > > know how many address this person has, I will still need to download
>> > > them totally on the client side to count them.
>> > >
>> >
>> >
>> > Long column qualifiers are perfectly fine and take the same amount of
>> > disk
>> > space as storing the data in the cells. I don't believe that should be
>> > a
>> > concern.
>> > >
>> > > I'm not able to find which solution I should use. All of them have
>> > > some pros and cons. And I'm not advanced enought in HBase to forsee
>> > > the issues I will have later with one format or the other.
>> > >
>> > > If I look at the online documentation (
>> > > http://hbase.apache.org/book.html#keysize ) it seems the 3rd option
>> > > is
>> > > not a good one. So I might have to choose between the 2 first one.
>> > >
>> > > Does anyone have any advice/recommandation regarding the best of the
>> > > 2
>> > > formats I should use? Or even maybe there is some other options I
>> > > have
>> > > not yet figured?
>> > >
>> > > Thanks
>> > >
>> > > JM
>
>

Re: Advices for HTable schema

Reply via email to