Re: many tables vs long rows

Stack Tue, 03 Jan 2012 12:14:48 -0800

On Tue, Jan 3, 2012 at 6:39 AM, Joe Stein
<[email protected]> wrote:
> So, first I want to be able to delete rows that are older than a time
> period (like 6 months trailing).  The issue here is I don't think I can use
> TTL (unless I can override the timestamp on insert and even if I did not
> sure that is good for just billions of rows to get deleted by TTL each day).
>


TTL check happens (mostly) when you major compact so you can control
it somewhat.

There is a difference between a TTL and an explicit delete.  With the
former, older cells are just dropped at compaction time.  With the
latter, a new delete record is added and it's acted on at query time.
There are also different kinds of deletes: explicit deletes of
explicit cells (a new entry in hbase per cell to be deleted) and a
column family delete, which is a single entry at the start of a row
for the deleted column family.

I raise the above so you see that doing explicit deletes 'costs' more
than TTL'ing.
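
To make the cost difference concrete, here is a toy Python model. It is
not HBase code: the ToyStore class and its methods are invented for
illustration, and it assumes (as a simplification) that TTL is applied
only at major compaction, while an explicit delete writes an extra
marker record that every read has to consult until compaction removes it.

```python
class ToyStore:
    """Toy model only; names are hypothetical, not the HBase API."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.cells = {}    # (row, column) -> write timestamp in seconds
        self.markers = {}  # (row, column) -> delete-marker timestamp

    def put(self, row, column, ts):
        self.cells[(row, column)] = ts

    def delete(self, row, column, ts):
        # An explicit delete adds a new record instead of removing data.
        self.markers[(row, column)] = ts

    def get(self, row, column):
        # Reads must check for a delete marker covering the cell.
        key = (row, column)
        ts = self.cells.get(key)
        marker = self.markers.get(key)
        if ts is None or (marker is not None and marker >= ts):
            return None
        return ts

    def records(self):
        # Total entries stored, counting delete markers.
        return len(self.cells) + len(self.markers)

    def major_compact(self, now):
        # Rewrite the data, dropping expired and deleted cells and
        # discarding the markers themselves.
        survivors = {}
        for key, ts in self.cells.items():
            expired = now - ts > self.ttl
            marker = self.markers.get(key)
            deleted = marker is not None and marker >= ts
            if not expired and not deleted:
                survivors[key] = ts
        self.cells = survivors
        self.markers = {}
```

In this model, deleting a cell grows records() by one until the next
major_compact(), whereas an expired cell costs nothing extra: it is
simply left out when the data is rewritten.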


> Our system is asynchronous and we store > billions of pieces of data per day
> and in such a system I could receive data from a mobile device today with a
> timestamp from November (or whatever) because now is when the user
> connected to the internet and also used the app I am receiving data for the
> last time they used it but was not connected to the internet.
>

Do you want to keep the cell for 6 months since you 'saw' it -- if so,
you could TTL it -- or for 6 months after the event happened?  (For the
latter, the cell timestamp would be the event timestamp.)
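
A small Python sketch of how the two choices differ for a late-arriving
record. The dates are made up for illustration, "6 months" is
approximated as 180 days, and expires_at is a hypothetical helper, not
anything from HBase.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "6 months" approximated as 180 days.
RETENTION = timedelta(days=180)

# A TTL is usually configured in seconds; this is the same window.
TTL_SECONDS = int(RETENTION.total_seconds())


def expires_at(cell_timestamp):
    """Earliest moment a cell with this timestamp is past the TTL."""
    return cell_timestamp + RETENTION


# Hypothetical record: the app was used in November, the upload arrived in January.
event_time = datetime(2011, 11, 15, tzinfo=timezone.utc)
seen_time = datetime(2012, 1, 3, tzinfo=timezone.utc)

# Using the arrival time as the cell timestamp keeps the cell longer
# than using the event time, by exactly the upload delay.
extra_retention = expires_at(seen_time) - expires_at(event_time)
```

With the event time as the cell timestamp, data that arrives more than
180 days late would already be past its TTL when written.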


> So one thought I had was a table for each day this way I could delete
> whenever I wanted to ... this seems like a bit of a nightmare, maybe by
> month? or week? week feels better....
>

You could do that, but it sounds like the table-per-month would have data
from outside of the month?  You'd be ok w/ this?  You'd need to
figure out how to do the x-months view.
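
A rough Python sketch of the bookkeeping a table-per-month layout would
need. The events_YYYY_MM naming scheme and both helper functions are
assumptions made up for illustration; nothing here talks to HBase.

```python
from datetime import date


def table_for(day, prefix="events"):
    """Name of the monthly table a given date maps to (hypothetical scheme)."""
    return f"{prefix}_{day.year:04d}_{day.month:02d}"


def tables_for_view(end, months, prefix="events"):
    """Tables a client would have to read for a trailing N-month view."""
    names = []
    year, month = end.year, end.month
    for _ in range(months):
        names.append(table_for(date(year, month, 1), prefix))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(names))


# A November event uploaded in January lands in a different table
# depending on which date is used for routing.
by_event = table_for(date(2011, 11, 15))
by_arrival = table_for(date(2012, 1, 3))
```

Routing by arrival date is what puts out-of-month data in a table;
routing by event date keeps each table clean but means writes keep
landing in old tables, so dropping one has to wait for stragglers.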

> I guess I am also a little worried about having trillions of rows in a
> table but maybe that is not an issue????  just dumping everything in one
> mega table just does not feel right.
>


HBase deals in regions; it doesn't care if they are of one table or many.


> So far my load tests are going well but there is a lot more to-go, I am
> thinking of turning on bloomfilters (already have compression on) as I will
> have lots of misses (most of the data 90%+ is NOT duplicate but real) a
> bunch of other things I am learning as I go trying to iterate with each
> change to our de-duplication code.  I have been really happy and impressed
> so far with HBase, great job everyone and thanks!
>

I'd say don't do blooms till you have 0.92 up on your cluster (Are you
0.92'ing it or 0.90?).  They've been much improved in 0.92.


> I guess my next step may just end up being to jump into the code so I can
> get a better sense of these things but appreciate any help either in my
> questions or pointing things through the code (being on the east coast I
> feel thousands of miles away from the action and meetups and the rest but
> look forward getting more into things).
>

Good on you Joe (You saw that I asked for your wiki name so I could
add you as editor for hbase pages?)

St.Ack
