On Tue, Jan 3, 2012 at 6:39 AM, Joe Stein <[email protected]> wrote:
> So, first I want to be able to delete rows that are older than a time
> period (like 6 months trailing). The issue here is I don't think I can use
> TTL (unless I can override the timestamp on insert and even if I did not
> sure that is good for just billions of rows to get deleted by TTL each day).
>

TTL check happens (mostly) when you major compact, so you can control it
somewhat.

There is a difference between a TTL and an explicit delete. With the
former, older cells are just dropped at compaction time. With the latter,
a new delete record is added and it is acted on at query time. There are
also different kinds of explicit deletes: a delete of explicit cells (a
new entry in hbase per cell to be deleted) and a column family delete,
which is a single entry at the start of a row for the deleted column
family. I raise the above so you see that doing explicit deletes 'costs'
more than TTL'ing.

> Our system is asyncronous and we store
> billions of pieces of data per day
> and in such a system I could receive data from a mobile device today with a
> timestamp from November (or whatever) because now is when the user
> connected to the internet and also used the app I am receiving data for the
> last time they used it but was not connected to the internet.
>

Do you want to keep the cell for 6 months from when you 'saw' it (if so,
you could TTL it?) or for 6 months after the event happened? (For the
latter, the cell timestamp would be the event timestamp.)

> So one thought I had was a table for each day this way I could delete
> whenever i wanted to ... this seems like a bit of a nightmare, maybe by
> month? or week? week feels better....
>

You could do that, but it sounds like the table-per-month would have data
from outside of the month? You'd be ok w/ this? You'd also need to figure
out how to do the x-months view across tables.

> I guess I am also a little worried about having trillions of rows in a
> table but maybe that is not an issue???? just dumping everything in one
> mega table just does not feel right.
>

HBase deals in regions; it doesn't care if they are of one table or many.
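
To make the "6 months since you saw it" versus "6 months after the event"
question above concrete, here is a minimal sketch in plain Python. It does
not call HBase; it only models the rule that a TTL is measured from the
cell's timestamp. The 180-day TTL, the dates, and the function names are
all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed stand-in for "6 months"; pick whatever retention you really want.
TTL = timedelta(days=180)


def expires_at(cell_timestamp, ttl=TTL):
    """Moment after which a cell with this timestamp is past its TTL."""
    return cell_timestamp + ttl


def is_expired(cell_timestamp, now, ttl=TTL):
    """True if a compaction running at `now` could drop the cell."""
    return now - cell_timestamp > ttl


# Hypothetical late-arriving record: the event happened in November,
# but the device only connected and uploaded it on Jan 3.
event_time = datetime(2011, 11, 15, tzinfo=timezone.utc)
arrival_time = datetime(2012, 1, 3, tzinfo=timezone.utc)

# Option 1: stamp the cell with the arrival time (the default behaviour
# when no timestamp is supplied), i.e. keep 6 months from when it was seen.
keep_since_seen = expires_at(arrival_time)

# Option 2: stamp the cell with the event time, i.e. keep 6 months from
# when the event happened. This assumes the writer sets the timestamp itself.
keep_since_event = expires_at(event_time)

print("expires (stamped on arrival):", keep_since_seen.date())
print("expires (stamped with event):", keep_since_event.date())
```

With the second option a record that arrives very late can already be
close to, or past, its TTL when it is written, so it may disappear at the
next major compaction. Whether that is acceptable depends on which of the
two retention rules is really wanted.
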

> So far my load tests are going well but there is a lot more to-go, I am
> thinking of turning on bloomfilters (already have compression on) as I will
> have lots of misses (most of the data 90%+ is NOT duplicate but real) a
> bunch of other things I am learning as I go trying to iterate with each
> change to our de-duplication code. I have been really happy and impressed
> so far with HBase, great job everyone and thanks!
>

I'd say don't do blooms till you have 0.92 up on your cluster (Are you
0.92'ing it or 0.90?). They've been much improved in 0.92.

> I guess my next step may just end up being to jump into the code so I can
> get a better sense of these things but appreciate any help either in my
> questions or pointing things through the code (being on the east coast I
> feel thousands of miles away from the action and meetups and the rest but
> look forward getting more into things).
>

Good on you Joe (You saw that I asked for your wiki name so I could add
you as editor for hbase pages?)

St.Ack
