Hey folks,

I am prototyping HBase to handle our de-duplication needs using checkAndPut. Right now I do de-duplication with map/reduce; I have since built a more real-time system, and this is the last piece to polish off. I have a few questions and thoughts I wanted to bounce around and get some feedback on, please. Thanks!
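For anyone not familiar with the pattern, here is a rough sketch of the check-then-write shape I am relying on. It is not HBase client code: an in-memory map stands in for the table, and the names (DedupSketch, firstSighting, the key format) are made up for illustration. The only point is that the "is this key absent?" check and the write happen as one atomic step.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustration only: an in-memory stand-in for the de-dup table.
// All names here are hypothetical; this is not the HBase client API.
final class DedupSketch {
    private final ConcurrentMap<String, Boolean> seen = new ConcurrentHashMap<>();

    // Returns true if this key has not been seen before (and records it),
    // false if it is a duplicate. putIfAbsent does the check and the write
    // atomically, which is the property the de-dup step needs.
    boolean firstSighting(String eventKey) {
        return seen.putIfAbsent(eventKey, Boolean.TRUE) == null;
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch();
        System.out.println(dedup.firstSighting("device42|event-1")); // new key
        System.out.println(dedup.firstSighting("device42|event-1")); // duplicate
    }
}
```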
First, I want to be able to delete rows that are older than a trailing time period (say, 6 months). I don't think I can use TTL for this, unless I can override the timestamp on insert, and even if I can, I'm not sure it is a good idea to have billions of rows deleted by TTL each day.

Our system is asynchronous and we store billions of pieces of data per day. In such a system I could receive data from a mobile device today with a timestamp from November (or whenever): the user was offline the last time they used the app, and that data only reaches me now that they have connected to the internet and used the app again.

One thought I had was a table for each day, so that I could delete whenever I wanted to. That seems like a bit of a nightmare. Maybe by month? Or week? Week feels better. I am also a little worried about having trillions of rows in a single table, but maybe that is not an issue? Just dumping everything in one mega table does not feel right.

So far my load tests are going well, but there is a lot more to go. I am thinking of turning on bloom filters (compression is already on), as I will have lots of misses: most of the data (90%+) is NOT duplicate but real. There are a bunch of other things I am learning as I go, iterating with each change to our de-duplication code.

I have been really happy and impressed so far with HBase. Great job everyone, and thanks! My next step may just end up being to jump into the code so I can get a better sense of these things, but I appreciate any help, either with my questions or with pointers into the code. (Being on the east coast, I feel thousands of miles away from the action, the meetups and the rest, but I look forward to getting more into things.)

Regards
--
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
Twitter: @allthingshadoop <http://twitter.com/#!/allthingshadoop>
*/
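P.S. To make the table-per-week idea concrete, here is a minimal sketch of how the bucketing could work. It assumes tables are keyed by the event's own timestamp (not arrival time), uses ISO weeks in UTC, and invents a naming scheme (dedup_<year>w<week>); none of this is HBase API, just plain Java. Late data from November would land in its November table, and expiring old data becomes dropping whole tables whose name sorts before the cutoff week.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.IsoFields;
import java.util.Locale;

// Illustration only: the class, method names, and table naming scheme
// are hypothetical.
final class WeeklyTables {
    // Maps an event timestamp (epoch millis) to the name of its weekly table.
    // Zero-padded fields make lexicographic order match chronological order.
    static String tableFor(long eventMillis) {
        LocalDate day = Instant.ofEpochMilli(eventMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate();
        int year = day.get(IsoFields.WEEK_BASED_YEAR);
        int week = day.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        return String.format(Locale.ROOT, "dedup_%04dw%02d", year, week);
    }

    // A table is expired when its name sorts before the table that holds
    // the cutoff instant (now minus the retention period).
    static boolean isExpired(String tableName, long nowMillis, int retentionWeeks) {
        long cutoffMillis = nowMillis - Duration.ofDays(7L * retentionWeeks).toMillis();
        return tableName.compareTo(tableFor(cutoffMillis)) < 0;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        String current = tableFor(now);
        System.out.println("current table: " + current);
        System.out.println("expired under 26 weeks? " + isExpired(current, now, 26));
    }
}
```

The trade-off is that a de-dup check only looks in the table for the event's own week, which is fine if a duplicate always carries the same event timestamp as the original.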
