You've understood correctly, Michel, and thank you for your suggestions; I
think I'll take the second and do the TTL manually.
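
For concreteness, a minimal sketch of what such a manual-TTL sweep could
look like with the Java client of the period - the table, family, and
qualifier names are invented, and the last-access cell is assumed to hold a
long of epoch milliseconds:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ManualTtlSweep {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");   // hypothetical table
        byte[] fam = Bytes.toBytes("d");              // hypothetical family
        byte[] qual = Bytes.toBytes("last_access");   // hypothetical column

        // Anything not touched in the last 3 days is considered expired.
        long cutoff = System.currentTimeMillis() - 3L * 24 * 60 * 60 * 1000;

        Scan scan = new Scan();
        scan.addColumn(fam, qual);                // fetch only what we need
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            byte[] v = r.getValue(fam, qual);
            if (v != null && Bytes.toLong(v) < cutoff) {
              table.delete(new Delete(r.getRow()));  // purge the whole row
            }
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }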
Andrew - I somewhat oversimplified my use case; happy to explain in full but
it's probably OTT. I am intrigued by your idea and certainly hadn't thought
of anything that "clever/devious" (both meant in good ways); I'm not sure I
can use it for this problem, but it's certainly something I will bear in
mind. I did think in terms of map reduce at first, but it seemed like the
best I could get was to write a huge file of valid IDs into Hadoop and then
map-side join on them inside the job while iterating the table. A simple
reading/deleting client process seemed to simplify the operations for the
first pass - there will only be a few million rows on 5 or 6 nodes.

Thanks for the advice, likely to have more questions soon!

Merry Christmas

Richard

On Dec 22, 2011, at 18:09, Andrew Purtell <[email protected]> wrote:

>> I plan to use HBase to store data that has a variable length lifespan
>> [...]
>
> Indeed, the simplest approach is usually best.
>
> The simplest way to manage automatic expiration of data over various
> lifetimes, especially if there are only a few of them, as in your case
> (3 days versus 3 months): create a column family for each, and store into
> the appropriate column family. A Get or Scan with the relevant families
> included will retrieve all of the non-expired data in the row.
>
>> I don't think I can use standard TTL for this because that's fixed at a
>> column family level.
>
> Is that really the case?
>
> I had a use case once where most data was not useful after a couple of
> weeks, but some data occasionally needed to be promoted to permanent
> storage. It wasn't convenient to model the transient data and permanent
> data as separate entities. You might think TTLs couldn't be used for
> that. However, we created two column families, one with a TTL and one
> without; a very simple maintenance MapReduce job, run from crontab on the
> jobtracker, copied from one to the other, and we were able to use filters
> to reduce the work this job needed to do; and a very thin presentation
> layer gave users the illusion that these entities were stored "in the
> same place" (we needed to give them a REST API anyway). This worked well.
> There was some modest penalty on read for accessing two stores instead of
> one, but the performance was within the bounds we needed.
>
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
> ----- Original Message -----
>> From: Michel Segel <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Sent: Thursday, December 22, 2011 5:21 AM
>> Subject: Re: Data management strategy
>>
>> Richard,
>>
>> Let's see if I understand what you want to do...
>>
>> You have some data and you want to store it in some table A. Some of the
>> records/rows in this table have a limited life span of 3 days, others
>> have a limited life span of 3 months. But both are the same records? By
>> this I mean that both records contain the same type of data, but there
>> is some business logic that determines which record gets deleted (like
>> "purge all records that haven't been accessed in the last 3 days").
>>
>> If what I imagine is true, you can't use the standard TTL unless you
>> know that after a set N hours or days the record will be deleted - like
>> all records will self-destruct 30 days past creation.
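
To make Andrew's column-family-per-lifetime suggestion above concrete, a
rough sketch using the admin API of the time; the table and family names
are invented, and note that setTimeToLive() takes seconds:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateTieredTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("events"); // invented

        HColumnDescriptor shortLived = new HColumnDescriptor("short");
        shortLived.setTimeToLive(3 * 24 * 60 * 60);    // 3 days, in seconds

        HColumnDescriptor longLived = new HColumnDescriptor("long");
        longLived.setTimeToLive(90 * 24 * 60 * 60);    // ~3 months

        desc.addFamily(shortLived);
        desc.addFamily(longLived);
        admin.createTable(desc);    // each family now expires on its own clock
      }
    }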
>>
>> The simplest solution would be to have a column that contains a
>> timestamp of last access, and your application controls when this field
>> gets updated. Then, using cron, launch a job that scans the table and
>> removes the rows which meet your delete criteria.
>>
>> Since co-processors are new - not yet in any of the commercial releases -
>> I would suggest keeping the logic simple. You can always refactor your
>> code to use co-processors when you've had time to play with them.
>>
>> Even with co-processors, because the data dies an arbitrary death you
>> will still have to purge the data yourself. Hence the cron job that
>> marks the record for deletion and then does a major compaction on the
>> table to really delete the rows...
>>
>> Of course the standard caveats apply, assuming I really did understand
>> what you wanted...
>>
>> Oh, and KISS is always the best practice... :-)
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On Dec 21, 2011, at 12:03 PM, Richard Lawrence <[email protected]>
>> wrote:
>>
>>> Hi
>>>
>>> I was wondering if I could seek some advice about data management in
>>> HBase? I plan to use HBase to store data that has a variable-length
>>> lifespan; the vast majority will be short, but occasionally the data
>>> lifetime will be significantly longer (3 days versus 3 months). Once
>>> the lifespan is over I need the data to be deleted at some point in the
>>> near future (within a few days is fine). I don't think I can use
>>> standard TTL for this because that's fixed at a column family level.
>>> Therefore, my plan was to run a script every few days that looks
>>> through external information for what needs to be kept and then updates
>>> HBase in some way it can understand. With the data in HBase I can then
>>> use the standard TTL mechanism to clean up.
>>>
>>> The two ways I can think of to let HBase know are:
>>>
>>> 1. Add a co-processor that updates the timestamp on each read, and then
>>> have my process simply read the data. I shied away from this because
>>> the documentation indicated the co-processor can't take row locks. Does
>>> that imply that it shouldn't modify the underlying data? For my use
>>> case the timestamp doesn't have to be perfect; the keys are created in
>>> such a way that the underlying data is fixed at creation time.
>>>
>>> 2. Add an extra column to each row that's a cache flag, and rewrite
>>> that at various intervals so that the timestamp updates and prevents
>>> the TTL from deleting it.
>>>
>>> Are there other best practice alternatives?
>>>
>>> Thanks
>>>
>>> Richard
>>>
>>
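
One caveat on the second option in Richard's original message: HBase
applies TTL per cell, not per row, so rewriting a single flag column only
refreshes that one cell; the other cells in the family would still expire.
A "touch" that actually preserves a row would need to rewrite each cell
that should survive - roughly as below, with placeholder names:

    import java.io.IOException;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;

    public class TtlTouch {
      // Rewrite every cell of one family for a row we want to keep. Each
      // new write gets a fresh timestamp, restarting that cell's TTL clock.
      public static void touch(HTable table, byte[] row, byte[] fam)
          throws IOException {
        Get get = new Get(row);
        get.addFamily(fam);
        Result r = table.get(get);
        if (r.isEmpty()) {
          return;                          // nothing left to refresh
        }
        Put put = new Put(row);
        for (KeyValue kv : r.raw()) {
          // same family/qualifier/value, current time as the new timestamp
          put.add(kv.getFamily(), kv.getQualifier(), kv.getValue());
        }
        table.put(put);
      }
    }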

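Finally, purely as an illustration of the promotion job Andrew describes:
his was a MapReduce job with filters, but a plain scanning client conveys
the same shape. Table and family names here are invented, and the filter
he mentions is left as a comment:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PromoteToPermanent {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "entities");  // hypothetical table
        byte[] ttlFam = Bytes.toBytes("t");           // family with a TTL
        byte[] permFam = Bytes.toBytes("p");          // family without one

        Scan scan = new Scan();
        scan.addFamily(ttlFam);
        // Andrew mentions using filters to cut the job's work down; a
        // filter on a "promote me" marker column would be set on the Scan.
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            Put put = new Put(r.getRow());
            for (KeyValue kv : r.raw()) {
              // copy each cell into the permanent (non-TTL'd) family
              put.add(permFam, kv.getQualifier(), kv.getValue());
            }
            table.put(put);
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }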