You've understood correctly, Michel, and thank you for your suggestions. I 
think I'll take the second and manually do TTL.

Andrew - I somewhat oversimplified my use case; happy to explain in full but 
it's probably OTT.  I am intrigued by your idea and certainly hadn't thought 
of anything that "clever/devious" (both meant in good ways); I'm not sure I 
can use it for this problem but it's certainly something that I will bear in 
mind.  I did think in terms of map reduce at first, but it seemed like the 
best I could get was to write a huge file of valid IDs into Hadoop and then 
map-side join on them inside the job while iterating the table.  A simple 
reading/deleting client process seemed to simplify the operations for the 
first pass - there will only be a few million rows on 5 or 6 nodes.

Thanks for the advice, likely to have more questions soon!

Merry Christmas

Richard


On Dec 22, 2011, at 18:09, Andrew Purtell <[email protected]> wrote:

>> I plan to use HBase to store data that has a variable length lifespan
> 
>> [...]
> 
> Indeed, the simplest approach is usually best.
> 
> The simplest way to manage automatic expiration of data over various 
> lifetimes, especially if there are only a few of them, like in your case (3 
> days versus 3 months): Create a column family for each. Store into a given 
> column family as appropriate. Get or Scan with families included as needed, 
> will retrieve all of the nonexpired data in the row in the given families.
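> 
> For illustration (a sketch, not from this thread; the table and family
> names are made up), the HBase shell can set a per-family TTL in seconds,
> e.g. 3 days = 259200 and roughly 3 months = 7776000:
> 
>   create 'events', {NAME => 'd3', TTL => 259200}, {NAME => 'm3', TTL => 7776000}
> 
> Writes then go to whichever family matches the desired lifespan, and the
> region server drops expired cells at read and compaction time.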
> 
>> I don’t think I can use standard TTL for this because that’s fixed at a
>> column family level. 
> 
> 
> Is that really the case?
> 
> I had a use case once where most data was not useful after a couple of weeks, 
> but some data occasionally needed to be promoted to permanent storage. It 
> wasn't convenient to model the transient data and permanent data as separate 
> entities. You might think TTLs couldn't be used for that. However, we created 
> two column families; one with a TTL, one without; a very simple maintenance 
> mapreduce job, run from crontab on the jobtracker, for copying from one to 
> the other, and we were able to use filters to reduce the work this job needed 
> to do; and a very thin presentation layer to give users the illusion that 
> these entities were stored "in the same place" (we needed to give them a REST 
> API anyway). This worked well. There was some modest penalty on read for 
> accessing two stores instead of one, but the performance was within the 
> bounds we needed.
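> 
> A minimal sketch of that layout (names illustrative; the TTL here assumes
> the "couple of weeks" lifetime, 14 days = 1209600 seconds):
> 
>   create 'items', {NAME => 'transient', TTL => 1209600}, {NAME => 'permanent'}
> 
> The maintenance job then copies the cells worth keeping from 'transient'
> into 'permanent' before the TTL reaps them.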
> 
> Best regards,
> 
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
> Tom White)
> 
> 
> ----- Original Message -----
>> From: Michel Segel <[email protected]>
>> To: "[email protected]" <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Sent: Thursday, December 22, 2011 5:21 AM
>> Subject: Re: Data management strategy
>> 
>> Richard,
>> 
>> Let's see if I understand what you want to do...
>> 
>> You have some data and you want to store it in some table A.
>> Some of the records/rows in this table have a limited life span of 3 days, 
>> others have a limited life span of 3 months. But both are the same records? 
>> By this I mean that both records contain the same type of data but there is 
>> some business logic that determines which record gets deleted 
>> (like purge all records that haven't been accessed in the last 3 days).
>> 
>> If what I imagine is true, you can't use the standard TTL unless you know 
>> that after a set N hours or days the record will be deleted. Like all 
>> records will self destruct 30 days past creation.
>> 
>> The simplest solution would be to have a column that contains a timestamp 
>> of last access and your application controls when this field gets updated. 
>> Then using cron, launch a job that scans the table and removes the rows 
>> which meet your delete criteria.
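>> 
>> As a sketch only (the table name, 'meta:last_access' column, and 3-day
>> cutoff are assumptions), such a sweep could be a JRuby script run from
>> cron through the HBase runtime:
>> 
>>   include Java
>>   import org.apache.hadoop.hbase.HBaseConfiguration
>>   import org.apache.hadoop.hbase.client.HTable
>>   import org.apache.hadoop.hbase.client.Scan
>>   import org.apache.hadoop.hbase.client.Delete
>>   import org.apache.hadoop.hbase.util.Bytes
>> 
>>   conf   = HBaseConfiguration.create
>>   table  = HTable.new(conf, 'events')
>>   cutoff = java.lang.System.currentTimeMillis - (3 * 24 * 3600 * 1000)
>> 
>>   scan = Scan.new
>>   scan.addColumn(Bytes.toBytes('meta'), Bytes.toBytes('last_access'))
>>   table.getScanner(scan).each do |r|
>>     ts = Bytes.toLong(r.getValue(Bytes.toBytes('meta'),
>>                                  Bytes.toBytes('last_access')))
>>     table.delete(Delete.new(r.getRow)) if ts < cutoff
>>   end
>> 
>> Run it with "hbase org.jruby.Main sweep.rb", then major-compact the table 
>> so the deletes are physically applied.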
>> 
>> Since co-processors are new (not yet in any of the commercial releases), I 
>> would suggest keeping the logic simple. You can always refactor your code 
>> to use co-processors when you've had time to play with them.
>> 
>> Even with coprocessors, because the data dies an arbitrary death, you will 
>> still have to purge the data yourself. Hence the cron job that marks the 
>> record for deletion and then does a major compaction on the table to 
>> really delete the rows...
>> 
>> Of course the standard caveats apply, assuming I really did understand 
>> what you wanted...
>> 
>> Oh and KISS is always the best practice... :-)
>> 
>> Sent from a remote device. Please excuse any typos...
>> 
>> Mike Segel
>> 
>> On Dec 21, 2011, at 12:03 PM, Richard Lawrence <[email protected]> 
>> wrote:
>> 
>>> Hi
>>> 
>>> I was wondering if I could seek some advice about data management in 
>>> HBase?  I plan to use HBase to store data that has a variable length 
>>> lifespan; the vast majority will be short but occasionally the data 
>>> lifetime will be significantly longer (3 days versus 3 months).  Once the 
>>> lifespan is over I need the data to be deleted at some point in the near 
>>> future (within a few days is fine).  I don't think I can use standard TTL 
>>> for this because that's fixed at a column family level.  Therefore, my 
>>> plan was to run a script every few days that looks through external 
>>> information for what needs to be kept and then updates HBase in some way 
>>> so that it can understand.  With the data in HBase I can then use the 
>>> standard TTL mechanism to clean up.
>>> 
>>> The two ways I can think of to let HBase know are:
>>> 
>>> 1. Add a co-processor that updates the timestamp on each read and then 
>>> have my process simply read the data.  I shied away from this because the 
>>> documentation indicated the co-processor can't take row locks.  Does that 
>>> imply that it shouldn't modify the underlying data?  For my use case the 
>>> timestamp doesn't have to be perfect; the keys are created in such a way 
>>> that the underlying data is fixed at creation time.
>>> 2. Add an extra column to each row that's a cache flag and rewrite that 
>>> at various intervals so that the timestamp updates and prevents the TTL 
>>> from deleting it.
>>> 
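>>> (A shell-level sketch of that second option, with made-up names: 
>>> re-putting the flag cell gives it a fresh timestamp, which restarts the 
>>> TTL clock for that cell:
>>> 
>>>   put 'events', 'some-row', 'meta:keep', '1'
>>> 
>>> Note that TTL expiry is per cell, so only the rewritten column is kept 
>>> alive; the row's other columns still expire on their own timestamps.)
>>> 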
>>> Are there other best practice alternatives?
>>> 
>>> Thanks
>>> 
>>> Richard
>>> 
>> 
