How to delete rows in a FIFO manner

2010-08-06 Thread Thomas Downing

Hi,

Continuing with testing HBase suitability in a high ingest rate
environment, I've come up with a new stumbling block, likely
due to my inexperience with HBase.

We want to keep and purge records on a time basis: i.e., when
a record is older than, say, 24 hours, we want to purge it from
the database.

The problem I am encountering is that the only way I've found
to delete records keyed by an arbitrary but strongly time-ordered
row id is to scan for rows from a lower bound to an upper
bound, then build an array of Delete objects using

    // one Delete per row key returned by the scan
    List<Delete> deletes = new ArrayList<Delete>();
    for (Result result : scanner)
        deletes.add(new Delete(result.getRow()));

This method is far too slow to keep up with our ingest rate; the
iteration over the Results in the ResultScanner is the bottleneck,
even though the Scan is limited to a single small column in the
column family.

The obvious but naive solution is to use a sequential row id
where the lower and upper bound can be known.  This would
allow the building of the array of Delete objects without a scan
step.  The problem with this approach is how to guarantee a
sequential and non-colliding row id across more than one Put'ing
process, and do so efficiently.  As it happens, I can do this, but
given the details of my operational requirements, it's not a simple
thing to do.
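
One coordination-free way to picture this, offered only as a sketch and not something proposed in this thread: give each of N writers its own residue class, so writer k issues ids k, k+N, k+2N, and so on. The class `StripedRowIds` and its methods are hypothetical, and a fixed, known writer count is assumed. Ids stay unique across writers, but they are only loosely ordered in time between writers, and a range can contain ids that were never written, so a purger built on this has to tolerate deleting rows that do not exist.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: each writer owns one residue class modulo the writer count. */
final class StripedRowIds {
    private final int writerIndex;   // 0 <= writerIndex < writerCount
    private final int writerCount;   // total number of Put'ing processes (assumed fixed)
    private final AtomicLong next = new AtomicLong(0);

    StripedRowIds(int writerIndex, int writerCount) {
        if (writerCount <= 0 || writerIndex < 0 || writerIndex >= writerCount) {
            throw new IllegalArgumentException("bad writer index/count");
        }
        this.writerIndex = writerIndex;
        this.writerCount = writerCount;
    }

    /** Next id for this writer: writerIndex, writerIndex + writerCount, ... */
    long nextId() {
        return next.getAndIncrement() * writerCount + writerIndex;
    }

    /** Fixed-width big-endian key, so byte order matches numeric order for non-negative ids. */
    static byte[] toRowKey(long id) {
        return ByteBuffer.allocate(Long.BYTES).putLong(id).array();
    }

    /** Every candidate id in [lo, hi): Deletes can be built from these without a scan. */
    static List<Long> idsInRange(long lo, long hi) {
        List<Long> ids = new ArrayList<>();
        for (long id = lo; id < hi; id++) {
            ids.add(id);
        }
        return ids;
    }
}
```

The trade-off is that the purger needs some out-of-band way to learn which id range is older than the cut-off, for example by having writers record the id they reached at each hour boundary.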

So I was hoping that I had just missed something.  The ideal
would be a Delete object that would take row id bounds in the
same way that Scan does, allowing the work to be done all
on the server side.  Does this exist somewhere?  Or is there
some other way to skin this cat?

Thanks

Thomas Downing


Re: How to delete rows in a FIFO manner

2010-08-06 Thread Jean-Daniel Cryans

If the inserts are coming from more than 1 client, and you are trying
to delete from only 1 client, then it likely won't work. You could try
using a pool of deleters (multiple threads that delete rows) that you
feed from the scanner. Or you could run a MapReduce job that takes
your table as input and outputs Delete objects, which would
parallelize that for you.
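
The deleter-pool idea can be sketched roughly as follows. To stay self-contained it uses only the JDK: `RowDeleter` is a hypothetical stand-in for whatever code sends one batch of deletes to the table, and `DeleterPool`, the thread count, and the batch size are illustrative assumptions rather than part of the HBase client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for the code that sends one batch of deletes to the table. */
interface RowDeleter {
    void deleteBatch(List<byte[]> rowKeys) throws Exception;
}

/** One thread iterates the scanned row keys; a fixed pool of workers issues the deletes. */
final class DeleterPool {
    /** Returns the number of row keys handed to the workers. */
    static int purge(Iterable<byte[]> scannedRowKeys, RowDeleter deleter,
                     int threads, int batchSize) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        int submitted = 0;
        try {
            List<byte[]> batch = new ArrayList<>(batchSize);
            for (byte[] rowKey : scannedRowKeys) {
                batch.add(rowKey);
                submitted++;
                if (batch.size() == batchSize) {
                    pending.add(submit(pool, deleter, batch));
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                pending.add(submit(pool, deleter, batch));
            }
            for (Future<?> f : pending) {
                f.get();  // waits for the batch and rethrows any worker failure
            }
        } finally {
            pool.shutdown();
        }
        return submitted;
    }

    private static Future<?> submit(ExecutorService pool, RowDeleter deleter,
                                    List<byte[]> batch) {
        // Returning null makes this lambda a Callable, so checked exceptions are allowed.
        return pool.submit(() -> { deleter.deleteBatch(batch); return null; });
    }
}
```

This only parallelizes the delete side; the single loop over the scanned keys is still sequential, and a long-running version would also want to cap the number of batches in flight.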

J-D


Re: How to delete rows in a FIFO manner

2010-08-06 Thread Venkatesh

I wrestled with that idea of time-bounded tables. Would it make it harder to
write code and run MapReduce jobs on multiple tables? Also, how do you decide
when to do the cut-over (start of a new day, week, month)? And if you do, how
do you efficiently process data that crosses those time boundaries?
I guess that is not your requirement.

If it is a fixed-time cut-over, isn't it enough to set the TTL timestamp?
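
For context, TTL in HBase is a per-column-family setting expressed in seconds (in the Java client of that era it is set with `HColumnDescriptor.setTimeToLive(int)`). To my understanding, cells whose timestamp is older than the TTL are no longer returned to readers and are physically removed during compaction, so no client-side Delete is needed. The snippet below is only a JDK-only model of that expiry rule with hypothetical names (`TtlRule`, `isExpired`); it is not HBase code.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical, JDK-only model of a time-to-live check; not part of the HBase API. */
final class TtlRule {
    /** 24 hours in seconds, matching the retention window discussed in this thread. */
    static final int TTL_SECONDS = (int) TimeUnit.HOURS.toSeconds(24);

    /** A cell counts as expired once its age strictly exceeds the TTL. */
    static boolean isExpired(long cellTimestampMillis, long nowMillis, int ttlSeconds) {
        long ageMillis = nowMillis - cellTimestampMillis;
        return ageMillis > TimeUnit.SECONDS.toMillis(ttlSeconds);
    }
}
```

Because expiry is driven by the cell timestamp, this only matches a FIFO purge if records are written with timestamps that reflect their ingest time.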


Interesting thread, thanks.

-----Original Message-----
From: Thomas Downing tdown...@proteus-technologies.com
To: user@hbase.apache.org
Sent: Fri, Aug 6, 2010 11:39 am
Subject: Re: How to delete rows in a FIFO manner


Thanks for the suggestions.  The problem isn't generating the 
Delete objects, or the delete operation itself - both are fast 
enough.  The problem is generating the list of row keys from 
which the Delete objects are created. 
 
For now, the obvious work-around is to create and drop 
tables on the fly, using HBaseAdmin, with the tables being 
time-bounded. When the high end of a table passes the expiry 
time, just drop the table. When a table is written with the first 
record greater than the low bound, create a new table for the 
next time interval. 
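
The bookkeeping behind this work-around can be sketched as below, assuming for illustration hourly tables and a 24-hour retention window. The names (`TimeBucketedTables`, the `events_` prefix used in the usage note, the bucket width) are assumptions, and the actual create and drop calls through HBaseAdmin are left out so the sketch depends only on the JDK.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that maps timestamps to time-bounded table names. */
final class TimeBucketedTables {
    private final String prefix;
    private final long bucketMillis;
    private final long retentionMillis;

    TimeBucketedTables(String prefix, Duration bucket, Duration retention) {
        this.prefix = prefix;
        this.bucketMillis = bucket.toMillis();
        this.retentionMillis = retention.toMillis();
    }

    /** Start (inclusive) of the bucket containing the given time. */
    long bucketStart(Instant t) {
        return Math.floorDiv(t.toEpochMilli(), bucketMillis) * bucketMillis;
    }

    /** Name of the table that a record with this timestamp is written to. */
    String tableFor(Instant recordTime) {
        return prefix + bucketStart(recordTime);
    }

    /** True once the table's high end (exclusive bucket end) has passed the expiry time. */
    boolean isExpired(long tableBucketStart, Instant now) {
        long highEnd = tableBucketStart + bucketMillis;
        return highEnd <= now.toEpochMilli() - retentionMillis;
    }

    /** Of the known bucket starts, the names of the tables that can be dropped whole. */
    List<String> tablesToDrop(List<Long> knownBucketStarts, Instant now) {
        List<String> drop = new ArrayList<>();
        for (long start : knownBucketStarts) {
            if (isExpired(start, now)) {
                drop.add(prefix + start);
            }
        }
        return drop;
    }
}
```

A writer would call `tableFor(Instant.now())` with, say, `new TimeBucketedTables("events_", Duration.ofHours(1), Duration.ofHours(24))` and create the table if it does not exist yet; a periodic task would disable and drop whatever `tablesToDrop` returns.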
 
As I am having other problems related to high ingest rates, 
the fact may be that I am just using the wrong tool for the job. 
 
Thanks 
 
td 
 

Re: How to delete rows in a FIFO manner

2010-08-06 Thread Thomas Downing

Our problem does not require significant map/reduce ops, and
queries tend to be for sequential rows with the timeframe being
the primary consideration.  So time-bounded tables are not a
big hurdle, as they might be if other columns were primary keys
or mattered for query or map/reduce ops.

TTL timestamp - that may be just the magic I was looking for...
thanks, I'll look at that.

td
