How to delete rows in a FIFO manner
Hi,

Continuing with testing HBase's suitability in a high-ingest-rate environment, I've come up against a new stumbling block, likely due to my inexperience with HBase.

We want to keep and purge records on a time basis: i.e., when a record is older than, say, 24 hours, we want to purge it from the database.

The problem I am encountering is that the only way I've found to delete records, using an arbitrary row id that is strongly ordered over time, is to scan for rows from lower bound to upper bound and then build an array of Delete objects: for each Result in the ResultScanner, add new Delete( Result.getRow( ) ) to the array.

This method is far too slow to keep up with our ingest rate; the iteration over the Results in the ResultScanner is the bottleneck, even though the Scan is limited to a single small column in the column family.

The obvious but naive solution is to use a sequential row id, where the lower and upper bounds can be known. This would allow building the array of Delete objects without a scan step. The problem with this approach is how to guarantee a sequential, non-colliding row id across more than one Put'ing process, and do it efficiently. As it happens, I can do this, but given the details of my operational requirements, it's not a simple thing to do. So I was hoping that I had just missed something.

The ideal would be a Delete object that takes row id bounds in the same way that Scan does, allowing the work to be done entirely on the server side. Does this exist somewhere? Or is there some other way to skin this cat?

Thanks
Thomas Downing
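As a rough illustration of the difference described in the post, the sketch below models a table as an in-memory sorted map. It is not HBase code: RangePurgeModel and its methods are hypothetical names, and the long row keys are an assumption made for brevity. scanThenDelete mirrors the scan-and-collect pattern, while deleteRange models the bounded delete being asked for.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory stand-in for a table sorted by row key (hypothetical, not an HBase class).
class RangePurgeModel {
    private final NavigableMap<Long, String> rows = new TreeMap<>();

    void put(long rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Pattern from the post: walk the rows in [lower, upper), collect their keys,
    // then issue one delete per collected key.
    int scanThenDelete(long lower, long upper) {
        List<Long> toDelete = new ArrayList<>(rows.subMap(lower, true, upper, false).keySet());
        for (Long key : toDelete) {
            rows.remove(key);
        }
        return toDelete.size();
    }

    // Wished-for operation: a single delete that takes the bounds directly,
    // so the caller never sees the individual keys.
    int deleteRange(long lower, long upper) {
        NavigableMap<Long, String> range = rows.subMap(lower, true, upper, false);
        int removed = range.size();
        range.clear(); // clearing the view removes the entries from the backing map
        return removed;
    }

    int size() {
        return rows.size();
    }
}
```

Both methods leave the map in the same state; the point is only that the first has to materialize every key on the caller's side before deleting.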
Re: How to delete rows in a FIFO manner
If the inserts are coming from more than one client and you are trying to delete from only one client, then it likely won't work.

You could try using a pool of deleters (multiple threads that delete rows) that you feed from the scanner. Or you could run a MapReduce job that parallelizes that for you: it takes your table as input and outputs Delete objects.

J-D

On Fri, Aug 6, 2010 at 5:50 AM, Thomas Downing tdown...@proteus-technologies.com wrote:
> The ideal would be a Delete object that would take row id bounds in the same way that Scan does, allowing the work to be done all on the server side. Does this exist somewhere? Or is there some other way to skin this cat?
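The "pool of deleters" suggested in this reply could look roughly like the sketch below. It uses only java.util.concurrent; DeleterPool, purge, and the deleteBatch callback are hypothetical names, and the callback stands in for whatever client call actually issues the deletes. Row keys are plain strings here purely for readability. Note that this parallelizes only the delete calls; the loop that feeds the pool is still a single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class DeleterPool {
    // Groups row keys into batches and hands each batch to a worker thread.
    // deleteBatch is a hypothetical callback standing in for the real delete call.
    // Returns the number of row keys submitted for deletion.
    static int purge(Iterable<String> rowKeys, int batchSize, int threads,
                     Consumer<List<String>> deleteBatch) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        int submitted = 0;
        try {
            List<String> batch = new ArrayList<>(batchSize);
            for (String key : rowKeys) {
                batch.add(key);
                submitted++;
                if (batch.size() == batchSize) {
                    final List<String> full = batch;
                    pending.add(pool.submit(() -> deleteBatch.accept(full)));
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                final List<String> last = batch;
                pending.add(pool.submit(() -> deleteBatch.accept(last)));
            }
            // Wait for every batch and surface any worker failure to the caller.
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return submitted;
    }
}
```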
Re: How to delete rows in a FIFO manner
I wrestled with that idea of time-bounded tables. Would it make it harder to write code or run MapReduce over multiple tables? Also, how do you decide when to do the cutover (start of a new day, week, month...)? And if you do, how do you efficiently process data that crosses those time boundaries? I guess that is not your requirement.

If it is a fixed-time cutover, isn't it enough to set the TTL timestamp?

Interesting thread, thanks.

-----Original Message-----
From: Thomas Downing tdown...@proteus-technologies.com
To: user@hbase.apache.org
Sent: Fri, Aug 6, 2010 11:39 am
Subject: Re: How to delete rows in a FIFO manner

Thanks for the suggestions. The problem isn't generating the Delete objects, or the delete operation itself - both are fast enough. The problem is generating the list of row keys from which the Delete objects are created.

For now, the obvious workaround is to create and drop tables on the fly, using HBaseAdmin, with the tables being time-bounded. When the high end of a table passes the expiry time, just drop the table. When a table is written with the first record greater than the low bound, create a new table for the next time interval.

As I am having other problems related to high ingest rates, the fact may be that I am just using the wrong tool for the job.

Thanks
td

On 8/6/2010 10:24 AM, Jean-Daniel Cryans wrote:
> You could try using a pool of deleters (multiple threads that delete rows) that you feed from the scanner. Or you could run a MapReduce that would parallelize that for you, that takes your table as an input and that outputs Delete objects.
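The bookkeeping behind the time-bounded-tables workaround can be sketched as follows. This is only the arithmetic for choosing a table per time interval and deciding which tables have aged out; the class name, the prefix-plus-bucket-index naming scheme, and all parameters are assumptions for illustration. The actual create and drop calls would go through HBaseAdmin as the message says (a table has to be disabled before it can be deleted) and are not shown.

```java
import java.util.ArrayList;
import java.util.List;

class TimeBucketTables {
    // Index of the fixed-width time interval that a timestamp falls into.
    static long bucketOf(long timestampMs, long bucketMs) {
        return Math.floorDiv(timestampMs, bucketMs);
    }

    // Hypothetical naming scheme: "<prefix>_<bucket index>".
    static String tableFor(String prefix, long timestampMs, long bucketMs) {
        return prefix + "_" + bucketOf(timestampMs, bucketMs);
    }

    // A table can be dropped once its high end is at or before (now - retention),
    // i.e. once every record it can hold is past the expiry time.
    static List<String> expiredTables(String prefix, long oldestBucket, long nowMs,
                                      long bucketMs, long retentionMs) {
        List<String> expired = new ArrayList<>();
        long cutoff = nowMs - retentionMs;
        for (long b = oldestBucket; (b + 1) * bucketMs <= cutoff; b++) {
            expired.add(prefix + "_" + b);
        }
        return expired;
    }
}
```

With one-hour buckets and 24-hour retention, a purge pass then touches at most a handful of table names instead of scanning row keys. The trade-off raised in the reply still applies: reads that span a bucket boundary have to consult more than one table.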
Re: How to delete rows in a FIFO manner
Our problem does not require significant map/reduce ops, and queries tend to be for sequential rows, with the timeframe being the primary consideration. So time-bounded tables are not a big hurdle, as they might be if other columns were primary keys or considerations for query or map/reduce ops.

TTL timestamp - that may be just the magic I was looking for... thanks, I'll look at that.

td

On 8/6/2010 11:59 AM, Venkatesh wrote:
> If it is a fixed-time cutover, isn't it enough to set the TTL timestamp?
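For context on the TTL suggestion: in HBase, time-to-live is a column family setting expressed in seconds (HColumnDescriptor.setTimeToLive(int) in the client API of that era), and it is applied against each cell's timestamp on the server side, so no client-side scan or Delete is needed. The sketch below only models the expiry rule as I understand it; TtlModel and isExpired are hypothetical names, and the exact boundary handling inside HBase is an assumption.

```java
class TtlModel {
    // Models the TTL rule: a cell is treated as expired once its timestamp
    // is older than (now - ttl). The strict comparison is an assumption.
    static boolean isExpired(long cellTimestampMs, long nowMs, int ttlSeconds) {
        long oldestAllowedMs = nowMs - ttlSeconds * 1000L;
        return cellTimestampMs < oldestAllowedMs;
    }
}
```

For the 24-hour retention discussed in this thread, ttlSeconds would be 86400. Expired cells stop being returned to readers, and the space is reclaimed when the store files are compacted rather than at the moment of expiry.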