Hi,

I'm working on a project where we have a strange use case.

First off, we use bulk loading exclusively. We never use the put or batch put 
interfaces to load data into tables.

We have requirements that push me toward segregating data by table and column 
family. Our data is clearly delineated by the job that produced it, and we 
would like to be able to quickly delete or export the data from a given data 
set. To enable this, I have been considering using column families so that it 
is quick for us, and easy on HBase, to delete data that is no longer needed.

It is my understanding that multiple column families can bite you through the 
put interface and the memstore: when families have different data 
distributions across regions, a flush triggered by one family forces the 
others to flush too, producing lots of small files and lumpy, unevenly sized 
regions. I have convinced myself that because our key space is so incredibly 
consistent, we don't have the lumpiness issue.

And so, I ask this: given that bulk loading bypasses the memstore, are there 
any other drawbacks to using tables and column families to segregate data for 
easy, quick backup and deletion? If you are wondering about our backup 
strategy, it involves snapshots and clones. Once a table is cloned, we can 
drop from the clone the column families we don't want to export to tape. And 
deletion becomes quick because the bulk of the work is deleting the column 
family's files from HDFS.
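For concreteness, the clone-then-drop workflow I have in mind looks roughly 
like this in the HBase shell (the table, snapshot, and family names here are 
made up for illustration):

```
# Snapshot the live table and clone the snapshot for export
snapshot 'jobs_table', 'jobs_snap'
clone_snapshot 'jobs_snap', 'jobs_export'

# Drop the families we don't want to send to tape from the clone
alter 'jobs_export', NAME => 'job_b', METHOD => 'delete'

# Later, retiring an expired data set is just dropping its family,
# which mostly amounts to removing its files from HDFS
alter 'jobs_table', NAME => 'job_a', METHOD => 'delete'
```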

All feedback is greatly appreciated!

Thanks

Dave
