Hi, I'm working on a project where we have a strange use case.
First off, we use bulk loading exclusively. We never use the put or bulk put interface to load data into tables.

We have requirements that make me want to segregate data by table and column family. Our data is clearly delineated by the job it came from, and we would like to be able to quickly delete or export a given data set. To enable this, I have been considering using column families, so that deleting data that is no longer needed is quick for us and easy on HBase.

It is my understanding that multiple column families hurt you through the put interface and the memstore, and that column families with different distributions across the partitions can cause lumpiness in those partitions. I have convinced myself that, because our key space is so consistent, we don't have the lumpiness issue.

So I ask: given that we don't use the memstore, are there any other drawbacks to using tables and column families to segregate data for easy, quick backup and deletion?

In case you are wondering about our backup strategy, it involves snapshots and clones. Once a table is cloned, we can delete from the table the column families we don't want to export to tape. Deletion is quick because the bulk of the work is removing that column family's files from HDFS.

All feedback is greatly appreciated!

Thanks,
Dave
