Hi everyone,

Are there any tools or libraries for managing HDFS files that are used solely to build indexes in HBase? In other words, is there a way to seamlessly feed new HDFS files into a periodic MapReduce job that builds indexes, and to reprocess those files if the index-building logic or the underlying HDFS files change?

I'm looking for something similar to HCatalog, but the limitation I find with it is that there's no way to rebuild parts of an index without either deleting the old index entries or guaranteeing that the new index cells will completely overwrite the old ones.

Here's an example to better explain:

- Assume I want to build an index in HBase on HDFS files A, B, and C.
- Let's say I build that index with a MapReduce job and then realize that one of the auxiliary lookup files used in that job was not completely correct.
- I'd like to rerun the indexing job at this point, but it's entirely possible that the new index won't involve all the same cells as the old index.
- I can't delete all the old index entries before running the new job, since that index may still be in use, so there's no obvious way to update the index in isolation.

The prevailing approach to solving this seems to be to continually rebuild the indexes in full and to have a way to atomically swap the old indexes out for the new ones. A better approach might be to do the same thing at a finer granularity, and what I'm really asking is whether there is any tool that does exactly that.

A naive approach to "versioning" like this at a finer granularity might simply tie HDFS files to cells in HBase, give that association a version number, and allow clients to read only the cells in HBase associated with active versions (as opposed to versions that are currently being inserted into HBase). The "active" version could then be incremented at the end of a successful MapReduce index build for all files used in that job.

If there are no existing tools for something like this, then doing what I described above is probably the route I'll take. I'm very curious to hear whether others are facing similar problems and whether a tool to solve them would be more widely beneficial.

Thank you for your time, and I apologize if this would be a better question for the HBase users list.

- Eric
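
P.S. To make the naive versioning idea above a bit more concrete, here is a rough in-memory sketch in Python. It does not use any HBase or MapReduce API; the class VersionedIndex and its methods begin_build, put, commit, and get are hypothetical names made up for illustration, and a plain dict stands in for the index table. It only shows the intended visibility rule: cells written by a running build stay hidden until the build succeeds, and after the switch any stale cells from the old version stop being visible even if the new build never overwrote them.

```python
from collections import defaultdict


class VersionedIndex:
    """In-memory stand-in for an index table (hypothetical, not an HBase API)."""

    def __init__(self):
        # (source_file, version) -> {index_key: value}
        self._cells = defaultdict(dict)
        # source_file -> version that readers are allowed to see
        self._active = {}

    def begin_build(self, source_files):
        """Pick a staging version (active + 1) for each file in this build."""
        return {f: self._active.get(f, 0) + 1 for f in source_files}

    def put(self, source_file, version, key, value):
        """Write an index cell tagged with its source file and version."""
        self._cells[(source_file, version)][key] = value

    def commit(self, staged):
        """After a successful build, make the staged versions the active ones."""
        for source_file, version in staged.items():
            self._active[source_file] = version
        # Cells from superseded versions are left in place here; a real
        # system would presumably clean them up later.

    def get(self, key):
        """Return the values for key, reading only from active versions."""
        return sorted(
            cells[key]
            for (source_file, version), cells in self._cells.items()
            if self._active.get(source_file) == version and key in cells
        )


# First build over file A: invisible until commit.
index = VersionedIndex()
staged = index.begin_build(["A"])
index.put("A", staged["A"], "user:1", "A:row0")
index.commit(staged)

# Rebuild of A with corrected lookup data: it produces a different cell,
# and readers keep seeing the old version until the new one is committed.
restaged = index.begin_build(["A"])
index.put("A", restaged["A"], "user:2", "A:row0")
index.commit(restaged)
```

In this sketch the switch in commit is trivially atomic because everything lives in one process. With a real table, the active-version map would have to be stored somewhere readers can consult consistently, and that part is left open here.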
