Silly question(s). 

1) What sort of indexes do you want to build?

2) Why would you want to store your indexes outside of HBase? 

(Ok they are not so silly.  But I don't want people to think that I'm against 
the idea, just that it's more of an issue of design.) 

-Mike

On Oct 12, 2012, at 7:03 AM, Eric Czech <eczec...@gmail.com> wrote:

> Hi everyone,
> 
> Are there any tools or libraries for managing HDFS files that are used
> solely for the purpose of creating indexes in HBase?  In other words, is
> there any way to seamlessly integrate new HDFS files into a periodic
> MapReduce process that builds indexes and also reprocess those files if the
> index building logic or underlying HDFS files change?
> 
> I'm looking for something similar to HCatalog but the limitation I find
> with it is that there's no way to rebuild parts of an index without
> deleting the old index entries or having to guarantee that the new index
> cells will completely overwrite the old ones.
> 
> Here's an example to better explain:
> 
> -  Assume I want to build an index in HBase on HDFS files A, B, and C.
> -  Let's say I build that index with a MapReduce job and then realize that
> one of the auxiliary lookup files used in that job was not completely
> correct.
> -  I'd like to rerun the indexing job at this point but it's entirely
> possible that the new index won't involve all the same cells as the old
> index.
> -  Now, I can't delete all the old index entries before running the new job,
> since that index may still be in use, so there's no obvious way to update
> the index in isolation.
> 
> The prevailing approach to solving this seems to be continually rebuilding
> the indexes in full and having a way to atomically switch the old indexes
> out with the new ones.  A better approach might be to do the same thing
> with a higher granularity and what I'm really asking is whether or not
> there is any tool that does exactly that.
> 
> A naive approach to "versioning" like this with higher granularity might
> simply tie HDFS files to cells in HBase, give that association a version
> number, and allow clients to only read cells from HBase associated with
> active versions (as opposed to versions that are currently being inserted
> into HBase).  Then the "active" version could be incremented at the end of
> a successful MapReduce index build for all files used in that job.
> 
> If there are no existing tools for something like this, then doing what I
> mentioned above is probably the route I'll take and I'm very curious to
> hear if others are facing similar problems and whether or not a tool to
> solve them would be more widely beneficial.
> 
> Thank you!

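For illustration, the per-file "versioning" idea quoted above could be sketched roughly as follows. This is only an in-memory model in Python, not HBase client code: the class VersionedIndex and its methods (begin_build, put, commit, get) are hypothetical names invented for this sketch, and it assumes the active version per HDFS file would live in some small metadata table that readers consult.

```python
from collections import defaultdict


class VersionedIndex:
    """In-memory stand-in for an HBase index table (hypothetical, for illustration)."""

    def __init__(self):
        # (index_key, source_file, version) -> value
        self.cells = {}
        # source_file -> version that readers are allowed to see
        self.active = defaultdict(int)

    def begin_build(self, files):
        """Return the staging version for each file; readers cannot see it yet."""
        return {f: self.active[f] + 1 for f in files}

    def put(self, key, source_file, version, value):
        """Write one index cell tagged with its source file and build version."""
        self.cells[(key, source_file, version)] = value

    def commit(self, staged):
        """Flip the active version for every file in the job after a successful build."""
        for f, v in staged.items():
            self.active[f] = v

    def get(self, key):
        """Return only values written under the currently active version of their file."""
        return sorted(
            value
            for (k, f, v), value in self.cells.items()
            if k == key and v == self.active[f]
        )


index = VersionedIndex()

# First build over files A and B.
staged = index.begin_build(["A", "B"])
index.put("user1", "A", staged["A"], "A:offset0")
index.put("user2", "B", staged["B"], "B:offset0")
index.commit(staged)

# Rebuild only file A with corrected logic; "user1" is no longer emitted.
staged = index.begin_build(["A"])
index.put("user3", "A", staged["A"], "A:offset0")
before_commit = (index.get("user1"), index.get("user3"))
index.commit(staged)
after_commit = (index.get("user1"), index.get("user3"))
```

In this sketch, readers keep seeing the old entries for file A until commit runs, and entries for file B are untouched by the partial rebuild. Cells left behind under inactive versions would still need to be cleaned up separately, which the sketch does not attempt.
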