Jason Rutherglen:
> Hello,
> 
> I'm curious as to what a 'good' approach would be for implementing
> search in HBase (using Lucene) with the end goal being the integration
> of realtime search into HBase.  I think the use case makes sense as
> HBase is realtime and has a write-ahead log, performs automatic
> partitioning, splitting of data, failover, redundancy, etc.  These are
> all things Lucene does not have out of the box, that we'd essentially
> get for 'free'.
> 
> For starters: Where would be the right place to store Lucene segments
> or postings?  Eg, we need to be able to efficiently perform a linear
> iteration of the per-term posting list(s).
> 
> Thanks!
> 
> Jason Rutherglen
Hi Jason,

I had the same idea around a year ago but didn't pursue it further, since I'm 
leaving the company right now.
Do you want to do term or document partitioning? Both have advantages and 
disadvantages. You can find a very good introduction in chapter 14.1 of this 
book:
http://www.ir.uwaterloo.ca/book
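To make the distinction concrete, here is a toy sketch (my own illustration, not from the book or any of the projects below) of the two schemes over a tiny inverted index. All names are hypothetical; real systems add replication, routing, and merging on top:

```python
def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.split()):
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()
    return index

def document_partition(docs, num_shards):
    """Document partitioning: each shard indexes a subset of the documents.
    Every query must be fanned out to all shards and the results merged."""
    shards = [{} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % num_shards][doc_id] = text
    return [build_inverted_index(s) for s in shards]

def term_partition(docs, num_shards):
    """Term partitioning: each shard owns the complete posting list for a
    subset of the terms. A query only touches the shards owning its terms."""
    full = build_inverted_index(docs)
    shards = [{} for _ in range(num_shards)]
    for term, postings in full.items():
        # Deterministic toy hash so the term-to-shard mapping is stable.
        shards[sum(map(ord, term)) % num_shards][term] = postings
    return shards

docs = {0: "hbase stores rows", 1: "lucene stores postings", 2: "hbase hosts lucene"}
doc_shards = document_partition(docs, 2)
term_shards = term_partition(docs, 2)

# Document partitioning: the postings for "lucene" are split across shards.
print([s.get("lucene", []) for s in doc_shards])
# Term partitioning: exactly one shard holds the full "lucene" posting list.
print([s["lucene"] for s in term_shards if "lucene" in s])
```

The trade-off the book discusses falls out of this shape: document partitioning keeps indexing local but fans every query out, while term partitioning answers a query from few shards but has to ship whole posting lists around when documents change.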

The following lecture gives a very interesting insight on Google's index 
architecture:
http://videolectures.net/wsdm09_dean_cblirs

Projects that do document partitioning:
distributed Solr, Katta, ElasticSearch, LinkedIn's Sensei
Projects that do term partitioning:
Lucandra/Solandra (using Cassandra), HBasene (which has been abandoned for about a year)

For a while I thought HBasene would be a perfect solution for scalable 
search, but the above book and video convinced me that improving Katta would 
be the way to go:
- implement an indexing solution for Katta
- serve the index shards from memory, as Google apparently does

Hope I could help, please keep us posted,

Thomas Koch, http://www.koch.ro
