Jason Rutherglen:
> Hello,
>
> I'm curious as to what a 'good' approach would be for implementing
> search in HBase (using Lucene) with the end goal being the integration
> of realtime search into HBase. I think the use case makes sense as
> HBase is realtime and has a write-ahead log, performs automatic
> partitioning, splitting of data, failover, redundancy, etc. These are
> all things Lucene does not have out of the box, that we'd essentially
> get for 'free'.
>
> For starters: Where would be the right place to store Lucene segments
> or postings? Eg, we need to be able to efficiently perform a linear
> iteration of the per-term posting list(s).
>
> Thanks!
>
> Jason Rutherglen

Hi Jason,
I had the same idea around a year ago but didn't pursue it, since I'm leaving the company right now.

Do you want to do term or document partitioning? Both have advantages and disadvantages. You can find a very good introduction in chapter 14.1 of this book: http://www.ir.uwaterloo.ca/book

The following lecture gives a very interesting insight into Google's index architecture: http://videolectures.net/wsdm09_dean_cblirs

Projects that do document partitioning: distributed Solr, katta, elasticsearch, LinkedIn's Sensei
Projects that do term partitioning: lucandra/solandra (using Cassandra), hbasene (which has been abandoned for about a year)

I very much thought that hbasene would be a perfect solution for scalable search, but the above book and video convinced me that improving katta would be the way to go:
- implement an indexing solution for katta
- serve the index shards from memory, as Google apparently does

Hope I could help; please keep us posted,

Thomas Koch, http://www.koch.ro
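To make the document-partitioning idea above concrete, here is a minimal, self-contained Java sketch of scatter-gather search over document-partitioned shards: each shard holds a disjoint subset of documents, a query is sent to every shard, and the partial top-k results are merged. All names here (Shard, searchAll, the scoring) are illustrative assumptions, not Lucene or katta APIs; with term partitioning you would instead route the query only to the shard(s) owning that term's posting list.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of document partitioning (scatter-gather).
// Not a real Lucene/katta API; names and scoring are hypothetical.
public class DocPartitionSketch {

    // One shard: term -> posting list of (docId, score) over its local docs.
    static class Shard {
        final Map<String, List<Map.Entry<Integer, Double>>> postings = new HashMap<>();

        void add(int docId, String term, double score) {
            postings.computeIfAbsent(term, t -> new ArrayList<>())
                    .add(Map.entry(docId, score));
        }

        // Local top-k for a single term, highest score first.
        List<Map.Entry<Integer, Double>> search(String term, int k) {
            return postings.getOrDefault(term, List.of()).stream()
                    .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                    .limit(k)
                    .collect(Collectors.toList());
        }
    }

    // Scatter the query to every shard, then merge the partial top-k lists.
    // With term partitioning, only the shard owning `term` would be queried.
    static List<Integer> searchAll(List<Shard> shards, String term, int k) {
        return shards.stream()
                .flatMap(s -> s.search(term, k).stream())
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Shard s0 = new Shard();
        s0.add(1, "hbase", 0.9);
        s0.add(2, "hbase", 0.4);
        Shard s1 = new Shard();
        s1.add(3, "hbase", 0.7);
        // Global top-2 merged from both shards' local top-2 lists.
        System.out.println(searchAll(List.of(s0, s1), "hbase", 2)); // [1, 3]
    }
}
```

The key trade-off this illustrates: document partitioning touches every shard per query but keeps each shard's work local and small, while term partitioning sends a query to few shards but must ship (or linearly iterate) whole per-term posting lists, which is exactly the access pattern Jason asks about.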