Bruno,

Thanks for the response.
> solr/katta/elasticsearch

These don't have a distributed solution for realtime search [yet]. E.g., a
transaction log is required, as is a place to store the versioned documents,
which sounds a lot like HBase? The technique of query sharding/partitioning
is fairly trivial, and something this solution would need to leverage as well.

> http://bizosyshsearch.sourceforge.net/

I looked. I'm a little confused as to why this and things like
Lucandra/Solandra create their own indexes, as this is [probably] going to
yield unpredictable RAM and performance inefficiencies that Lucene has
traversed and solved long ago. The user will [likely] want queries that are
as fast as possible. This is why Lucene 4.x's flexible indexing is
interesting to use in conjunction with HBase, e.g., there won't be a slowdown
in queries unless there's IO overhead added by the low-level usage of HBase
to store and iterate the postings.

I'd imagine the documents pertaining to an index would 'stick' with that
index, meaning they'd stay in the same region. I'm not sure how that would
be implemented in HBase.

> HBase scales on the row key, so if you use the term
> as row key you can have a quasi-unlimited amount of terms, but not
> unlimited long posting lists (i.e., documents) for those terms. The posting
> lists would not be sharded. If you use a 'term+seqnr' approach (manual
> sharding), the terms will usually end up in the same region, so reading them
> will all touch the same server.

The posting list will need to stay in the same region, and the [few] posting
lists that span rows may not actually impact performance, e.g., they'll
probably only need to span once? That will need to be tested. I'm not sure
how we'd efficiently map doc-ids to row keys to the actual document data.
(A rough sketch of the 'term+seqnr' row-key layout is appended at the end of
this mail.)

> There is something to say for keeping the fulltext index for all rows stored
> in one HBase region alongside the region, but when a region splits,
> splitting the fulltext index would be expensive.

Right, splitting postings was briefly discussed in Lucene-land, and is
probably implementable in an efficient way.

Jason

On Sat, Feb 12, 2011 at 3:02 AM, Bruno Dumon <br...@outerthought.org> wrote:
> Hi,
>
> AFAIU scaling fulltext search is usually done by processing partitions of
> posting lists concurrently. That is essentially what you get with sharded
> solr/katta/elasticsearch. I wonder how you would map things to HBase so that
> this would be possible. HBase scales on the row key, so if you use the term
> as row key you can have a quasi-unlimited amount of terms, but not
> unlimited long posting lists (i.e., documents) for those terms. The posting
> lists would not be sharded. If you use a 'term+seqnr' approach (manual
> sharding), the terms will usually end up in the same region, so reading them
> will all touch the same server.
>
> There is something to say for keeping the fulltext index for all rows stored
> in one HBase region alongside the region, but when a region splits,
> splitting the fulltext index would be expensive.
>
> BTW, here is another attempt to build fulltext search on top of HBase:
>
> http://bizosyshsearch.sourceforge.net/
>
> But from what I understood their approach to scalability is partitioning by
> term (instead of by document), and sharding over multiple HBase clusters:
>
> http://sourceforge.net/projects/bizosyshsearch/forums/forum/1295149/topic/4006417
>
>
> On Sat, Feb 12, 2011 at 4:21 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
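A minimal sketch of the 'term+seqnr' row-key layout referenced above, against
the plain HBase client API as it existed around this time. The table name
('postings'), column family, qualifier, shard numbering, and the toy doc-id
encoding are assumptions for illustration only, not something proposed in this
thread; a real implementation would plug into Lucene 4.x's flexible indexing
rather than hand-roll the postings encoding.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PostingListSketch {
  // Assumed schema: one column family "p" with a single qualifier "ids"
  // holding an encoded chunk of a term's posting list.
  private static final byte[] FAMILY = Bytes.toBytes("p");
  private static final byte[] QUALIFIER = Bytes.toBytes("ids");
  private static final byte SEPARATOR = 0x00;

  // Row key = term + 0x00 + seqnr, so all shards of one term sort together
  // (and, as noted above, will usually land in the same region).
  static byte[] rowKey(String term, int seqnr) {
    return Bytes.add(Bytes.toBytes(term), new byte[] { SEPARATOR }, Bytes.toBytes(seqnr));
  }

  // Write one shard (row) of a term's posting list.
  static void writeShard(HTable table, String term, int seqnr, byte[] encodedDocIds)
      throws IOException {
    Put put = new Put(rowKey(term, seqnr));
    put.add(FAMILY, QUALIFIER, encodedDocIds);
    table.put(put);
  }

  // Read back every shard of a term's posting list with a single range scan
  // over the term prefix.
  static void readPostings(HTable table, String term) throws IOException {
    byte[] start = Bytes.add(Bytes.toBytes(term), new byte[] { SEPARATOR });
    byte[] stop = Bytes.add(Bytes.toBytes(term), new byte[] { (byte) 0x01 });
    Scan scan = new Scan(start, stop);
    scan.addFamily(FAMILY);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        byte[] shard = result.getValue(FAMILY, QUALIFIER);
        // Decode and merge the doc ids from this shard here.
      }
    } finally {
      scanner.close();
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "postings"); // assumed table name
    writeShard(table, "hbase", 0, Bytes.toBytes("1,5,42")); // toy encoding
    readPostings(table, "hbase");
    table.close();
  }
}

Reading with one range scan works because every shard of a term shares the
term prefix; that shared prefix is also why, as Bruno notes, the shards will
usually end up in the same region and be served by the same server.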