Bruno,

Thanks for the response.
> solr/katta/elasticsearch

These don't have a distributed solution for realtime search [yet]. E.g., a
transaction log is required, as is a place to store the versioned documents,
which sounds a lot like HBase? The technique of query sharding/partitioning
is fairly trivial, and something this solution would need to leverage as well.

> http://bizosyshsearch.sourceforge.net/

I looked. I'm a little confused as to why this and things like
Lucandra/Solandra create their own indexes, as this is [probably] going to
yield unpredictable RAM and performance inefficiencies that Lucene has
traversed and solved long ago. The user will [likely] want queries that are
as fast as possible. This is why Lucene 4.x's flexible indexing is
interesting to use in conjunction with HBase, e.g., there won't be a slowdown
in queries unless there's IO overhead added by the low-level usage of HBase
to store and iterate the postings.

I'd imagine the documents pertaining to an index would 'stick' with that
index, meaning they'd stay in the same region. I'm not sure how that would
be implemented in HBase.

> HBase scales on the row key, so if you use the term
> as row key you can have a quasi-unlimited amount of terms, but not
> unlimited long posting lists (i.e., documents) for those terms. The posting
> lists would not be sharded. If you use a 'term+seqnr' approach (manual
> sharding), the terms will usually end up in the same region, so reading them
> will all touch the same server.

The posting list will need to stay in the same region, and the [few] posting
lists that span rows may not actually impact performance, e.g., they'll
probably only need to span once? That will need to be tested. I'm not sure
how we'd efficiently map doc-ids to row keys to the actual document data.
(A rough sketch of the 'term+seqnr' row-key layout is appended at the end of
this mail.)

> There is something to say for keeping the fulltext index for all rows stored
> in one HBase region alongside the region, but when a region splits,
> splitting the fulltext index would be expensive.

Right, splitting postings was briefly discussed in Lucene-land, and is
probably implementable in an efficient way.

Jason

On Sat, Feb 12, 2011 at 3:02 AM, Bruno Dumon <br...@outerthought.org> wrote:
> Hi,
>
> AFAIU scaling fulltext search is usually done by processing partitions of
> posting lists concurrently. That is essentially what you get with sharded
> solr/katta/elasticsearch. I wonder how you would map things to HBase so that
> this would be possible. HBase scales on the row key, so if you use the term
> as row key you can have a quasi-unlimited amount of terms, but not
> unlimited long posting lists (i.e., documents) for those terms. The posting
> lists would not be sharded. If you use a 'term+seqnr' approach (manual
> sharding), the terms will usually end up in the same region, so reading them
> will all touch the same server.
>
> There is something to say for keeping the fulltext index for all rows stored
> in one HBase region alongside the region, but when a region splits,
> splitting the fulltext index would be expensive.
>
> BTW, here is another attempt to build fulltext search on top of HBase:
>
> http://bizosyshsearch.sourceforge.net/
>
> But from what I understood their approach to scalability is partitioning by
> term (instead of by document), and sharding over multiple HBase clusters:
>
> http://sourceforge.net/projects/bizosyshsearch/forums/forum/1295149/topic/4006417
>
>
> On Sat, Feb 12, 2011 at 4:21 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
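A minimal sketch of the 'term+seqnr' row-key layout referenced above, against
the plain HBase client API as it existed around this time. The table name
('postings'), column family, qualifier, shard numbering, and the toy doc-id
encoding are assumptions for illustration only, not something proposed in this
thread; a real implementation would plug into Lucene 4.x's flexible indexing
rather than hand-roll the postings encoding.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PostingListSketch {
  // Assumed schema: one column family "p" with a single qualifier "ids"
  // holding an encoded chunk of a term's posting list.
  private static final byte[] FAMILY = Bytes.toBytes("p");
  private static final byte[] QUALIFIER = Bytes.toBytes("ids");
  private static final byte SEPARATOR = 0x00;

  // Row key = term + 0x00 + seqnr, so all shards of one term sort together
  // (and, as noted above, will usually land in the same region).
  static byte[] rowKey(String term, int seqnr) {
    return Bytes.add(Bytes.toBytes(term), new byte[] { SEPARATOR }, Bytes.toBytes(seqnr));
  }

  // Write one shard (row) of a term's posting list.
  static void writeShard(HTable table, String term, int seqnr, byte[] encodedDocIds)
      throws IOException {
    Put put = new Put(rowKey(term, seqnr));
    put.add(FAMILY, QUALIFIER, encodedDocIds);
    table.put(put);
  }

  // Read back every shard of a term's posting list with a single range scan
  // over the term prefix.
  static void readPostings(HTable table, String term) throws IOException {
    byte[] start = Bytes.add(Bytes.toBytes(term), new byte[] { SEPARATOR });
    byte[] stop = Bytes.add(Bytes.toBytes(term), new byte[] { (byte) 0x01 });
    Scan scan = new Scan(start, stop);
    scan.addFamily(FAMILY);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        byte[] shard = result.getValue(FAMILY, QUALIFIER);
        // Decode and merge the doc ids from this shard here.
      }
    } finally {
      scanner.close();
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "postings"); // assumed table name
    writeShard(table, "hbase", 0, Bytes.toBytes("1,5,42")); // toy encoding
    readPostings(table, "hbase");
    table.close();
  }
}

Reading with one range scan works because every shard of a term shares the
term prefix; that shared prefix is also why, as Bruno notes, the shards will
usually end up in the same region and be served by the same server.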