Hi,

AFAIU, scaling fulltext search is usually done by processing partitions of
the posting lists concurrently. That is essentially what you get with
sharded solr/katta/elasticsearch. I wonder how you would map things to
HBase so that this remains possible. HBase scales on the row key, so if you
use the term as the row key you can have a quasi-unlimited number of terms,
but not arbitrarily long posting lists (i.e., documents) for those terms.
The posting lists themselves would not be sharded. If you use a
'term+seqnr' approach (manual sharding), the shards of one term will
usually end up in the same region, so reading them all will touch the same
region server.
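
To make the 'term+seqnr' idea concrete, here is a rough sketch of what I
have in mind. The 'postings' table, the 'p' column family and the
fixed-width shard numbering are just assumptions for the example, nothing
that exists anywhere: each term's posting list is cut into shards stored
under term+shard row keys, and read back with one prefix scan. Since that
scan is a single contiguous key range, it is normally served by a single
region (server), which is exactly the concern above.

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TermShardedPostings {
    // assumed schema: table "postings", family "p", one qualifier holding an
    // encoded posting-list shard
    private static final byte[] FAMILY = Bytes.toBytes("p");
    private static final byte[] QUAL = Bytes.toBytes("list");

    // row key = term, NUL separator, zero-padded shard number, so all shards
    // of one term sort next to each other
    static byte[] rowKey(String term, int shard) {
        return Bytes.toBytes(String.format("%s\0%08d", term, shard));
    }

    // write one shard of a term's posting list
    static void putShard(HTable table, String term, int shard, byte[] postings)
            throws IOException {
        Put put = new Put(rowKey(term, shard));
        put.add(FAMILY, QUAL, postings);
        table.put(put);
    }

    // read back all shards of one term with a prefix scan; the scan is one
    // contiguous key range, i.e. it will normally hit a single region
    static void readPostings(HTable table, String term) throws IOException {
        Scan scan = new Scan(Bytes.toBytes(term + "\0"), Bytes.toBytes(term + "\1"));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                byte[] shard = r.getValue(FAMILY, QUAL);
                // ... decode the shard and merge it into the full posting list ...
            }
        } finally {
            scanner.close();
        }
    }

    public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "postings");
        putShard(table, "hbase", 0, new byte[0]); // empty payload, just for illustration
        readPostings(table, "hbase");
        table.close();
    }
}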

There is something to be said for keeping the fulltext index for all rows
stored in one HBase region alongside that region, but when a region splits,
splitting the fulltext index would be expensive (sketch below).
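
To illustrate why I think the split would be expensive, here is a rough
sketch of the naive way to do it; everything in it (field names, directory
layout) is made up for the example. On a split at 'splitKey' you would have
to walk every document of the parent region's Lucene index and re-add it to
whichever daughter index its row key now belongs to:

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class RegionIndexSplitter {

    // assumes each indexed document stores its HBase row key in a "row" field
    public static void split(File parentDir, File leftDir, File rightDir,
                             byte[] splitKey) throws IOException {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
        IndexWriter left = new IndexWriter(FSDirectory.open(leftDir), analyzer,
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        IndexWriter right = new IndexWriter(FSDirectory.open(rightDir), analyzer,
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        IndexReader reader = IndexReader.open(FSDirectory.open(parentDir));
        try {
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;
                Document doc = reader.document(i);
                byte[] row = Bytes.toBytes(doc.get("row"));
                // rows below the split key stay in the lower daughter region
                if (Bytes.compareTo(row, splitKey) < 0) {
                    left.addDocument(doc);
                } else {
                    right.addDocument(doc);
                }
            }
        } finally {
            reader.close();
            left.close();
            right.close();
        }
    }
}

(And reader.document(i) only gives you back the stored fields, so in reality
you would have to re-analyze the original row content rather than copy
documents around, which makes it even more costly.)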

BTW, here is another attempt to build fulltext search on top of HBase:

http://bizosyshsearch.sourceforge.net/

But from what I understand, their approach to scalability is to partition
by term (instead of by document) and to shard over multiple HBase clusters:

http://sourceforge.net/projects/bizosyshsearch/forums/forum/1295149/topic/4006417


On Sat, Feb 12, 2011 at 4:21 AM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> > No.  And I doubt there ever will be
>
> Hmm... Because of the use of blocks at a low level?  This isn't too
> much different from an OS filesystem; however, I wonder how much
> overhead there is in the use of HBase blocks?  If the posting exceeded the
> block size, yeah, that'd be an issue.  Spanning key-value pairs for a
> posting, that sounds a little scary.  However, it seems possible to
> provide direct access to the underlying filesystem in a separate API?
> I'm surprised this isn't a more requested feature, given that HBase is
> 'based' on BigTable, which can store large BLOBs?  If the query
> performance degrades at all, then this isn't a viable solution.
> Though the advantages of storing the indexes in HBase and leveraging its
> data storage, replication and distribution capabilities would still seem
> to make it worthwhile.
>
> On Fri, Feb 11, 2011 at 7:00 PM, Ted Dunning <tdunn...@maprtech.com>
> wrote:
> > No.  And I doubt there ever will be.
> >
> > That was one reason to split the larger posting vectors.  That way you can
> > multi-thread the fetching and the scoring.
> >
> > On Fri, Feb 11, 2011 at 6:56 PM, Jason Rutherglen <
> > jason.rutherg...@gmail.com> wrote:
> >
> >> Thanks!  In browsing the HBase code, I think it'd be optimal to stream
> >> the posting/binary data directly from the underlying storage (instead
> >> of loading the entire byte[]), but it doesn't look like there's a way to
> >> do this (yet)?
> >>
> >> On Fri, Feb 11, 2011 at 6:20 PM, Ted Dunning <tdunn...@maprtech.com>
> >> wrote:
> >> > Go for it!
> >> >
> >> > On Fri, Feb 11, 2011 at 4:44 PM, Jason Rutherglen <
> >> > jason.rutherg...@gmail.com> wrote:
> >> >
> >> >> > Michi's stuff uses flexible indexing with a zero lock architecture.  The
> >> >> > speed *is* much higher.
> >> >>
> >> >> The speed's higher, and there isn't much Lucene left there either, as
> >> >> I believe it was built specifically for the 140-character use case
> >> >> (eg, not the general use case).  I don't think most indexes can be
> >> >> compressed to only exist in RAM on a single server?  The Twitter use
> >> >> case isn't one that the HBase RT search solution is useful for?
> >> >>
> >> >> > If you were to store entire posting vectors as values with terms as
> >> >> > keys, you might be OK.  Very long posting vectors or add-ons could be
> >> >> > added using a key+serial number trick.
> >> >>
> >> >> This sounds like the right approach to try.  Also, the Lucene terms
> >> >> dict is sorted anyway, so moving the terms into HBase's sorted keys
> >> >> probably makes sense.
> >> >>
> >> >> > For updates, speed would only be acceptable if you batch up a lot of
> >> >> > updates or possibly if you build in a value append function as a
> >> >> > co-processor.
> >> >>
> >> >> Hmm... I think the main issue would be the way Lucene implements
> >> >> deletes (eg, today as a BitVector).  I think we'd keep that
> >> >> functionality.  The new docs/updates would be added to the in-RAM
> >> >> buffer.  I think there'd be a RAM-size-based flush as there is
> >> >> today.  Where that'd be flushed to is an open question.
> >> >>
> >> >> I think the key advantage of the RT + HBase architecture is that the
> >> >> index would live alongside HBase columns, and so all other scaling
> >> >> problems (especially those related to scaling RT, such as
> >> >> synchronization of distributed data and updates) go away.
> >> >>
> >> >> A distributed query would remain the same, eg, it'd hit N servers?
> >> >>
> >> >> In addition, Lucene offers a wide variety of new query types which
> >> >> HBase'd get in realtime for free.
> >> >>
> >> >> On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <tdunn...@maprtech.com>
> >> >> wrote:
> >> >> > On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen <
> >> >> > jason.rutherg...@gmail.com> wrote:
> >> >> >
> >> >> >> > I can't imagine that the speed achieved by using HBase would be even
> >> >> >> > within orders of magnitude of what you can do in Lucene 4 (or even 3).
> >> >> >>
> >> >> >> The indexing speed in Lucene hasn't changed in quite a while; are you
> >> >> >> saying HBase would somehow be overloaded?  That doesn't seem to jibe
> >> >> >> with the sequential writes HBase performs?
> >> >> >>
> >> >> >
> >> >> > Michi's stuff uses flexible indexing with a zero lock architecture.  The
> >> >> > speed *is* much higher.
> >> >> >
> >> >> > The real problem is that HBase repeats keys.
> >> >> >
> >> >> > If you were to store entire posting vectors as values with terms as
> >> >> > keys, you might be OK.  Very long posting vectors or add-ons could be
> >> >> > added using a key+serial number trick.
> >> >> >
> >> >> > Short queries would involve reading and merging several posting
> >> >> > vectors.  In that mode, query speeds might be OK, but there isn't a lot
> >> >> > of Lucene left at that point.  For updates, speed would only be
> >> >> > acceptable if you batch up a lot of updates or possibly if you build in
> >> >> > a value append function as a co-processor.
> >> >> >
> >> >> >
> >> >> >
> >> >> >> The speed of indexing is a function of creating segments.  With
> >> >> >> flexible indexing, the underlying segment files (and postings) may be
> >> >> >> significantly altered from the default file structures, eg, placed
> >> >> >> into HBase in various ways.  The posting lists could even be split
> >> >> >> along with HBase regions?
> >> >> >>
> >> >> >
> >> >> > Possibly.  But if you use term + counter and posting vectors of
> >> >> > limited length you might be OK.
> >> >> >
> >> >>
> >> >
> >>
> >
>



-- 
Bruno Dumon
Outerthought
http://outerthought.org/
