No, and I doubt there ever will be. That was one reason to split the larger
posting vectors: that way you can multi-thread the fetching and the scoring.
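Roughly the shape I have in mind, as an untested sketch: each chunk of a split
posting vector is fetched on its own thread and scored as it arrives, with the
partial scores merged at the end. The table name, column family, and the
term#serial row-key scheme are invented for illustration, and this uses the old
HTable client API (one handle per worker, since HTable isn't thread-safe).

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ParallelPostingFetch {
      static final byte[] FAMILY = Bytes.toBytes("p");  // invented column family
      static final byte[] QUAL = Bytes.toBytes("v");    // invented qualifier

      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(8);

        final String term = "lucene";
        int numChunks = 4;  // e.g. read from a per-term header row
        List<Future<Float>> partials = new ArrayList<Future<Float>>();

        for (int i = 0; i < numChunks; i++) {
          final int serial = i;
          partials.add(pool.submit(new Callable<Float>() {
            public Float call() throws Exception {
              // HTable isn't thread-safe, so each worker opens its own handle.
              HTable table = new HTable(conf, "index");  // invented table name
              try {
                // Row key = term + '#' + zero-padded chunk serial number.
                Get get = new Get(Bytes.toBytes(String.format("%s#%05d", term, serial)));
                get.addColumn(FAMILY, QUAL);
                Result r = table.get(get);
                return score(r.getValue(FAMILY, QUAL));  // score this slice
              } finally {
                table.close();
              }
            }
          }));
        }

        float total = 0;
        for (Future<Float> f : partials) {
          total += f.get();  // merge the per-chunk partial scores
        }
        pool.shutdown();
        System.out.println("score(" + term + ") = " + total);
      }

      // Stand-in for decoding a chunk's postings and scoring them.
      static float score(byte[] chunk) {
        return chunk == null ? 0f : chunk.length;  // placeholder scoring
      }
    }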
On Fri, Feb 11, 2011 at 6:56 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> Thanks! In browsing the HBase code, I think it'd be optimal to stream
> the posting/binary data directly from the underlying storage (instead
> of loading the entire byte[]); it doesn't look like there's a way to
> do this (yet)?
>
> On Fri, Feb 11, 2011 at 6:20 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> > Go for it!
> >
> > On Fri, Feb 11, 2011 at 4:44 PM, Jason Rutherglen <
> > jason.rutherg...@gmail.com> wrote:
> >
> >> > Michi's stuff uses flexible indexing with a zero lock architecture.
> >> > The speed *is* much higher.
> >>
> >> The speed's higher, and there isn't much Lucene left there either, as
> >> I believe it was built specifically for the 140-character use case
> >> (eg, not the general use case). I don't think most indexes can be
> >> compressed to only exist in RAM on a single server? The Twitter use
> >> case isn't one that the HBase RT search solution is useful for?
> >>
> >> > If you were to store entire posting vectors as values with terms as
> >> > keys, you might be OK. Very long posting vectors or add-ons could be
> >> > added using a key+serial number trick.
> >>
> >> This sounds like the right approach to try. Also, the Lucene terms
> >> dict is sorted anyway, so moving the terms into HBase's sorted keys
> >> probably makes sense.
> >>
> >> > For updates, speed would only be acceptable if you batch up a lot of
> >> > updates or possibly if you build in a value append function as a
> >> > co-processor.
> >>
> >> Hmm... I think the main issue would be the way Lucene implements
> >> deletes (eg, today as a BitVector). I think we'd keep that
> >> functionality. The new docs/updates would be added to the in-RAM
> >> buffer. I think there'd be a RAM-size-based flush as there is today.
> >> Where that'd be flushed to is an open question.
> >>
> >> I think the key advantage of the RT + HBase architecture is that the
> >> index would live alongside HBase columns, and so all the other scaling
> >> problems (especially those related to scaling RT, such as
> >> synchronization of distributed data and updates) go away.
> >>
> >> A distributed query would remain the same, eg, it'd hit N servers?
> >>
> >> In addition, Lucene offers a wide variety of new query types which
> >> HBase'd get in realtime for free.
> >>
> >> On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <tdunn...@maprtech.com>
> >> wrote:
> >> > On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen <
> >> > jason.rutherg...@gmail.com> wrote:
> >> >
> >> >> > I can't imagine that the speed achieved by using HBase would be
> >> >> > even within orders of magnitude of what you can do in Lucene 4
> >> >> > (or even 3).
> >> >>
> >> >> The indexing speed in Lucene hasn't changed in quite a while; are you
> >> >> saying HBase would somehow be overloaded? That doesn't seem to jibe
> >> >> with the sequential writes HBase performs?
> >> >
> >> > Michi's stuff uses flexible indexing with a zero lock architecture.
> >> > The speed *is* much higher.
> >> >
> >> > The real problem is that HBase repeats keys.
> >> >
> >> > If you were to store entire posting vectors as values with terms as
> >> > keys, you might be OK. Very long posting vectors or add-ons could be
> >> > added using a key+serial number trick.
> >> >
> >> > Short queries would involve reading and merging several posting
> >> > vectors. In that mode, query speeds might be OK, but there isn't a
> >> > lot of Lucene left at that point.
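To make the terms-as-keys plus key+serial idea concrete, here is a rough,
untested sketch (the table layout, column family, and chunk size are all
invented): each chunk of a long posting vector becomes its own row, keyed
term + '#' + zero-padded serial, so a prefix scan returns the chunks in order
and a batched put writes them in one round trip.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TermKeyedPostings {
      static final byte[] FAMILY = Bytes.toBytes("p");  // invented column family
      static final byte[] QUAL = Bytes.toBytes("v");    // invented qualifier
      static final int CHUNK = 64 * 1024;               // split vectors into 64 KB chunks

      // Write one (possibly very long) posting vector as rows
      // term#00000, term#00001, ... in a single batched round trip.
      static void writePostings(HTable table, String term, byte[] postings)
          throws IOException {
        List<Put> puts = new ArrayList<Put>();
        for (int i = 0, serial = 0; i < postings.length; i += CHUNK, serial++) {
          int len = Math.min(CHUNK, postings.length - i);
          byte[] chunk = new byte[len];
          System.arraycopy(postings, i, chunk, 0, len);
          Put put = new Put(Bytes.toBytes(String.format("%s#%05d", term, serial)));
          put.add(FAMILY, QUAL, chunk);
          puts.add(put);
        }
        table.put(puts);  // one batched write for all chunks
      }

      // Read the chunks back in key order with a prefix scan;
      // '~' sorts after the digits, so it bounds the term's serials.
      static List<byte[]> readPostings(HTable table, String term)
          throws IOException {
        Scan scan = new Scan(Bytes.toBytes(term + "#"), Bytes.toBytes(term + "#~"));
        ResultScanner scanner = table.getScanner(scan);
        List<byte[]> chunks = new ArrayList<byte[]>();
        try {
          for (Result r : scanner) {
            chunks.add(r.getValue(FAMILY, QUAL));
          }
        } finally {
          scanner.close();
        }
        return chunks;
      }
    }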
> >> > For updates, speed would only be acceptable if you batch up a lot of
> >> > updates or possibly if you build in a value append function as a
> >> > co-processor.
> >> >
> >> >> The speed of indexing is a function of creating segments; with
> >> >> flexible indexing, the underlying segment files (and postings) may be
> >> >> significantly altered from the default file structures, eg, placed
> >> >> into HBase in various ways. The posting lists could even be split
> >> >> along with HBase regions?
> >> >
> >> > Possibly. But if you use term + counter and post vectors of limited
> >> > length you might be OK.
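On the batching point, a minimal sketch of what a RAM-buffered writer could
look like: new postings are buffered per term and flushed to HBase in one
batched put once a size threshold is crossed, much like Lucene's RAM-based
flush. The threshold, key scheme, and encoding are all invented, and the
per-term serial counters would really have to survive restarts; a value-append
co-processor would avoid the serial rows entirely by appending to a single cell.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedPostingWriter {
      static final byte[] FAMILY = Bytes.toBytes("p");   // invented column family
      static final byte[] QUAL = Bytes.toBytes("v");     // invented qualifier
      static final long FLUSH_BYTES = 32 * 1024 * 1024;  // flush at ~32 MB of buffered postings

      private final HTable table;
      private final Map<String, List<Integer>> buffer = new HashMap<String, List<Integer>>();
      private final Map<String, Integer> nextSerial = new HashMap<String, Integer>();
      private long bufferedBytes = 0;

      BufferedPostingWriter(HTable table) {
        this.table = table;
      }

      // Buffer one new posting; flush when the RAM budget is exceeded.
      void add(String term, int docId) throws IOException {
        List<Integer> docs = buffer.get(term);
        if (docs == null) {
          docs = new ArrayList<Integer>();
          buffer.put(term, docs);
        }
        docs.add(docId);
        bufferedBytes += 4;
        if (bufferedBytes >= FLUSH_BYTES) {
          flush();
        }
      }

      // Write each term's buffered postings as a new term#serial row, so
      // existing vectors are never rewritten; readers merge chunks at query time.
      void flush() throws IOException {
        List<Put> puts = new ArrayList<Put>();
        for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
          int serial = nextSerial.containsKey(e.getKey()) ? nextSerial.get(e.getKey()) : 0;
          nextSerial.put(e.getKey(), serial + 1);
          Put put = new Put(Bytes.toBytes(String.format("%s#%05d", e.getKey(), serial)));
          put.add(FAMILY, QUAL, encode(e.getValue()));
          puts.add(put);
        }
        table.put(puts);  // one batched round trip for the whole buffer
        buffer.clear();
        bufferedBytes = 0;
      }

      // Placeholder encoding: 4 bytes per doc id (real postings would be compressed).
      static byte[] encode(List<Integer> docs) {
        byte[] out = new byte[docs.size() * 4];
        for (int i = 0; i < docs.size(); i++) {
          Bytes.putInt(out, i * 4, docs.get(i));
        }
        return out;
      }
    }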