> No.  And I doubt there ever will be

Hmm... Because of the use of blocks at a low level?  This isn't much
different from an OS filesystem, though I wonder how much overhead
there is in the use of HBase blocks?  If a posting exceeded the block
size, yeah, that'd be an issue.  Spanning key-value pairs for a
posting sounds a little scary.  However, it seems possible to provide
direct access to the underlying filesystem in a separate API?  I'm
surprised this isn't a more requested feature, given HBase is 'based'
on BigTable, which can store large BLOBs.  If query performance
degrades at all, then this isn't a viable solution.  Still, storing
the indexes in HBase and leveraging its data storage, replication,
and distribution capabilities would seem to make sense.
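
Here's a minimal write-side sketch of that chunked layout, using the
key+serial number trick Ted describes below, against the (circa 0.90)
HBase client API.  The table name, column family/qualifier, row-key
format, and 64 KB chunk size are all made-up assumptions for
illustration, not anything settled:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PostingChunkWriter {

      // Assumption: keep each chunk comfortably under the HBase block size.
      private static final int CHUNK_SIZE = 64 * 1024;

      // Hypothetical layout: one column family "p", one qualifier "v".
      private static final byte[] FAMILY = Bytes.toBytes("p");
      private static final byte[] QUALIFIER = Bytes.toBytes("v");

      /**
       * Writes one term's posting vector as a run of rows keyed by
       * term + zero-padded serial number, so a long vector spans
       * several modest values instead of one oversized cell.
       */
      public static void writePostings(HTable table, String term,
          byte[] postings) throws IOException {
        int serial = 0;
        for (int off = 0; off < postings.length; off += CHUNK_SIZE, serial++) {
          int len = Math.min(CHUNK_SIZE, postings.length - off);
          byte[] chunk = new byte[len];
          System.arraycopy(postings, off, chunk, 0, len);
          // Zero-padding keeps the chunks in write order under HBase's
          // lexicographic row-key sort.
          byte[] row = Bytes.toBytes(String.format("%s/%08d", term, serial));
          Put put = new Put(row);
          put.add(FAMILY, QUALIFIER, chunk);
          table.put(put);
        }
      }

      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "postings"); // hypothetical table
        writePostings(table, "lucene", new byte[300 * 1024]);
        table.close();
      }
    }

Because the chunk rows sort contiguously, a term's posting data stays
together and could split along region boundaries, per the point about
splitting posting lists with regions below.  A read-side counterpart
is sketched after the quoted thread.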

On Fri, Feb 11, 2011 at 7:00 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> No.  And I doubt there ever will be.
>
> That was one reason to split the larger posting vectors.  That way you can
> multi-thread the fetching and the scoring.
>
> On Fri, Feb 11, 2011 at 6:56 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>
>> Thanks!  In browsing the HBase code, I think it'd be optimal to stream
>> the posting/binary data directly from the underlying storage (instead
>> of loading the entire byte[]), but it doesn't look like there's a way
>> to do this (yet)?
>>
>> On Fri, Feb 11, 2011 at 6:20 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
>> > Go for it!
>> >
>> > On Fri, Feb 11, 2011 at 4:44 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>> >
>> >> > Michi's stuff uses flexible indexing with a zero lock architecture.
>> >> > The speed *is* much higher.
>> >>
>> >> The speed's higher, and there isn't much Lucene left there either, as
>> >> I believe it was built specifically for the 140-character use case
>> >> (i.e., not the general use case).  I don't think most indexes can be
>> >> compressed to only exist in RAM on a single server?  The Twitter use
>> >> case isn't one that the HBase RT search solution is useful for?
>> >>
>> >> > If you were to store entire posting vectors as values with terms as
>> >> > keys, you might be OK.  Very long posting vectors or add-ons could
>> >> > be added using a key+serial number trick.
>> >>
>> >> This sounds like the right approach to try.  Also, the Lucene terms
>> >> dict is sorted anyway, so moving the terms into HBase's sorted keys
>> >> probably makes sense.
>> >>
>> >> > For updates, speed would only be acceptable if you batch up a lot
>> >> > of updates or possibly if you build in a value append function as a
>> >> > co-processor.
>> >>
>> >> Hmm... I think the main issue would be the way Lucene implements
>> >> deletes (e.g., today as a BitVector).  I think we'd keep that
>> >> functionality.  New docs/updates would be added to the in-RAM buffer.
>> >> I think there'd be a RAM-size-based flush as there is today.  Where
>> >> that'd be flushed to is an open question.
>> >>
>> >> I think the key advantage of the RT + HBase architecture is that the
>> >> index would live alongside HBase columns, so all the other scaling
>> >> problems (especially those related to scaling RT, such as
>> >> synchronization of distributed data and updates) go away.
>> >>
>> >> A distributed query would remain the same, e.g., it'd hit N servers?
>> >>
>> >> In addition, Lucene offers a wide variety of new query types which
>> >> HBase'd get in realtime for free.
>> >>
>> >> On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
>> >> > On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>> >> >
>> >> >> > I can't imagine that the speed achieved by using HBase would be
>> >> >> > even within orders of magnitude of what you can do in Lucene 4
>> >> >> > (or even 3).
>> >> >>
>> >> >> The indexing speed in Lucene hasn't changed in quite a while; are
>> >> >> you saying HBase would somehow be overloaded?  That doesn't seem to
>> >> >> jibe with the sequential writes HBase performs?
>> >> >>
>> >> >
>> >> > Michi's stuff uses flexible indexing with a zero lock architecture.
>> >> > The speed *is* much higher.
>> >> >
>> >> > The real problem is that HBase repeats keys.
>> >> >
>> >> > If you were to store entire posting vectors as values with terms as
>> >> > keys, you might be OK.  Very long posting vectors or add-ons could
>> >> > be added using a key+serial number trick.
>> >> >
>> >> > Short queries would involve reading and merging several posting
>> >> > vectors.  In that mode, query speeds might be OK, but there isn't a
>> >> > lot of Lucene left at that point.  For updates, speed would only be
>> >> > acceptable if you batch up a lot of updates or possibly if you build
>> >> > in a value append function as a co-processor.
>> >> >
>> >> >
>> >> >
>> >> >> The speed of indexing is a function of creating segments; with
>> >> >> flexible indexing, the underlying segment files (and postings) may
>> >> >> be significantly altered from the default file structures, e.g.,
>> >> >> placed into HBase in various ways.  The posting lists could even be
>> >> >> split along with HBase regions?
>> >> >>
>> >> >
>> >> > Possibly.  But if you use term + counter and posting vectors of
>> >> > limited length you might be OK.
>> >> >
>> >>
>> >
>>
>
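
The read-side counterpart to the sketch above, same hypothetical
layout: since HBase keys are sorted, a bounded scan picks up a term's
chunks in serial order, and per Ted's point about splitting the larger
posting vectors, the scans for several terms could be run on separate
threads before merging:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PostingChunkReader {

      // Same hypothetical family/qualifier as the writer sketch.
      private static final byte[] FAMILY = Bytes.toBytes("p");
      private static final byte[] QUALIFIER = Bytes.toBytes("v");

      /** Reassembles one term's posting vector from its chunk rows. */
      public static byte[] readPostings(HTable table, String term)
          throws IOException {
        // Chunk rows run from "term/00000000" upward; "term0" is the
        // first key past the "term/" prefix ('0' is the byte after '/'),
        // so this scan covers exactly this term's chunks, in order.
        Scan scan = new Scan(Bytes.toBytes(term + "/"),
            Bytes.toBytes(term + "0"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            out.write(r.getValue(FAMILY, QUALIFIER));
          }
        } finally {
          scanner.close();
        }
        return out.toByteArray();
      }
    }

Note this still materializes the whole vector in RAM; true streaming
from the underlying storage would need the API that doesn't appear to
exist yet, as noted in the quoted thread.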
