No, and I doubt there ever will be. That was one reason to split the larger
posting vectors: that way you can multi-thread the fetching and the scoring.
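Roughly the shape I have in mind, as an untested sketch: each chunk of a split
posting vector is fetched on its own thread and scored as it arrives, with the
partial scores merged at the end. The table name, column family, and the
term#serial row-key scheme are invented for illustration, and this uses the old
HTable client API (one handle per worker, since HTable isn't thread-safe).

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ParallelPostingFetch {
      static final byte[] FAMILY = Bytes.toBytes("p");  // invented column family
      static final byte[] QUAL = Bytes.toBytes("v");    // invented qualifier

      public static void main(String[] args) throws Exception {
        final Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(8);

        final String term = "lucene";
        int numChunks = 4;  // e.g. read from a per-term header row
        List<Future<Float>> partials = new ArrayList<Future<Float>>();

        for (int i = 0; i < numChunks; i++) {
          final int serial = i;
          partials.add(pool.submit(new Callable<Float>() {
            public Float call() throws Exception {
              // HTable isn't thread-safe, so each worker opens its own handle.
              HTable table = new HTable(conf, "index");  // invented table name
              try {
                // Row key = term + '#' + zero-padded chunk serial number.
                Get get = new Get(Bytes.toBytes(String.format("%s#%05d", term, serial)));
                get.addColumn(FAMILY, QUAL);
                Result r = table.get(get);
                return score(r.getValue(FAMILY, QUAL));  // score this slice
              } finally {
                table.close();
              }
            }
          }));
        }

        float total = 0;
        for (Future<Float> f : partials) {
          total += f.get();  // merge the per-chunk partial scores
        }
        pool.shutdown();
        System.out.println("score(" + term + ") = " + total);
      }

      // Stand-in for decoding a chunk's postings and scoring them.
      static float score(byte[] chunk) {
        return chunk == null ? 0f : chunk.length;  // placeholder scoring
      }
    }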
On Fri, Feb 11, 2011 at 6:56 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:

> Thanks! In browsing the HBase code, I think it'd be optimal to stream
> the posting/binary data directly from the underlying storage (instead
> of loading the entire byte[]); it doesn't look like there's a way to
> do this (yet)?
>
> On Fri, Feb 11, 2011 at 6:20 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> > Go for it!
> >
> > On Fri, Feb 11, 2011 at 4:44 PM, Jason Rutherglen <
> > jason.rutherg...@gmail.com> wrote:
> >
> >> > Michi's stuff uses flexible indexing with a zero lock architecture.
> >> > The speed *is* much higher.
> >>
> >> The speed's higher, and there isn't much Lucene left there either, as
> >> I believe it was built specifically for the 140-character use case
> >> (eg, not the general use case). I don't think most indexes can be
> >> compressed to only exist in RAM on a single server? The Twitter use
> >> case isn't one that the HBase RT search solution is useful for?
> >>
> >> > If you were to store entire posting vectors as values with terms as
> >> > keys, you might be OK. Very long posting vectors or add-ons could be
> >> > added using a key+serial number trick.
> >>
> >> This sounds like the right approach to try. Also, the Lucene terms
> >> dict is sorted anyway, so moving the terms into HBase's sorted keys
> >> probably makes sense.
> >>
> >> > For updates, speed would only be acceptable if you batch up a lot of
> >> > updates or possibly if you build in a value append function as a
> >> > co-processor.
> >>
> >> Hmm... I think the main issue would be the way Lucene implements
> >> deletes (eg, today as a BitVector). I think we'd keep that
> >> functionality. The new docs/updates would be added to the in-RAM
> >> buffer. I think there'd be a RAM-size-based flush as there is today.
> >> Where that'd be flushed to is an open question.
> >>
> >> I think the key advantage of the RT + HBase architecture is that the
> >> index would live alongside HBase columns, and so all the other scaling
> >> problems (especially those related to scaling RT, such as
> >> synchronization of distributed data and updates) go away.
> >>
> >> A distributed query would remain the same, eg, it'd hit N servers?
> >>
> >> In addition, Lucene offers a wide variety of new query types which
> >> HBase'd get in realtime for free.
> >>
> >> On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <tdunn...@maprtech.com>
> >> wrote:
> >> > On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen <
> >> > jason.rutherg...@gmail.com> wrote:
> >> >
> >> >> > I can't imagine that the speed achieved by using HBase would be
> >> >> > even within orders of magnitude of what you can do in Lucene 4
> >> >> > (or even 3).
> >> >>
> >> >> The indexing speed in Lucene hasn't changed in quite a while; are you
> >> >> saying HBase would somehow be overloaded? That doesn't seem to jibe
> >> >> with the sequential writes HBase performs?
> >> >
> >> > Michi's stuff uses flexible indexing with a zero lock architecture.
> >> > The speed *is* much higher.
> >> >
> >> > The real problem is that HBase repeats keys.
> >> >
> >> > If you were to store entire posting vectors as values with terms as
> >> > keys, you might be OK. Very long posting vectors or add-ons could be
> >> > added using a key+serial number trick.
> >> >
> >> > Short queries would involve reading and merging several posting
> >> > vectors. In that mode, query speeds might be OK, but there isn't a
> >> > lot of Lucene left at that point.
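To make the terms-as-keys plus key+serial idea concrete, here is a rough,
untested sketch (the table layout, column family, and chunk size are all
invented): each chunk of a long posting vector becomes its own row, keyed
term + '#' + zero-padded serial, so a prefix scan returns the chunks in order
and a batched put writes them in one round trip.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TermKeyedPostings {
      static final byte[] FAMILY = Bytes.toBytes("p");  // invented column family
      static final byte[] QUAL = Bytes.toBytes("v");    // invented qualifier
      static final int CHUNK = 64 * 1024;               // split vectors into 64 KB chunks

      // Write one (possibly very long) posting vector as rows
      // term#00000, term#00001, ... in a single batched round trip.
      static void writePostings(HTable table, String term, byte[] postings)
          throws IOException {
        List<Put> puts = new ArrayList<Put>();
        for (int i = 0, serial = 0; i < postings.length; i += CHUNK, serial++) {
          int len = Math.min(CHUNK, postings.length - i);
          byte[] chunk = new byte[len];
          System.arraycopy(postings, i, chunk, 0, len);
          Put put = new Put(Bytes.toBytes(String.format("%s#%05d", term, serial)));
          put.add(FAMILY, QUAL, chunk);
          puts.add(put);
        }
        table.put(puts);  // one batched write for all chunks
      }

      // Read the chunks back in key order with a prefix scan;
      // '~' sorts after the digits, so it bounds the term's serials.
      static List<byte[]> readPostings(HTable table, String term)
          throws IOException {
        Scan scan = new Scan(Bytes.toBytes(term + "#"), Bytes.toBytes(term + "#~"));
        ResultScanner scanner = table.getScanner(scan);
        List<byte[]> chunks = new ArrayList<byte[]>();
        try {
          for (Result r : scanner) {
            chunks.add(r.getValue(FAMILY, QUAL));
          }
        } finally {
          scanner.close();
        }
        return chunks;
      }
    }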
> >> > For updates, speed would only be acceptable if you batch up a lot of
> >> > updates or possibly if you build in a value append function as a
> >> > co-processor.
> >> >
> >> >> The speed of indexing is a function of creating segments; with
> >> >> flexible indexing, the underlying segment files (and postings) may be
> >> >> significantly altered from the default file structures, eg, placed
> >> >> into HBase in various ways. The posting lists could even be split
> >> >> along with HBase regions?
> >> >
> >> > Possibly. But if you use term + counter and post vectors of limited
> >> > length you might be OK.
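On the batching point, a minimal sketch of what a RAM-buffered writer could
look like: new postings are buffered per term and flushed to HBase in one
batched put once a size threshold is crossed, much like Lucene's RAM-based
flush. The threshold, key scheme, and encoding are all invented, and the
per-term serial counters would really have to survive restarts; a value-append
co-processor would avoid the serial rows entirely by appending to a single cell.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedPostingWriter {
      static final byte[] FAMILY = Bytes.toBytes("p");   // invented column family
      static final byte[] QUAL = Bytes.toBytes("v");     // invented qualifier
      static final long FLUSH_BYTES = 32 * 1024 * 1024;  // flush at ~32 MB of buffered postings

      private final HTable table;
      private final Map<String, List<Integer>> buffer = new HashMap<String, List<Integer>>();
      private final Map<String, Integer> nextSerial = new HashMap<String, Integer>();
      private long bufferedBytes = 0;

      BufferedPostingWriter(HTable table) {
        this.table = table;
      }

      // Buffer one new posting; flush when the RAM budget is exceeded.
      void add(String term, int docId) throws IOException {
        List<Integer> docs = buffer.get(term);
        if (docs == null) {
          docs = new ArrayList<Integer>();
          buffer.put(term, docs);
        }
        docs.add(docId);
        bufferedBytes += 4;
        if (bufferedBytes >= FLUSH_BYTES) {
          flush();
        }
      }

      // Write each term's buffered postings as a new term#serial row, so
      // existing vectors are never rewritten; readers merge chunks at query time.
      void flush() throws IOException {
        List<Put> puts = new ArrayList<Put>();
        for (Map.Entry<String, List<Integer>> e : buffer.entrySet()) {
          int serial = nextSerial.containsKey(e.getKey()) ? nextSerial.get(e.getKey()) : 0;
          nextSerial.put(e.getKey(), serial + 1);
          Put put = new Put(Bytes.toBytes(String.format("%s#%05d", e.getKey(), serial)));
          put.add(FAMILY, QUAL, encode(e.getValue()));
          puts.add(put);
        }
        table.put(puts);  // one batched round trip for the whole buffer
        buffer.clear();
        bufferedBytes = 0;
      }

      // Placeholder encoding: 4 bytes per doc id (real postings would be compressed).
      static byte[] encode(List<Integer> docs) {
        byte[] out = new byte[docs.size() * 4];
        for (int i = 0; i < docs.size(); i++) {
          Bytes.putInt(out, i * 4, docs.get(i));
        }
        return out;
      }
    }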