Go for it!

On Fri, Feb 11, 2011 at 4:44 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:
> > Michi's stuff uses flexible indexing with a zero lock architecture. The
> > speed *is* much higher.
>
> The speed's higher, and there isn't much Lucene left there either, as
> I believe it was built specifically for the 140-character use case
> (eg, not the general use case). I don't think most indexes can be
> compressed to only exist in RAM on a single server? The Twitter use
> case isn't one that the HBase RT search solution is useful for?
>
> > If you were to store entire posting vectors as values with terms as
> > keys, you might be OK. Very long posting vectors or add-ons could be
> > added using a key+serial number trick.
>
> This sounds like the right approach to try. Also, the Lucene terms
> dict is sorted anyway, so moving the terms into HBase's sorted keys
> probably makes sense.
>
> > For updates, speed would only be acceptable if you batch up a lot of
> > updates or possibly if you build in a value append function as a
> > co-processor.
>
> Hmm... I think the main issue would be the way Lucene implements
> deletes (eg, today as a BitVector). I think we'd keep that
> functionality. The new docs/updates would be added to the
> in-RAM buffer. I think there'd be a RAM-size-based flush as there is
> today. Where that'd be flushed to is an open question.
>
> I think the key advantage of the RT + HBase architecture is that the
> index would live alongside HBase columns, and so all other scaling
> problems (especially those related to scaling RT, such as
> synchronization of distributed data and updates) go away.
>
> A distributed query would remain the same, eg, it'd hit N servers?
>
> In addition, Lucene offers a wide variety of new query types which
> HBase'd get in realtime for free.
>
> On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <tdunn...@maprtech.com>
> wrote:
> > On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen <
> > jason.rutherg...@gmail.com> wrote:
> >
> >> > I can't imagine that the speed achieved by using HBase would be
> >> > even within orders of magnitude of what you can do in Lucene 4 (or
> >> > even 3).
> >>
> >> The indexing speed in Lucene hasn't changed in quite a while, are you
> >> saying HBase would somehow be overloaded? That doesn't seem to jibe
> >> with the sequential writes HBase performs?
> >>
> >
> > Michi's stuff uses flexible indexing with a zero lock architecture.
> > The speed *is* much higher.
> >
> > The real problem is that HBase repeats keys.
> >
> > If you were to store entire posting vectors as values with terms as
> > keys, you might be OK. Very long posting vectors or add-ons could be
> > added using a key+serial number trick.
> >
> > Short queries would involve reading and merging several posting
> > vectors. In that mode, query speeds might be OK, but there isn't a lot
> > of Lucene left at that point. For updates, speed would only be
> > acceptable if you batch up a lot of updates or possibly if you build
> > in a value append function as a co-processor.
> >
> >> The speed of indexing is a function of creating segments, with
> >> flexible indexing, the underlying segment files (and postings) may be
> >> significantly altered from the default file structures, eg, placed
> >> into HBase in various ways. The posting lists could even be split
> >> along with HBase regions?
> >
> > Possibly. But if you use term + counter and posting vectors of limited
> > length you might be OK.
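
To make the term-as-key layout concrete, here is a minimal sketch of the
write path, assuming a hypothetical table with one column family "p" and
one qualifier "v" (all names illustrative, not an existing implementation).
Long posting vectors are chunked under row keys of the form
term + '\0' + serial, and the Puts are batched, per the "batch up a lot of
updates" point above. The calls are the classic HBase Java client API:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PostingWriter {
      private static final byte[] FAMILY = Bytes.toBytes("p"); // hypothetical
      private static final byte[] QUAL = Bytes.toBytes("v");   // hypothetical
      private static final int CHUNK = 64 * 1024; // max postings bytes per cell

      // Store one term's serialized posting vector, split into fixed-size
      // chunks. The row key is term + '\0' + 4-byte serial, so a term's
      // chunks sort adjacently in HBase's key order.
      public static void writePostings(HTable table, String term,
          byte[] postings) throws IOException {
        List<Put> batch = new ArrayList<Put>();
        int serial = 0;
        for (int off = 0; off < postings.length; off += CHUNK, serial++) {
          int len = Math.min(CHUNK, postings.length - off);
          byte[] row = Bytes.add(Bytes.toBytes(term + "\0"),
              Bytes.toBytes(serial));
          Put put = new Put(row);
          put.add(FAMILY, QUAL, Bytes.copy(postings, off, len));
          batch.add(put);
        }
        table.put(batch); // one batched round trip instead of one per chunk
      }
    }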
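The read side, the "reading and merging several posting vectors" step for a
short query, would then be a prefix scan per term, concatenating the chunks
back in serial order. Again a sketch against the hypothetical schema above:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PostingReader {
      // Fetch every chunk of one term's posting vector. The chunks sort
      // adjacently, so a scan starting at the term prefix returns them in
      // serial order, and the PrefixFilter stops the scan once the row
      // keys move past the term.
      public static byte[] readPostings(HTable table, String term)
          throws IOException {
        byte[] prefix = Bytes.toBytes(term + "\0");
        Scan scan = new Scan(prefix);
        scan.setFilter(new PrefixFilter(prefix));
        ByteArrayOutputStream merged = new ByteArrayOutputStream();
        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            merged.write(r.getValue(Bytes.toBytes("p"), Bytes.toBytes("v")));
          }
        } finally {
          scanner.close();
        }
        return merged.toByteArray();
      }
    }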
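On the "value append function as a co-processor" idea: for reference, a
server-side value append of exactly this shape later shipped in the HBase
client API as org.apache.hadoop.hbase.client.Append (0.94+). A sketch of an
incremental posting update with it, same hypothetical schema, with chunk
overflow (rolling to the next serial) left out:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Append;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PostingAppender {
      // Append freshly flushed postings to a term's existing chunk in one
      // server-side read-modify-write, instead of a client-side
      // get + concatenate + put.
      public static void appendPostings(HTable table, String term,
          int serial, byte[] newPostings) throws IOException {
        byte[] row = Bytes.add(Bytes.toBytes(term + "\0"),
            Bytes.toBytes(serial));
        Append append = new Append(row);
        append.add(Bytes.toBytes("p"), Bytes.toBytes("v"), newPostings);
        table.append(append);
      }
    }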