> Michi's stuff uses flexible indexing with a zero lock architecture. The
> speed *is* much higher.

The speed's higher, and there isn't much Lucene left there either, as
I believe it was built specifically for the 140-character use case
(i.e., not the general use case). I don't think most indexes can be
compressed to the point of fitting entirely in RAM on a single
server? So the Twitter use case isn't one the HBase RT search
solution would be useful for anyway.

> If you were to store entire posting vectors as values with terms as keys,
> you might be OK. Very long posting vectors or add-ons could be added using
> a key+serial number trick.

This sounds like the right approach to try. Also, the Lucene terms
dict is sorted anyway, so moving the terms into HBase's sorted keys
probably makes sense. (Rough sketches of the write and read paths are
at the bottom of this mail.)

> For updates, speed would only be acceptable if you batch up a
> lot updates or possibly if you build in a value append function as a
> co-processor.

Hmm... I think the main issue would be the way Lucene implements
deletes (e.g., today as a BitVector). I think we'd keep that
functionality. The new docs/updates would be added to the in-RAM
buffer, and there'd be a RAM-size-based flush as there is today.
Where that'd be flushed to is an open question.

I think the key advantage of the RT + HBase architecture is that the
index would live alongside the HBase columns, so all the other
scaling problems (especially those related to scaling RT, such as
synchronizing distributed data and updates) go away. A distributed
query would remain the same, i.e., it'd hit N servers. In addition,
Lucene offers a wide variety of query types which HBase would get in
realtime for free.

On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> > I can't imagine that the speed achieved by using Hbase would be even
>> within
>> > orders of magnitude of what you can do in Lucene 4 (or even 3).
>>
>> The indexing speed in Lucene hasn't changed in quite a while, are you
>> saying HBase would somehow be overloaded? That doesn't seem to jive
>> with the sequential writes HBase performs?
>>
>
> Michi's stuff uses flexible indexing with a zero lock architecture. The
> speed *is* much higher.
>
> The real problem is that hbase repeats keys.
>
> If you were to store entire posting vectors as values with terms as keys,
> you might be OK. Very long posting vectors or add-ons could be added using
> a key+serial number trick.
>
> Short queries would involve reading and merging several posting vectors. In
> that mode, query speeds might be OK, but there isn't a lot of Lucene left at
> that point. For updates, speed would only be acceptable if you batch up a
> lot updates or possibly if you build in a value append function as a
> co-processor.
>
>> The speed of indexing is a function of creating segments, with
>> flexible indexing, the underlying segment files (and postings) may be
>> significantly altered from the default file structures, eg, placed
>> into HBase in various ways. The posting lists could even be split
>> along with HBase regions?
>
> Possibly. But if you use term + counter and post vectors of limited length
> you might be OK.
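
To make the terms-as-keys idea concrete, here's a very rough sketch of
the write path against the current (0.90-era) client API. The "postings"
table, the "p" family, and the encoding are all made up for illustration;
this is just the key + serial number layout Ted described, not a real
design.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PostingsWriter {
  private static final byte[] FAMILY = Bytes.toBytes("p");
  private static final byte[] QUAL = Bytes.toBytes("v");

  private final HTable table;

  public PostingsWriter(Configuration conf) throws IOException {
    table = new HTable(conf, "postings");
    table.setAutoFlush(false); // batch up puts, flush explicitly
  }

  // Row key = term + 0x00 separator + big-endian chunk serial. The
  // separator keeps a term that is a prefix of another term from
  // interleaving with that term's chunks; the fixed-width serial keeps
  // a term's chunks sorted in write order under HBase's sorted keys.
  static byte[] rowKey(String term, int chunk) {
    return Bytes.add(Bytes.toBytes(term), new byte[] { 0 },
        Bytes.toBytes(chunk));
  }

  // 'postings' would be a delta + vInt encoded doc id list in practice;
  // it's treated as opaque bytes here.
  public void writeChunk(String term, int chunk, byte[] postings)
      throws IOException {
    Put put = new Put(rowKey(term, chunk));
    put.add(FAMILY, QUAL, postings);
    table.put(put);
  }

  public void flush() throws IOException {
    table.flushCommits();
  }
}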
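
And a matching sketch of the read side: one range scan picks up all of
a term's chunks in order (this is where the sorted keys pay off), and
the merge applies a deleted-docs filter while intersecting. Again the
names and types are invented, and java.util.BitSet just stands in for
Lucene's BitVector.

import java.io.IOException;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PostingsReader {
  private static final byte[] FAMILY = Bytes.toBytes("p");
  private static final byte[] QUAL = Bytes.toBytes("v");

  private final HTable table;

  public PostingsReader(HTable table) { this.table = table; }

  // All chunks for a term live in the contiguous key range
  // [term 0x00, term 0x01), so a single scan returns them in order.
  public List<byte[]> readChunks(String term) throws IOException {
    byte[] start = Bytes.add(Bytes.toBytes(term), new byte[] { 0 });
    byte[] stop = Bytes.add(Bytes.toBytes(term), new byte[] { 1 });
    ResultScanner scanner = table.getScanner(new Scan(start, stop));
    List<byte[]> chunks = new ArrayList<byte[]>();
    try {
      for (Result r : scanner) {
        chunks.add(r.getValue(FAMILY, QUAL));
      }
    } finally {
      scanner.close();
    }
    return chunks;
  }

  // Merge-intersect two sorted doc id lists (a two-term AND query),
  // skipping anything flagged in the deleted-docs bit set.
  static int[] intersect(int[] a, int[] b, BitSet deleted) {
    int[] out = new int[Math.min(a.length, b.length)];
    int i = 0, j = 0, k = 0;
    while (i < a.length && j < b.length) {
      if (a[i] < b[j]) {
        i++;
      } else if (a[i] > b[j]) {
        j++;
      } else {
        if (!deleted.get(a[i])) {
          out[k++] = a[i];
        }
        i++;
        j++;
      }
    }
    return java.util.Arrays.copyOf(out, k);
  }
}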