On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <tdunn...@maprtech.com> wrote:
> On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen
> <jason.rutherg...@gmail.com> wrote:
>
> > > I can't imagine that the speed achieved by using HBase would be even
> > > within orders of magnitude of what you can do in Lucene 4 (or even 3).
> >
> > The indexing speed in Lucene hasn't changed in quite a while; are you
> > saying HBase would somehow be overloaded? That doesn't seem to jibe
> > with the sequential writes HBase performs.
>
> Michi's stuff uses flexible indexing with a zero-lock architecture. The
> speed *is* much higher.
>
> The real problem is that HBase repeats keys.
>
> If you were to store entire posting vectors as values with terms as keys,
> you might be OK. Very long posting vectors or add-ons could be added using
> a key+serial number trick.
>
> Short queries would involve reading and merging several posting vectors.
> In that mode, query speeds might be OK, but there isn't a lot of Lucene
> left at that point. For updates, speed would only be acceptable if you
> batch up a lot of updates, or possibly if you build in a value-append
> function as a co-processor.

"speed would only be acceptable if you batch up" -- I understand what you
are talking about here (without batching up, HBase simply becomes very
sluggish). Can you comment on whether Cassandra needs a batch-up mode? (I
recall Twitter said they just keep putting results into Cassandra for its
analytics application.)

> > The speed of indexing is a function of creating segments; with
> > flexible indexing, the underlying segment files (and postings) may be
> > significantly altered from the default file structures, e.g., placed
> > into HBase in various ways. The posting lists could even be split
> > along with HBase regions?
>
> Possibly. But if you use term + counter and posting vectors of limited
> length you might be OK.

--
--Sean
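P.S. To make the key+serial-number trick concrete, here is a rough,
self-contained Java sketch. The class, key format, and chunk size are my
own illustration, not anything from Lucene or the HBase client API: each
term's posting vector is split into fixed-size chunks stored under a row
key of term plus a zero-padded serial, so a lexicographic prefix scan
(which is what HBase rows give you) recovers the whole vector in order.

```java
import java.util.*;

// Hypothetical sketch of the "key+serial number" trick: a long posting
// vector is chunked across several rows, each keyed by term#serial.
public class PostingChunks {
    static final int CHUNK_SIZE = 4; // tiny for illustration only

    // Build rowKey -> chunk pairs for one term's sorted posting list.
    static Map<String, int[]> chunk(String term, int[] postings) {
        Map<String, int[]> rows = new TreeMap<>(); // TreeMap mimics HBase's sorted row keys
        for (int i = 0, serial = 0; i < postings.length; i += CHUNK_SIZE, serial++) {
            int end = Math.min(i + CHUNK_SIZE, postings.length);
            // Zero-padding keeps serials in order under lexicographic sort.
            String rowKey = String.format("%s#%06d", term, serial);
            rows.put(rowKey, Arrays.copyOfRange(postings, i, end));
        }
        return rows;
    }

    // A short query reads all chunks for a term (a prefix scan in HBase)
    // and concatenates them back into one posting vector.
    static int[] merge(Map<String, int[]> rows, String term) {
        List<Integer> out = new ArrayList<>();
        for (Map.Entry<String, int[]> e : rows.entrySet())
            if (e.getKey().startsWith(term + "#"))
                for (int doc : e.getValue()) out.add(doc);
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        int[] postings = {1, 5, 9, 12, 40, 41, 77};
        Map<String, int[]> rows = chunk("lucene", postings);
        System.out.println(rows.keySet());               // two chunk rows
        System.out.println(Arrays.toString(merge(rows, "lucene")));
    }
}
```

The batching point in the thread maps onto this sketch too: appending one
doc to a chunk means a read-modify-write of that row, which is why Ted
suggests batching updates or doing the append server-side in a co-processor.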