On Mon, Feb 14, 2011 at 12:37 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> > Another issue is that maybe the scalability needs for search might be
> > different. An HBase region is always only active in one region server,
> > there are no active replicas, while often for search you need replicas
> > to scale, since a search will typically hit all partitions.
>
> Really? That seems odd.

Yep, really. The replication is only on the HDFS level. For HBase, this
is not much of a problem as long as the requests are not strongly skewed
towards one region (which requires users to choose their row keys
carefully), but for search this could be a real issue.

Also, HBase and Lucene might differ in how many rows/documents they can
handle on one server, or in one region (an HBase region is typically
only 256MB), leading to difficult choices (optimize region size for
HBase vs. for Lucene).

> > to be the main action and all what follows just secondary side-effects
> > (i.e. there's no rollback).
>
> I think inside a Coprocessor you could block the HBase 'commit' until
> a successful updateDoc call to Lucene (which is only an update to RAM
> anyways)?

Yes, that should work. But doesn't it assume that the index is updated
synchronously with the HBase row? I can imagine this will sometimes be
an issue, e.g. if it involves expensive content extraction (Tika) or
analysis.

BTW, something we do in Lily, and which might be interesting to think
about in this context as well, is denormalization: the Lucene document
of an HBase row also stores information from related (linked) rows.
This requires that, when one row changes, you find out which other rows
denormalize info from this row, and update the Lucene documents of
those rows as well. Just bringing this up as a random feature to think
about ;-)

--
Bruno Dumon
Outerthought
http://outerthought.org/
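
P.S. To make the denormalization bookkeeping a bit more concrete, here
is a minimal sketch in Python. It is not Lily's actual implementation
and uses neither the HBase nor the Lucene API: plain dicts stand in for
the table and the index, and every name in it (DenormalizingIndexer,
put, dependents, ...) is hypothetical. It only follows links one level
deep and reindexes synchronously, which is exactly the assumption
questioned above.

```python
from collections import defaultdict


class DenormalizingIndexer:
    """Toy model: `rows` stands in for HBase, `docs` for the Lucene index."""

    def __init__(self):
        self.rows = {}                      # row key -> {"fields": {...}, "links": [...]}
        self.docs = {}                      # row key -> flattened search document
        self.dependents = defaultdict(set)  # row key -> rows that copy data from it

    def put(self, key, fields, links=()):
        """Store a row, then reindex it and every row that denormalizes from it."""
        old = self.rows.get(key)
        if old is not None:
            # Drop reverse links that belonged to the previous version of the row.
            for target in old["links"]:
                self.dependents[target].discard(key)
        self.rows[key] = {"fields": dict(fields), "links": list(links)}
        for target in links:
            self.dependents[target].add(key)
        # The changed row itself, plus the rows that embed its fields.
        reindexed = [key] + sorted(self.dependents[key] - {key})
        for k in reindexed:
            self._reindex(k)
        return reindexed

    def _reindex(self, key):
        """Rebuild the search document: own fields plus fields of linked rows."""
        row = self.rows[key]
        doc = dict(row["fields"])
        for target in row["links"]:
            linked = self.rows.get(target)
            if linked is not None:
                for name, value in linked["fields"].items():
                    doc[f"{target}.{name}"] = value
        self.docs[key] = doc


# Hypothetical example data: a "book" row that denormalizes its author's name.
idx = DenormalizingIndexer()
idx.put("author1", {"name": "Ann"})
idx.put("book1", {"title": "HBase notes"}, links=["author1"])
changed = idx.put("author1", {"name": "Ann B."})
print(changed)            # ['author1', 'book1']
print(idx.docs["book1"])  # {'title': 'HBase notes', 'author1.name': 'Ann B.'}
```

The reverse map (dependents) is the part that answers "which other rows
denormalize info from this row". In a real setup it would have to be
stored somewhere durable, and the reindexing of dependents would
probably be done asynchronously rather than inside the write path.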