> Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> not much of a problem as long as the requests are not strongly skewed
> towards one region (this requires good consideration from users when
> choosing row keys), but for search this could be a real issue.

I think this can be solved rather easily? Or is there an underlying
design rationale?

> Also, HBase and Lucene might be different in how much rows/documents they
> can handle on one server, or in one region (an HBase region is typically
> only 256MB), leading to difficult choices (optimize region size for hbase vs
> for lucene).

I think in that case we can either map multiple regions to one Lucene index
or increase the size of the HBase region. Either way would be fine.

> Yes, that should work. But doesn't it assume that the index is updated
> synchronously with the HBase row? I can imagine this will sometimes be an
> issue, e.g. if it would involve performing expensive content extraction
> (tika) or analysis.

I don't understand this. Do you mean that the delay in indexing a document
will adversely affect the HBase row insert because it's all in the same
transaction? I think that's fine, i.e. it's just how the system would work?

On Mon, Feb 14, 2011 at 9:28 AM, Bruno Dumon <br...@outerthought.org> wrote:

> On Mon, Feb 14, 2011 at 12:37 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> > Another issue is that maybe the scalability needs for search might be
>> > different. An HBase region is always only active in one region server,
>> > there are no active replica's, while often for search you need replicas
>> > to scale, since a search will typically hit all partitions.
>>
>> Really? That seems odd.
>
> Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> not much of a problem as long as the requests are not strongly skewed
> towards one region (this requires good consideration from users when
> choosing row keys), but for search this could be a real issue.
>
> Also, HBase and Lucene might be different in how much rows/documents they
> can handle on one server, or in one region (an HBase region is typically
> only 256MB), leading to difficult choices (optimize region size for hbase vs
> for lucene).
>
>> > to be the main action and all what follows just secondary side-effects
>> > (i.e. there's no rollback).
>>
>> I think inside a Coprocessor you could block the HBase 'commit' until
>> a successful updateDoc call to Lucene (which is only an update to RAM
>> anyways)?
>
> Yes, that should work. But doesn't it assume that the index is updated
> synchronously with the HBase row? I can imagine this will sometimes be an
> issue, e.g. if it would involve performing expensive content extraction
> (tika) or analysis.
>
> BTW, something we do in Lily, and which might be interesting to think about
> in this context as well, is denormalization, thus in the Lucene document of
> some HBase row information is stored from related (linked) rows. This
> requires that, when one row changes, you need to find out what other rows
> denormalize info from this row, and update the Lucene documents of those
> rows as well. Just bringing this up as a random feature to think about ;-)
>
> --
> Bruno Dumon
> Outerthought
> http://outerthought.org/
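
To make the denormalization point above concrete, here is a minimal sketch of
the bookkeeping it implies. It is not how Lily, HBase, or Lucene implement
this: plain Python dicts stand in for the HBase table and the Lucene index,
and the class name, field names ("title", "links") and methods are all
hypothetical. The idea is only that a reverse map from a row to the rows that
copy data from it tells you which documents to reindex when that row changes.

```python
from collections import defaultdict


class DenormalizingIndexer:
    """Toy in-memory model: dicts stand in for HBase rows and Lucene documents."""

    def __init__(self):
        self.rows = {}                      # row key -> {"title": str, "links": [row keys]}
        self.dependents = defaultdict(set)  # row key -> rows that copy data from it
        self.docs = {}                      # row key -> indexed document

    def put(self, key, title, links=()):
        # Drop the reverse links recorded for the previous version of this row.
        for old_target in self.rows.get(key, {}).get("links", ()):
            self.dependents[old_target].discard(key)

        self.rows[key] = {"title": title, "links": list(links)}
        for target in links:
            self.dependents[target].add(key)

        # Reindex the row itself plus every row that denormalizes from it.
        touched = {key} | self.dependents[key]
        for k in touched:
            self._reindex(k)
        return touched

    def _reindex(self, key):
        row = self.rows[key]
        # Copy the titles of linked rows into this row's document.
        linked = [self.rows[t]["title"] for t in row["links"] if t in self.rows]
        self.docs[key] = {"title": row["title"], "linked_titles": linked}


idx = DenormalizingIndexer()
idx.put("author1", "Alice")
idx.put("book1", "HBase Notes", links=["author1"])
# Changing author1 also refreshes book1's document.
changed = idx.put("author1", "Alice B.")
```

In a real deployment the reverse map would itself have to be stored and kept
consistent somewhere (that is an assumption of this sketch, not something the
thread specifies), and the fan-out of one row change to many dependent
documents is part of why synchronous indexing could get expensive.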