Re: HBase and Lucene for realtime search

Jason Rutherglen Mon, 14 Feb 2011 09:48:59 -0800

> Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> not much of a problem as long as the requests are not strongly skewed
> towards one region (this requires good consideration from users when
> choosing row keys), but for search this could be a real issue.


I think this can be solved rather easily?  Or is there an underlying
design rationale?

> Also, HBase and Lucene might be different in how many rows/documents they
> can handle on one server, or in one region (an HBase region is typically
> only 256MB), leading to difficult choices (optimize region size for hbase vs
> for lucene).

I think in that case, we can either map multiple regions to a Lucene
index or increase the size of the HBase region.  Either way'd be fine.
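
To make the first option concrete, here is a rough sketch of grouping
adjacent regions under one shared index. It is plain Python and purely
illustrative: the start keys, the grouping factor, and both function
names are assumptions of mine, not anything from HBase or Lucene.

```python
from bisect import bisect_right

# Hypothetical region layout: sorted start keys, first region starts at "".
REGION_START_KEYS = ["", "g", "n", "t"]
# Assumed grouping factor: how many adjacent regions share one index.
REGIONS_PER_INDEX = 2


def region_for_row(row_key: str) -> int:
    """Return the position of the region whose key range contains row_key."""
    # Start keys are inclusive, so a key equal to a start key belongs
    # to the region that begins there.
    return bisect_right(REGION_START_KEYS, row_key) - 1


def index_for_row(row_key: str) -> int:
    """Return the index shared by the group of regions holding row_key."""
    return region_for_row(row_key) // REGIONS_PER_INDEX
```

With this layout, rows in the first two regions land in index 0 and rows
in the last two in index 1. A region split would change the start-key
list, so a real version would need to handle regrouping.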

> Yes, that should work. But doesn't it assume that the index is updated
> synchronously with the HBase row? I can imagine this will sometimes be an
> issue, e.g. if it would involve performing expensive content extraction
> (tika) or analysis.

I don't follow here.  You mean that the delay in indexing a
document will adversely affect the HBase row insert because it's all
in the same transaction?  I think that's fine, i.e., it's just how the
system'd work?
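
Roughly what I have in mind for the synchronous path, as a toy sketch.
This is plain Python with stand-in classes; InMemoryIndex, Table, put
and update_doc are hypothetical names, not the Coprocessor or Lucene
APIs. The row is written only after the index update returns, so a slow
or failing index step delays or rejects the insert.

```python
class InMemoryIndex:
    """Stand-in for a RAM-resident index (not the Lucene API)."""

    def __init__(self):
        self.docs = {}

    def update_doc(self, row_key, fields):
        # Any expensive extraction or analysis would happen here, and the
        # caller waits for it.
        if not fields:
            raise ValueError("nothing to index")
        self.docs[row_key] = dict(fields)


class Table:
    """Stand-in for a row store with a pre-commit hook (not the HBase API)."""

    def __init__(self, index):
        self.rows = {}
        self.index = index

    def put(self, row_key, fields):
        # Update the index first; if it raises, the row is never written.
        self.index.update_doc(row_key, fields)
        self.rows[row_key] = dict(fields)
```

The sketch does not cover the opposite failure (index updated, row write
fails), which would leave a stale document in the index.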

On Mon, Feb 14, 2011 at 9:28 AM, Bruno Dumon <br...@outerthought.org> wrote:
> On Mon, Feb 14, 2011 at 12:37 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>
>> > Another issue is that maybe the scalability needs for search might be
>> > different. An HBase region is always only active in one region server,
>> > there are no active replicas, while often for search you need replicas
>> > to scale, since a search will typically hit all partitions.
>>
>>
>> Really?  That seems odd.
>>
>
> Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> not much of a problem as long as the requests are not strongly skewed
> towards one region (this requires good consideration from users when
> choosing row keys), but for search this could be a real issue.
>
> Also, HBase and Lucene might be different in how many rows/documents they
> can handle on one server, or in one region (an HBase region is typically
> only 256MB), leading to difficult choices (optimize region size for hbase vs
> for lucene).
>
>
>> > to be the main action and all what follows just secondary side-effects
>> > (i.e. there's no rollback).
>>
>> I think inside a Coprocessor you could block the HBase 'commit' until
>> a successful updateDoc call to Lucene (which is only an update to RAM
>> anyways)?
>>
>
> Yes, that should work. But doesn't it assume that the index is updated
> synchronously with the HBase row? I can imagine this will sometimes be an
> issue, e.g. if it would involve performing expensive content extraction
> (tika) or analysis.
>
> BTW, something we do in Lily, and which might be interesting to think about
> in this context as well, is denormalization, thus in the Lucene document of
> some HBase row information is stored from related (linked) rows. This
> requires that, when one row changes, you need to find out what other rows
> denormalize info from this row, and update the Lucene documents of those
> rows as well. Just bringing this up as a random feature to think about ;-)
>
> --
> Bruno Dumon
> Outerthought
> http://outerthought.org/
>
