> The analysis can be very slow if you are doing Tika things and named entity
> extraction and PDF interpretation and so on.

I'd consider those separate use cases, where realtime likely isn't
important.  If large [static] documents are being stored in HBase, why
would expediency be required?
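
For what it's worth, the update-behind idea quoted further down (the index
catching up from the write log in batches, rather than blocking the row
insert) could be sketched roughly like this. It is only an illustration
under stated assumptions: `UpdateBehindIndexer`, `insert`, `drain_once` and
`max_batch` are hypothetical names, the queue merely stands in for the HBase
write log, and the dict stands in for a Lucene index. No real HBase or
Lucene API is used.

```python
import queue
import threading


class UpdateBehindIndexer:
    """Hypothetical sketch: inserts return immediately, indexing catches up in batches."""

    def __init__(self, analyze, max_batch=100):
        self.write_log = queue.Queue()  # stand-in for the HBase write log (assumption)
        self.index = {}                 # stand-in for the Lucene index (assumption)
        self.analyze = analyze          # possibly slow: Tika, entity extraction, ...
        self.max_batch = max_batch
        self.batch_sizes = []           # recorded only to show how batches grow

    def insert(self, row_key, document):
        # The insert path only appends to the log; no analysis happens here.
        self.write_log.put((row_key, document))

    def drain_once(self):
        """Index one batch, sized by whatever backlog has accumulated."""
        batch = []
        while len(batch) < self.max_batch:
            try:
                batch.append(self.write_log.get_nowait())
            except queue.Empty:
                break
        for row_key, document in batch:
            self.index[row_key] = self.analyze(document)
        if batch:
            self.batch_sizes.append(len(batch))
        return len(batch)

    def run(self, stop_event, idle_wait=0.01):
        # Keep draining until asked to stop and the backlog is empty.
        while not stop_event.is_set() or not self.write_log.empty():
            if self.drain_once() == 0:
                stop_event.wait(idle_wait)


if __name__ == "__main__":
    indexer = UpdateBehindIndexer(analyze=lambda doc: doc.lower().split(), max_batch=3)
    stop = threading.Event()
    worker = threading.Thread(target=indexer.run, args=(stop,))
    worker.start()
    for i in range(7):
        indexer.insert(f"row{i}", f"Document {i}")  # returns without waiting for analysis
    stop.set()
    worker.join()
    print(len(indexer.index), indexer.batch_sizes)
```

The trade-off is the one debated in this thread: the insert never waits on
analysis, but the index lags the row store by however long the backlog takes
to drain.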

On Mon, Feb 14, 2011 at 11:18 AM, Ted Dunning <tdunn...@maprtech.com> wrote:
> The analysis can be very slow if you are doing Tika things and named entity
> extraction and PDF interpretation and so on.
>
> On Mon, Feb 14, 2011 at 11:09 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> The older versions of Lucene NRT indexing are slow; the newer version
>> with RT will be as fast as Lucene's batch indexing is today, which I'm
>> guessing will be fast enough for many/most users?  E.g., it's simply
>> analyzing and throwing the data into a RAM buffer (there's no IO or
>> segment merging happening).
>>
>> On Mon, Feb 14, 2011 at 10:57 AM, Ted Dunning <tdunn...@maprtech.com>
>> wrote:
>> > I would find that unacceptable for many systems I have worked on.  Lucene
>> > update-behind would be fine, but waiting the insert until all of the Lucene
>> > stuff happened would not be acceptable.
>> >
>> > I would much rather that Lucene update from the write log in batches that
>> > are as big as needed to catch/keep up.
>> >
>> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
>> > jason.rutherg...@gmail.com> wrote:
>> >
>> >> > Yes, that should work. But doesn't it assume that the index is updated
>> >> > synchronously with the HBase row? I can imagine this will sometimes be an
>> >> > issue, e.g. if it would involve performing expensive content extraction
>> >> > (tika) or analysis.
>> >>
>> >> I don't follow here.  You mean that the delay in indexing a
>> >> document will adversely affect the HBase row insert because it's all
>> >> in the same transaction?  I think that's fine, e.g., it's just how the
>> >> system would work?
>> >
>>
>
