> The analysis can be very slow if you are doing Tika things and named entity
> extraction and PDF interpretation and so on.
I'd consider those different/separate use cases where likely realtime
isn't important?  If large [static] documents are being stored in
HBase, why would expediency be required?

On Mon, Feb 14, 2011 at 11:18 AM, Ted Dunning <tdunn...@maprtech.com> wrote:
> The analysis can be very slow if you are doing Tika things and named entity
> extraction and PDF interpretation and so on.
>
> On Mon, Feb 14, 2011 at 11:09 AM, Jason Rutherglen <
> jason.rutherg...@gmail.com> wrote:
>
>> In older versions of Lucene, NRT indexing is slow; the newer version
>> with RT will be as fast as Lucene's batch indexing is today, which I'm
>> guessing will be fast enough for many/most users? Eg, it's simply
>> analyzing and throwing the data into a RAM buffer (there's no IO or
>> segment merging happening).
>>
>> On Mon, Feb 14, 2011 at 10:57 AM, Ted Dunning <tdunn...@maprtech.com>
>> wrote:
>> > I would find that unacceptable for many systems I have worked on.  Lucene
>> > update-behind would be fine, but waiting on the insert until all of the
>> > Lucene stuff happened would not be acceptable.
>> >
>> > I would much rather that Lucene update from the write log in batches that
>> > are as big as needed to catch/keep up.
>> >
>> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
>> > jason.rutherg...@gmail.com> wrote:
>> >
>> >> > Yes, that should work. But doesn't it assume that the index is updated
>> >> > synchronously with the HBase row? I can imagine this will sometimes be
>> >> > an issue, e.g. if it would involve performing expensive content
>> >> > extraction (tika) or analysis.
>> >>
>> >> I don't understand here.  You mean that the delay in indexing a
>> >> document will adversely affect the HBase row insert because it's all
>> >> in the same transaction?  I think that's fine, eg, it's just how the
>> >> system'd work?
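
For context, here is a minimal sketch of the RAM-buffer/NRT pattern Jason is describing, assuming a current Lucene API (the thread predates the RT work landing); the `rowKey`/`content` field names and the index path are just placeholders:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class NrtIndexSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/nrt-index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        // Indexing a document only analyzes it and buffers it in RAM;
        // there is no commit, no fsync, and no segment merging on this call path.
        Document doc = new Document();
        doc.add(new StringField("rowKey", "row-0001", Field.Store.YES));
        doc.add(new TextField("content", "extracted text for row-0001", Field.Store.NO));
        writer.updateDocument(new Term("rowKey", "row-0001"), doc);

        // A near-real-time reader sees the buffered document without a commit.
        DirectoryReader reader = DirectoryReader.open(writer);
        System.out.println("visible docs: " + reader.numDocs());

        reader.close();
        writer.close();
    }
}
```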
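
And a rough sketch of the update-behind approach Ted is advocating, where the HBase put only enqueues work and a background thread indexes in whatever batch size is needed to keep up. `RowUpdate`, `enqueue`, and `toDocument` are hypothetical stand-ins for the real write-log/row plumbing, not anything HBase or Lucene provides:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class UpdateBehindIndexer implements Runnable {
    // Hypothetical carrier for one row mutation pulled off the write path.
    public static final class RowUpdate {
        final String rowKey;
        final byte[] value;
        RowUpdate(String rowKey, byte[] value) { this.rowKey = rowKey; this.value = value; }
    }

    private final BlockingQueue<RowUpdate> pending = new LinkedBlockingQueue<>();
    private final IndexWriter writer;

    public UpdateBehindIndexer(IndexWriter writer) { this.writer = writer; }

    // Called on the write path: cheap, never blocks on Tika/analysis/Lucene.
    public void enqueue(RowUpdate update) { pending.add(update); }

    @Override
    public void run() {
        List<RowUpdate> batch = new ArrayList<>();
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.clear();
                batch.add(pending.take());      // block until there is work
                pending.drainTo(batch, 9_999);  // then grab whatever has piled up;
                                                // batches grow when indexing lags
                for (RowUpdate u : batch) {
                    // Expensive extraction/analysis happens here, off the insert path.
                    writer.updateDocument(new Term("rowKey", u.rowKey), toDocument(u));
                }
                writer.commit();                // or just refresh an NRT reader
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // Hypothetical: build a Lucene document from the row (Tika, NER, etc.).
    private Document toDocument(RowUpdate u) {
        Document doc = new Document();
        // ... field population elided ...
        return doc;
    }
}
```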