As you like. My experience is that analyzing a document takes longer than I am willing to make the user wait on insert, so I almost always prefer write-behind indexing of some kind.
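To make the write-behind idea concrete, here is a minimal sketch in Python. It is not Lucene or HBase code: an in-memory queue stands in for the write log, and plain dicts stand in for the table and the index. The names `WriteBehindIndexer`, `analyze`, `max_batch` and `flush` are invented for illustration.

```python
import queue
import threading
import time


class WriteBehindIndexer:
    """Accept writes immediately; analyze and index them later in the background."""

    def __init__(self, analyze, max_batch=100):
        self._analyze = analyze      # hypothetical (possibly slow) analysis function
        self._max_batch = max_batch  # upper bound on entries indexed per batch
        self._log = queue.Queue()    # stand-in for the store's write log
        self.store = {}              # stand-in for the table: row key -> text
        self.index = {}              # stand-in for the index: term -> set of row keys
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def put(self, row_key, text):
        """Store the row and return without waiting for analysis."""
        self.store[row_key] = text
        self._log.put((row_key, text))

    def _run(self):
        while True:
            # Block until there is at least one pending entry.
            batch = [self._log.get()]
            # Drain whatever else has piled up, so batches grow when we fall behind.
            while len(batch) < self._max_batch:
                try:
                    batch.append(self._log.get_nowait())
                except queue.Empty:
                    break
            for row_key, text in batch:
                for term in self._analyze(text):
                    self.index.setdefault(term, set()).add(row_key)
                self._log.task_done()

    def flush(self):
        """Block until every queued write has been indexed."""
        self._log.join()


if __name__ == "__main__":
    def slow_analyze(text):
        time.sleep(0.05)             # pretend this is content extraction or NER
        return text.lower().split()

    indexer = WriteBehindIndexer(slow_analyze)
    indexer.put("row1", "Hello HBase")    # returns immediately
    indexer.put("row2", "hello Lucene")   # returns immediately
    indexer.flush()                       # wait for the index to catch up
    print(sorted(indexer.index["hello"]))
```

The point of the sketch is only that `put` never pays for analysis. The cost is that a search issued right after an insert may not see the new row until the background thread catches up.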

On Mon, Feb 14, 2011 at 11:28 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:

> > The analysis can be very slow if you are doing Tika things and named entity
> > extraction and PDF interpretation and so on.
>
> I'd consider those different/separate use cases where likely realtime
> isn't important? If large [static] documents are being stored in
> HBase why would expediency be required?
>
> On Mon, Feb 14, 2011 at 11:18 AM, Ted Dunning <tdunn...@maprtech.com> wrote:
> > The analysis can be very slow if you are doing Tika things and named entity
> > extraction and PDF interpretation and so on.
> >
> > On Mon, Feb 14, 2011 at 11:09 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> >
> >> The older versions of Lucene NRT indexing is slow, the newer version
> >> with RT will be as fast as Lucene's batch indexing is today, which I'm
> >> guessing will be fast enough for many/most users? Eg, it's simply
> >> analyzing and throwing the data into a RAM buffer (there's no IO or
> >> segment merging happening).
> >>
> >> On Mon, Feb 14, 2011 at 10:57 AM, Ted Dunning <tdunn...@maprtech.com> wrote:
> >> > I would find that unacceptable for many systems I have worked on. Lucene
> >> > update-behind would be fine, but waiting the insert until all of the Lucene
> >> > stuff happened would not be acceptable.
> >> >
> >> > I would much rather that Lucene update from the write log in batches that
> >> > are as big as needed to catch/keep up.
> >> >
> >> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
> >> >
> >> >> > Yes, that should work. But doesn't it assume that the index is updated
> >> >> > synchronously with the HBase row? I can imagine this will sometimes be an
> >> >> > issue, e.g. if it would involve performing expensive content extraction
> >> >> > (tika) or analysis.
> >> >>
> >> >> I don't understand here. You mean that the delay in indexing a
> >> >> document will adversely affect the HBase row insert because it's all
> >> >> in the same transaction? I think that fine, eg, it's just how the
> >> >> system'd work?