Hi Tim
Thanks for sharing your thoughts. I find them very helpful.
On 02/07/14 14:32, Allison, Timothy B. wrote:
Hi Sergey,
I'd take a look at what the DataImportHandler in Solr does. If you want to
store the field, you need to create the field with a String (as opposed to a
Reader), which means you have to have the whole thing in memory. Also, if
you're proposing adding a field entry in a multivalued field for a given SAX
event, I don't think that will help, because you still have to hold the entire
document in memory before calling addDocument() if you are storing the field.
If you aren't storing the field, then you could try a Reader.
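For illustration, that difference maps directly onto Lucene's TextField
constructors; a minimal sketch, with a made-up "content" field name:

import java.io.Reader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class FieldExamples {

    // Stored field: the whole text has to be materialized as a String first.
    static Document storedContent(String wholeText) {
        Document doc = new Document();
        doc.add(new TextField("content", wholeText, Field.Store.YES));
        return doc;
    }

    // Unstored field: the text can be streamed from a Reader at indexing
    // time, but it cannot be retrieved from the index afterwards.
    static Document unstoredContent(Reader reader) {
        Document doc = new Document();
        doc.add(new TextField("content", reader));
        return doc;
    }
}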
Some thoughts:
At the least, you could create a separate Lucene document for each container
document and each of its embedded documents.
You could also break large documents into logical sections and index those
as separate documents; but that gets very use-case dependent.
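As a rough sketch of the first thought, one way to give each embedded
document its own Lucene Document is Tika's EmbeddedDocumentExtractor hook.
The class name, the "container"/"content" field names, and the containerId
linkage below are just illustrative assumptions, not a worked-out design:

import java.io.IOException;
import java.io.InputStream;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ToTextContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Every embedded document becomes its own Lucene Document, linked back to
// its container by an id field.
public class IndexingEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {

    private final IndexWriter writer;
    private final String containerId;

    public IndexingEmbeddedDocumentExtractor(IndexWriter writer, String containerId) {
        this.writer = writer;
        this.containerId = containerId;
    }

    @Override
    public boolean shouldParseEmbedded(Metadata metadata) {
        return true;
    }

    @Override
    public void parseEmbedded(InputStream stream, ContentHandler handler,
                              Metadata metadata, boolean outputHtml)
            throws SAXException, IOException {
        // Extract the embedded document's text and index it separately.
        ToTextContentHandler text = new ToTextContentHandler();
        try {
            new AutoDetectParser().parse(stream, text, metadata, new ParseContext());
        } catch (TikaException e) {
            throw new SAXException(e);
        }
        Document doc = new Document();
        doc.add(new StringField("container", containerId, Field.Store.YES));
        doc.add(new TextField("content", text.toString(), Field.Store.NO));
        writer.addDocument(doc);
    }
}

It would be registered on the ParseContext before parsing the container,
e.g. context.set(EmbeddedDocumentExtractor.class, extractor).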
Right. I think this is something we might investigate further. The goal
is to generalize some Tika Parser to Lucene code sequences, and perhaps
we can offer some boilerplate ContentHandler, as we don't know the
concrete/final requirements of the would-be API consumers.
What is your opinion of having a Tika Parser ContentHandler that would do
this in a minimal kind of way and store character sequences as unique
individual Lucene fields? Suppose we have a single PDF file and a content
handler reporting every line in that file. Instead of storing all of the
PDF content in a single "content" field, we'd have "content1":"line1",
"content2":"line2", etc., and then offer support for searching across all
of these contentN fields?
I guess it would be somewhat similar to your idea of having a separate
Lucene Document per logical chunk, except that in this case we'd have a
single Document with many fields covering a single PDF/etc.
Does it make any sense at all from the performance point of view, or is it
maybe not worth it?
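To make the search side of that concrete, here is a rough sketch of what
querying across the contentN fields could look like, assuming a recent
Lucene and assuming the maximum N is recorded somewhere at index time:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.Query;

public class ContentFieldsQuery {

    // The user query has to be expanded over every numbered content field.
    static Query parse(String userQuery, int maxFields) throws ParseException {
        List<String> fields = new ArrayList<>();
        for (int i = 1; i <= maxFields; i++) {
            fields.add("content" + i);
        }
        return new MultiFieldQueryParser(
                fields.toArray(new String[0]), new StandardAnalyzer())
                .parse(userQuery);
    }
}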
In practice, for many, many use cases I've come across, you can index quite large documents
with no problems, e.g. "Moby Dick" or "Dream of the Red Chamber." There may be
a hit at highlighting time for large docs depending on which highlighter you use. In the old days,
there used to be a 10k default limit on the number of tokens, but that is now long gone.
Sounds reasonable
For truly large docs (probably machine generated), yes, you could run into
problems if you need to hold the whole thing in memory.
Sure, if we get users reporting OOM or similar issues against our API,
then that would be a good starting point :-)
Thanks, Sergey
Cheers,
Tim
-----Original Message-----
From: Sergey Beryozkin [mailto:[email protected]]
Sent: Wednesday, July 02, 2014 8:27 AM
To: [email protected]
Subject: How to index the parsed content effectively
Hi All,
We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.
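For reference, a minimal sketch of that baseline, with made-up names and
AutoDetectParser standing in for whatever parser is actually used:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToTextContentHandler;

public class WholeDocumentIndexer {

    // Buffers the entire parsed text in memory, then adds it as one field.
    static void index(IndexWriter writer, String path) throws Exception {
        ToTextContentHandler handler = new ToTextContentHandler();
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            new AutoDetectParser().parse(in, handler, new Metadata());
        }
        Document doc = new Document();
        doc.add(new TextField("content", handler.toString(), Field.Store.YES));
        writer.addDocument(doc);
    }
}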
This is unlikely to be effective for large files. So I wonder what
strategies exist for a more effective indexing/tokenization of the
possibly large content.
Perhaps a custom ContentHandler could index content fragments in a unique
Lucene field every time its characters(...) method is called; this is
something I've been planning to experiment with.
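A rough sketch of that idea, assuming a plain SAX DefaultHandler and
made-up field names (in practice it might extend one of Tika's handlers
instead):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.xml.sax.helpers.DefaultHandler;

// Each characters(...) callback becomes its own numbered field on a single
// Lucene Document.
public class FragmentIndexingHandler extends DefaultHandler {

    private final Document doc = new Document();
    private int fragmentCount = 0;

    @Override
    public void characters(char[] ch, int start, int length) {
        String fragment = new String(ch, start, length).trim();
        if (!fragment.isEmpty()) {
            doc.add(new TextField("content" + (++fragmentCount),
                    fragment, Field.Store.YES));
        }
    }

    // The accumulated Document would then be handed to IndexWriter.addDocument(...).
    public Document getDocument() {
        return doc;
    }
}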
Any feedback will be appreciated.
Cheers, Sergey