How to index the parsed content effectively

Sergey Beryozkin Wed, 02 Jul 2014 05:28:24 -0700

Hi All,

We've been experimenting with indexing the parsed content in Lucene and
our initial attempt was to index the output from
ToTextContentHandler.toString() as a Lucene Text field.


This is unlikely to be effective for large files, since the whole
extracted text ends up in memory as a single string. So I wonder what
strategies exist for more effective indexing/tokenization of
possibly large content.

Perhaps a custom ContentHandler can index content fragments in a unique
Lucene field every time its characters(...) method is called, something
I've been planning to experiment with.
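
To make that idea concrete, here is a minimal sketch, not a tested
solution. It uses only the JDK's SAX classes. The class name
ChunkingContentHandler, the chunkSize parameter and the Consumer<String>
sink are all hypothetical names made up for illustration. The sink stands
in for whatever would add a fragment to the index, for example adding one
more field instance to a Lucene Document. Fragments are cut at whitespace
where possible so that a word is not split across two fields.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers SAX character events and hands the text
// to a sink in bounded fragments instead of building one large string.
class ChunkingContentHandler extends DefaultHandler {
    private final int chunkSize;              // assumed upper bound per fragment
    private final Consumer<String> sink;      // stand-in for the indexing step
    private final StringBuilder buffer = new StringBuilder();

    ChunkingContentHandler(int chunkSize, Consumer<String> sink) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);
        while (buffer.length() >= chunkSize) {
            // Prefer to cut just after the last whitespace within the limit;
            // fall back to a hard cut if the fragment has no whitespace.
            int cut = chunkSize;
            for (int i = chunkSize - 1; i > 0; i--) {
                if (Character.isWhitespace(buffer.charAt(i))) {
                    cut = i + 1;
                    break;
                }
            }
            sink.accept(buffer.substring(0, cut));
            buffer.delete(0, cut);
        }
    }

    @Override
    public void endDocument() {
        // Flush whatever is left once the parser reports the end of input.
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> fragments = new ArrayList<>();
        ChunkingContentHandler handler = new ChunkingContentHandler(8, fragments::add);
        char[] text = "hello world foo".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endDocument();
        for (String fragment : fragments) {
            System.out.println("[" + fragment + "]");
        }
    }
}
```

In real use the handler would be passed to the parser in place of
ToTextContentHandler, and the sink would do the indexing. Whether many
small field values actually index better than one large one is exactly
the open question here.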

Any feedback will be appreciated.
Cheers, Sergey
