*Background :*

I am using Lucene to index the text of files. In my scenario a single
document can contain multiple files. Since a Lucene document is flat,
with no hierarchy, I store the text of each page of each file in a
property named file_text. So the document structure looks like this:

document -
          |- other properties
          |- file_text
          |- file_text
          |- and so on: one file_text per page, for all 'n' pages of the files

I have indexed the file ID and page number alongside each file_text
property, so that I can find the corresponding file and page when a
file_text property matches.

*Solution :*

I have indexed the file_text property with term vectors. While searching,
I use the term vector to find the index of the file_text value that
contains the matched term, and from that I get the file ID and page
number. The solution works perfectly: I get all the required information,
namely the file with the match, the page on which the match occurs, and
the number of word occurrences.
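
To make the lookup concrete, here is a minimal sketch in plain Java (no
Lucene calls) of how a term's start offset can be mapped back to the index
of the file_text value it came from. It assumes that the character offsets
of a multi-valued field run on cumulatively from one value to the next,
with a fixed gap between values; both the cumulative behaviour and the gap
size are assumptions that should be checked against the analyzer in use.
The names PageLocator and valueIndexOf are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
final class PageLocator {
    // startOffsets[i] is the character offset at which the i-th
    // file_text value is assumed to begin.
    private final int[] startOffsets;

    // pageLengths: character length of each file_text value, in the
    //              order the values were added to the document.
    // offsetGap:   assumed gap between consecutive values (assumption;
    //              the real value depends on the analyzer).
    PageLocator(int[] pageLengths, int offsetGap) {
        startOffsets = new int[pageLengths.length];
        int next = 0;
        for (int i = 0; i < pageLengths.length; i++) {
            startOffsets[i] = next;
            next += pageLengths[i] + offsetGap;
        }
    }

    // Returns the index of the file_text value whose range contains
    // the given term start offset.
    int valueIndexOf(int termStartOffset) {
        if (startOffsets.length == 0 || termStartOffset < 0) {
            throw new IllegalArgumentException("no value for offset " + termStartOffset);
        }
        int pos = Arrays.binarySearch(startOffsets, termStartOffset);
        // On a miss, binarySearch returns -(insertionPoint) - 1; the value
        // we want is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }
}
```

The returned index can then be used to read the file ID and page number
that were stored in parallel with the file_text values.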

*Current Problem :*

The problem is that for large files Lucene is unable to create term
vectors for the whole file text. For example, I have a file with 222
pages, and Lucene indexes term vectors for only the first 127 pages;
matches from page 128 onward are never found for this file. (The end
offset of the last term vector entry was 63122, but the actual last index
of the file text is 140743.)

*I am wondering whether Lucene has some limit on term vectors that I am
missing at the moment.*

So the solution does not work for big files.

*Workarounds :*

I can find the matching document with a Lucene search by indexing
file_text without term vectors and simply storing the text as a whole.
Once the matching document is found, I can use regex/String methods to
find the number of matches, the file ID, the page number, etc.

But this will be very slow, because the string operations have to run
over the whole file text.
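
For reference, the workaround amounts to something like the sketch below.
It is plain Java and only an approximation: it counts case-insensitive
substring hits, which will not always agree with what the analyzer treats
as a matching term. PageMatchCounter and its methods are hypothetical
names, and the pages array stands for the stored file_text values.

```java
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class PageMatchCounter {
    // Counts non-overlapping, case-insensitive occurrences of term
    // in a single page of text.
    static int countOccurrences(String pageText, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = pageText.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length();
        }
        return count;
    }

    // Scans every page; result[i] is the number of hits on page i.
    // This is the expensive part, since every page is read in full.
    static int[] countPerPage(String[] pages, String term) {
        int[] counts = new int[pages.length];
        for (int i = 0; i < pages.length; i++) {
            counts[i] = countOccurrences(pages[i], term);
        }
        return counts;
    }
}
```

If the index of the matching file_text value were known up front, only
countOccurrences on that one page would be needed.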

*Looking for :*

Is there any way to get the index of the matching file_text field within
a document? I know Explain can identify the matching field, but in my
case a document has multiple fields with the same name, so I need the
index along with the field name. That would let me run the string methods
on a single page of text only, which would improve performance.

Is there any way to make this work with term vectors?
