*Background:* I am using Lucene to index the text of files. In my scenario a single document can contain multiple files. Because a Lucene document is flat, with no hierarchy, I store the text of each page of each file as a separate value of a file_text property, so the document structure looks like this:

    document
     |- other properties
     |- file_text
     |- file_text
     |- ... (one file_text per page, for 'n' pages of files)

I have indexed the file ID and page number with each file_text property, so that I can find the corresponding file and page when there is a match in a file_text property.

*Solution:* I index the file_text property with term vectors. While searching, I use the term vector to find the index of the file_text value that contains the matched term, and from that index I get the file ID and page number. This works well: I get all the required information, that is, the file with the match, the page on which the match occurs, and the number of word occurrences.

*Current problem:* With large files, Lucene does not create term vectors for the whole file text. For example, I have a file with 222 pages, and Lucene indexes term vectors for only the first 127 pages. Matches on page 128 and later are never found for this file (the end offset of the last term vector entry was 63122, but the actual last index of the file text is 140743). *I am wondering whether term vectors in Lucene have a limit that I am missing at the moment.* As it stands, the solution does not work for big files.

*Workaround:* I can index file_text without term vectors, simply store the text as a whole, and use a normal Lucene search to find the matching document. Once the matching document is found, I can use regex/String methods to find the number of matches, the file ID, the page number, and so on. But this will be very slow, because the string operations have to run over the whole file text.

*Looking for:* Is there any way to get the index of the matching file_text field within the document? I know that explain can find the matching field, but in my case there are multiple fields with the same name in a document, so I need the index along with the field name. That would let me run the string methods on a single page of text only, which would improve performance. Alternatively, is there any way to make this work with term vectors?
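To make the offset-to-page step concrete, here is a minimal sketch of what I mean, written with the Java standard library only (no Lucene calls). The names PageLocator, pageStarts, pageIndexForOffset and countOccurrences are hypothetical helpers made up for illustration. It assumes that the character offsets of a multi-valued field continue from one value to the next with a fixed gap between values; the gap of 1 used in main is an assumption and would have to match whatever the analyzer actually uses.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper: maps a character offset in a multi-valued field
// back to the index of the page (field value) that contains it.
final class PageLocator {

    // Start offset of each page, assuming offsets accumulate across values
    // as: previous start + previous page length + offsetGap (an assumption).
    public static int[] pageStarts(int[] pageLengths, int offsetGap) {
        int[] starts = new int[pageLengths.length];
        int next = 0;
        for (int i = 0; i < pageLengths.length; i++) {
            starts[i] = next;
            next += pageLengths[i] + offsetGap;
        }
        return starts;
    }

    // Index of the page whose start is the greatest one <= offset.
    // Returns -1 for a negative offset or when there are no pages.
    // Offsets past the end of the last page map to the last page.
    public static int pageIndexForOffset(int[] starts, int offset) {
        if (starts.length == 0 || offset < 0) {
            return -1;
        }
        int pos = Arrays.binarySearch(starts, offset);
        // Not found: binarySearch returns -(insertionPoint) - 1, and the
        // containing page is the one just before the insertion point.
        return pos >= 0 ? pos : -pos - 2;
    }

    // Case-insensitive, non-overlapping substring count on ONE page.
    // This is plain substring matching, not analyzer-aware token matching.
    public static int countOccurrences(String pageText, String term) {
        if (term.isEmpty()) {
            return 0;
        }
        String haystack = pageText.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        int count = 0;
        int from = haystack.indexOf(needle);
        while (from >= 0) {
            count++;
            from = haystack.indexOf(needle, from + needle.length());
        }
        return count;
    }

    public static void main(String[] args) {
        String[] pages = {
            "lucene indexes text",
            "term vectors store offsets",
            "lucene term lucene"
        };
        int[] lengths = new int[pages.length];
        for (int i = 0; i < pages.length; i++) {
            lengths[i] = pages[i].length();
        }
        int[] starts = pageStarts(lengths, 1); // gap of 1 is an assumption
        int page = pageIndexForOffset(starts, 47);
        System.out.println("offset 47 is on page index " + page);
        System.out.println("occurrences of 'lucene' on that page: "
                + countOccurrences(pages[page], "lucene"));
    }
}
```

With something like this, the string methods only ever touch the one page that the offset points to, which is the behaviour I am after. It does not help with the missing term vectors themselves: offsets beyond what Lucene has indexed never reach this code.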