Badr,

Is the problem still there in the latest 6.x release? Can you try making a
small unit test? Or modify an existing Lucene test (even in a hacky/temp
way -- it's okay). Lucene 3.x isn't going to see another release, nor will
4.x.

~ David

On Fri, Oct 14, 2016 at 2:42 AM badr <b...@convo.com> wrote:

> Background:
>
> I am using Lucene to index the text of files. In my scenario a single
> document can have multiple files in it. Since a Lucene document is flat,
> with no hierarchy, I store the text of each page of a file in a property
> named file_text. So the document structure looks like this:
>
> document ->
>   |- other properties
>   |- file_text
>   |- file_text
>   |- and so on: one file_text for each of the 'n' pages of the files
>
> I have indexed the file ID and page number with each file_text property so
> that I can find the corresponding file and page when there is a match in a
> file_text property.
>
> Solution:
>
> I have indexed the file_text property with term vectors. While searching, I
> use the term vector to find the index of the file_text that contains the
> matched term, and from that I get the file ID and page number. The solution
> works perfectly in that I get all the required information: the file with
> the match, the page on which the match occurs, and the number of word
> occurrences.
>
> Current Problem:
>
> The problem with this solution is that for large files Lucene is unable to
> create term vectors for the whole file text. For example, I have a file
> with 222 pages, and Lucene indexes term vectors for only the first 127
> pages. Matches on page 128 are never found for this file. (The end offset
> of the last term vector was 63122, but the actual last index of the file
> text is 140743.)
>
> I am wondering whether there is a limitation on term vectors in Lucene that
> I am missing at the moment.
>
> So the solution never works for big files.
>
> Workarounds:
>
> I can find the matching document with a Lucene search if I index file_text
> without term vectors and simply store the text as a whole. Once the
> matching document is found, I can use regex/String methods to find the
> number of matches, the file ID, the page number, etc.
>
> But this will be very slow, because the string operations will need to run
> on the whole file text.
>
> Looking for:
>
> Is there any way to get the index of the matching file_text field in a
> document? I know Explain can find the matching field, but in my case there
> are multiple fields with the same name in a document, so I need the index
> along with the field name. That would let me run string methods on a single
> page of text only, which would improve performance.
>
> Is there any way to make this work with term vectors?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Lucene-3-0-Problems-with-term-vectors-for-large-text-tp4301073.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
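
For reference, the string-matching workaround described in the quoted
message could look roughly like the sketch below. It uses only the Java
standard library and no Lucene API. The names PageMatcher, PageHit and
findHits are hypothetical, and it assumes the page texts of the matched
document are available as a list in the same order as the file_text values,
so that the list index can be mapped back to the file ID and page number
stored alongside them. A case-insensitive whole-word regex stands in for the
analyzer here, so the counts may differ from what Lucene's tokenization
would report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PageMatcher {

    /** One hit: the index of the page in the list and how often the term occurs on it. */
    public static final class PageHit {
        public final int pageIndex;
        public final int occurrences;

        PageHit(int pageIndex, int occurrences) {
            this.pageIndex = pageIndex;
            this.occurrences = occurrences;
        }
    }

    /** Scans each page separately and reports the pages containing the term as a whole word. */
    public static List<PageHit> findHits(List<String> pages, String term) {
        // Quote the term so regex metacharacters in it are matched literally.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        List<PageHit> hits = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            Matcher matcher = pattern.matcher(pages.get(i));
            int count = 0;
            while (matcher.find()) {
                count++;
            }
            if (count > 0) {
                hits.add(new PageHit(i, count));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up page texts, purely for illustration.
        List<String> pages = Arrays.asList(
                "lucene term vectors", "nothing here", "Lucene and lucene again");
        for (PageHit hit : findHits(pages, "lucene")) {
            System.out.println("page index " + hit.pageIndex + ": "
                    + hit.occurrences + " match(es)");
        }
    }
}
```

This still scans every page of the matched document, so it carries the cost
the quoted message is worried about; it only shows how the page index and
occurrence count fall out of a per-page scan.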