Re: Lucene 3.0 Problems with term vectors for large text

David Smiley Mon, 17 Oct 2016 05:23:33 -0700

Badr,
Is the problem there in the latest 6.x release?  Can you try making a small
unit test?  Or modify an existing Lucene test (even in a hacky/temp way --
it's okay).  Lucene 3.x isn't going to see another release, nor will 4.x.
~ David


On Fri, Oct 14, 2016 at 2:42 AM badr <b...@convo.com> wrote:

> Background :
>
> I am using Lucene to index the text of files. In my scenario a single
> document can contain multiple files. Because a Lucene document is flat,
> with no hierarchy, I store the text of each page of a file in its own
> file_text property. The document structure looks like this:
>
> document -
>           |- other properties
>           |- file_text
>           |- file_text
>           |- so on file_text of 'n' number of pages of files
>
> I have indexed the file ID and page number with each file_text property so
> that I can find the corresponding file and page when a file_text property
> matches.
>
> Solution :
>
> I have indexed the file_text property with term vectors. While searching, I
> use the term vectors to find the index of the file_text value that contains
> the matched term, and from that I get the file ID and page number. This
> works perfectly: I get all the required information, namely the file that
> matched, the page on which the match occurs, and the number of word
> occurrences.
>
> Current Problem :
>
> The problem with this solution is that for large files Lucene is unable to
> create term vectors for the whole file text. For example, I have a file
> with 222 pages, and Lucene indexes term vectors for only the first 127
> pages. Matches on page 128 are never found for this file. (The end offset
> of the last term vector was 63122, but the actual last index of the file
> text is 140743.)
>
> I am wondering whether Lucene has a limit on term vectors that I am
> missing at the moment.
>
> So the solution never works for big files.
>
> Workarounds :
>
> I can find the matching document with a Lucene search by indexing
> file_text without term vectors and simply storing the text as a whole. Once
> the matching document is found, I can use regex/String methods to find
> the number of matches, the file ID, the page number, etc.
>
> But this will be very slow, because the string operations will need to run
> on the whole file text.
>
> Looking for :
>
> Is there any way to get the index of the matching file_text field in a
> document? I know Explain can find the matching field, but in my case there
> are multiple fields with the same name in each document, so I need the
> index along with the field name. That would let me run the string methods
> on only a single page of text, which would improve performance.
>
> Is there any way to make it work with term vectors?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Lucene-3-0-Problems-with-term-vectors-for-large-text-tp4301073.html
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
>
> --
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com
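
The per-page variant of the workaround described in the quoted message could be sketched roughly as follows: once a search has found the matching document, each stored page text is scanned on its own, so the file ID, page number, and occurrence count come out per page. This is plain Java with no Lucene calls. The names PageMatchSketch, Page, PageMatch, countOccurrences, and findMatches are hypothetical, and whole-word, case-insensitive matching is an assumption that only approximates what an analyzer would produce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Lucene): scans stored page texts for a term. */
final class PageMatchSketch {

    /** One stored file_text value plus the file ID and page number kept with it. */
    static final class Page {
        public final String fileId;
        public final int pageNo;
        public final String text;

        public Page(String fileId, int pageNo, String text) {
            this.fileId = fileId;
            this.pageNo = pageNo;
            this.text = text;
        }
    }

    /** A page that contains the term at least once, with its occurrence count. */
    static final class PageMatch {
        public final String fileId;
        public final int pageNo;
        public final int count;

        public PageMatch(String fileId, int pageNo, int count) {
            this.fileId = fileId;
            this.pageNo = pageNo;
            this.count = count;
        }
    }

    /** Counts whole-word, case-insensitive occurrences of term in text (assumed matching rule). */
    static int countOccurrences(String text, String term) {
        Pattern pattern = Pattern.compile(
                "\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /** Scans every page separately and returns only the pages where the term occurs. */
    static List<PageMatch> findMatches(List<Page> pages, String term) {
        List<PageMatch> matches = new ArrayList<>();
        for (Page page : pages) {
            int count = countOccurrences(page.text, term);
            if (count > 0) {
                matches.add(new PageMatch(page.fileId, page.pageNo, count));
            }
        }
        return matches;
    }
}
```

This still scans every page of the matched document, so it does not remove the cost the original poster is worried about; it only shows how the file ID, page number, and count can be recovered without term vectors.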
