Re: Solr Highlighting, word index

Erick Erickson Fri, 30 Nov 2007 18:24:32 -0800

It's good you already have the data because if you somehow got it from
some sort of calculations I'd have to tell my product manager that
the feature he wanted that I told him couldn't be done with our data
was possible after all <G>...


About page breaks:

Another approach to paging is to index a special page token with a
position increment of 0 relative to the last word of the page. Say you have
the following: last ctrl-l first. Then index last, then $$$$$$$ at an
increment of 0 (so it shares the position of last), then first.

You can then quite quickly calculate the page of a match by using
termdocs/termenum on your special token and counting its occurrences.
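
A minimal sketch of the bookkeeping, in plain Python rather than Lucene
calls. The function names (index_positions, page_of) and the in-memory
postings dictionary are assumptions made for illustration; only the
$$$$$$$ marker and the Ctrl-L page break come from the example above.

```python
from bisect import bisect_left

PAGE_TOKEN = "$$$$$$$"  # special page marker from the example above
FORM_FEED = "\f"        # Ctrl-L page break


def index_positions(text):
    """Map each token to its list of positions (hypothetical in-memory postings).

    Words get consecutive positions. At every page break the marker is
    recorded at the position of the last word seen, i.e. increment 0.
    """
    postings = {}
    position = -1
    pages = text.split(FORM_FEED)
    for page_no, page in enumerate(pages):
        for word in page.split():
            position += 1
            postings.setdefault(word, []).append(position)
        if page_no < len(pages) - 1:
            # increment of 0: the marker shares the last word's position
            postings.setdefault(PAGE_TOKEN, []).append(position)
    return postings


def page_of(position, marker_positions):
    """Return the 1-based page: one plus the markers strictly before position."""
    return bisect_left(marker_positions, position) + 1


postings = index_positions("alpha last\ffirst beta\fgamma")
markers = postings[PAGE_TOKEN]             # [1, 3]
print(page_of(postings["last"][0], markers))   # 1
print(page_of(postings["first"][0], markers))  # 2
print(page_of(postings["gamma"][0], markers))  # 3
```

Because the marker sits at the same position as the last word of its page,
only markers strictly before a match position count as completed pages.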

Which approach you use depends upon whether you want span and/or
phrase queries to match across page boundaries. If you use a large
increment as Mike suggests, "last first"~3 won't match across the page
break. With an increment of 0 for the page token, it will.
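
To make the difference concrete, here is a rough sketch in plain Python
that assigns positions both ways and checks the gap between last and
first. The helper names are hypothetical, the slop test is a simplified
in-order check rather than Lucene's actual sloppy-phrase scoring, and the
gap variant assumes every page has fewer than 1000 words.

```python
PAGE_SIZE = 1000  # Mike's "next multiple of 1000"; assumes pages < 1000 words


def zero_increment_positions(text):
    """Word positions when the page marker is indexed at increment 0.

    The marker adds no gap, so words are numbered consecutively across pages.
    """
    words = text.replace("\f", " ").split()
    return list(zip(words, range(len(words))))


def gap_positions(text):
    """Word positions when each new page starts at a multiple of PAGE_SIZE."""
    positions = []
    for page_no, page in enumerate(text.split("\f")):
        for offset, word in enumerate(page.split()):
            positions.append((word, page_no * PAGE_SIZE + offset))
    return positions


def in_order_slop_match(positions, first_term, second_term, slop):
    """Simplified check: second_term follows first_term with at most
    `slop` intervening positions."""
    firsts = [p for w, p in positions if w == first_term]
    seconds = [p for w, p in positions if w == second_term]
    return any(0 <= b - a - 1 <= slop for a in firsts for b in seconds)


text = "alpha last\ffirst beta"
print(in_order_slop_match(zero_increment_positions(text), "last", "first", 3))  # True
print(in_order_slop_match(gap_positions(text), "last", "first", 3))             # False
```

With the gap scheme, first lands at position 1000 while last is at 1, so no
small slop can bridge the page break.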

Best
Erick

On Nov 30, 2007 4:37 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:

> On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:
>
> >
> > Hello everyone,
> >
> > We're working to replace the old Linux version of dtSearch with
> > Lucene/Solr, using the http requests for our perl side and java for
> > the indexing.
> >
> > The functionality that is causing the most problems is the
> > highlighting. We're not storing the text in Solr (only indexing
> > it), and we need to highlight an image file (OCR). What we really
> > need is to request from Solr the word indexes of the matches; we
> > then tie these to the OCR image and create HTML boxes to do the
> > highlighting.
>
> This isn't possible with Solr out-of-the-box.  Also, the usual
> methods for highlighting won't work because Solr typically re-
> analyzes the raw text to find the appropriate highlighting points.
> However, it shouldn't be too hard to come up with a custom solution.
> You can tell lucene to store token offsets using TermVectors
> (configurable via schema.xml).  Then you can customize the request
> handler to return the token offsets (and/or positions) by retrieving
> the TVs.
>
> > The text is also multi-page; each page is separated by Ctrl-L page
> > breaks. Should we handle the paging ourselves, or can Solr tell us
> > which page the match happened on too?
>
> Again, not automatically.  However, if you wrote an analyzer that
> bumped up the position increment of tokens every time a new page was
> found (to, say, the next multiple of 1000), then you could infer the
> matching page from the token position.
>
> cheers,
> -Mike
>
