It's good you already have the data, because if you had somehow derived
it by calculation I'd have to tell my product manager that the feature
he wanted, which I told him couldn't be done with our data, was
possible after all <G>...

About page breaks:

Another approach to paging is to index a special page token with a
position increment of 0 relative to the last word of the page. Say you
have the following: last ctrl-l first. Then index last, then $$$$$$$ at an
increment of 0 (so it shares last's position), then first.

You can then quite quickly calculate the page by using
TermDocs/TermEnum on your special token and counting the page tokens
that come before the match.
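
Here is a rough sketch of the idea in plain Java. It is not Lucene code: it
only simulates the positions an analyzer would assign, so the class and
method names (PageTokenSketch, index, pageOf) are made up for illustration.
It assumes whitespace-separated words and a form feed (ctrl-l) between pages.

```java
import java.util.ArrayList;
import java.util.List;

class PageTokenSketch {
    // The special page token from the example above.
    static final String PAGE_TOKEN = "$$$$$$$";

    /** A term together with the position it would be indexed at. */
    static final class Posting {
        final String term;
        final int position;
        Posting(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /** Simulates indexing: words advance the position by 1, page tokens by 0. */
    static List<Posting> index(String text) {
        List<Posting> postings = new ArrayList<>();
        int position = -1;
        String[] pages = text.split("\f", -1); // ctrl-l is the form feed character
        for (int p = 0; p < pages.length; p++) {
            for (String word : pages[p].trim().split("\\s+")) {
                if (word.isEmpty()) continue;
                position += 1; // normal increment of 1
                postings.add(new Posting(word, position));
            }
            // Increment of 0: the page token shares the last word's position.
            // A page break before any word has nothing to attach to and is
            // skipped in this sketch.
            if (p < pages.length - 1 && position >= 0) {
                postings.add(new Posting(PAGE_TOKEN, position));
            }
        }
        return postings;
    }

    /** Page (1-based) of a match: 1 plus the page tokens strictly before it. */
    static int pageOf(List<Posting> postings, int matchPosition) {
        int page = 1;
        for (Posting p : postings) {
            if (p.term.equals(PAGE_TOKEN) && p.position < matchPosition) {
                page++;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        List<Posting> postings = index("last \f first");
        for (Posting p : postings) {
            System.out.println(p.position + " " + p.term);
        }
        System.out.println("page of position 1: " + pageOf(postings, 1));
    }
}
```

The comparison is strict (less than, not less than or equal) because the
page token sits at the same position as the last word of its page, and that
word still belongs to the page that is ending.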

Which approach you use depends upon whether you want span and/or
phrase queries to match across page boundaries. If you use a large
increment as Mike suggests, matching "last first"~3 won't work, because
the two words end up far more than 3 positions apart. It just depends
upon how you want to match across the page break.
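
To make the trade-off concrete, here is another plain-Java sketch (again
not Lucene code, and the names PageGapSketch, positionsWithGap, pageOf and
sloppyPairMatches are invented for this example). It assumes every page has
fewer than 1000 words, so that page p simply starts at position p * 1000,
and it uses a simplified slop rule for two in-order terms: the number of
positions between them must not exceed the slop.

```java
import java.util.ArrayList;
import java.util.List;

class PageGapSketch {
    // Assumed page stride, following the "next multiple of 1000" suggestion.
    static final int PAGE_SIZE = 1000;

    /** Word positions when each page break jumps to the next multiple of PAGE_SIZE. */
    static List<Integer> positionsWithGap(String text) {
        List<Integer> positions = new ArrayList<>();
        String[] pages = text.split("\f", -1); // ctrl-l is the form feed character
        for (int p = 0; p < pages.length; p++) {
            int position = p * PAGE_SIZE; // first word of page p
            for (String word : pages[p].trim().split("\\s+")) {
                if (word.isEmpty()) continue;
                positions.add(position++);
            }
        }
        return positions;
    }

    /** With the gap scheme the page (1-based) falls straight out of the position. */
    static int pageOf(int position) {
        return position / PAGE_SIZE + 1;
    }

    /** Simplified sloppy match for two terms that must appear in order. */
    static boolean sloppyPairMatches(int firstPos, int secondPos, int slop) {
        int gap = secondPos - firstPos - 1; // positions between the two terms
        return gap >= 0 && gap <= slop;
    }

    public static void main(String[] args) {
        List<Integer> gapPositions = positionsWithGap("last \f first");
        int last = gapPositions.get(0);
        int first = gapPositions.get(1);
        System.out.println("gap scheme: last=" + last + " first=" + first
                + " page of first=" + pageOf(first)
                + " matches ~3: " + sloppyPairMatches(last, first, 3));
        // With a page token at increment 0, last and first stay adjacent (0 and 1).
        System.out.println("page token scheme matches ~3: "
                + sloppyPairMatches(0, 1, 3));
    }
}
```

So the gap scheme gives you the page for free but blocks matches across the
break, while the page token scheme lets "last first"~3 match and leaves the
page lookup as a separate counting step.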

Best
Erick

On Nov 30, 2007 4:37 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:

> On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:
>
> >
> > Hello everyone,
> >
> > We're working to replace the old Linux version of dtSearch with
> > Lucene/Solr, using the http requests for our perl side and java for
> > the indexing.
> >
> > The functionality that is causing the most problems is the
> > highlighting since we're not storing the text in solr (only
> > indexing) and we need to highlight an image file (ocr) so what we
> > really need is to request from solr the word indexes of the
> > matches, we then tie this up to the ocr image and create html boxes
> > to do the highlighting.
>
> This isn't possible with Solr out-of-the-box.  Also, the usual
> methods for highlighting won't work because Solr typically re-
> analyzes the raw text to find the appropriate highlighting points.
> However, it shouldn't be too hard to come up with a custom solution.
> You can tell lucene to store token offsets using TermVectors
> (configurable via schema.xml).  Then you can customize the request
> handler to return the token offsets (and/or positions) by retrieving
> the TVs.
>
> > The text is also multi page, each page is separated by Ctrl-L page
> > breaks, should we handle the paging ourselves or can Solr tell us
> > which page the match happened on too?
>
> Again, not automatically.  However, if you wrote an analyzer that
> bumped up the position increment of tokens every time a new page was
> found (to, say, the next multiple of 1000), then you could infer the
> matching page by the token position.
>
> cheers,
> -Mike
>
