On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:
Hello everyone,
We're working to replace the old Linux version of dtSearch with
Lucene/Solr, using HTTP requests from our Perl side and Java for
the indexing.
The functionality that is causing the most problems is highlighting,
since we're not storing the text in Solr (only indexing it) and we
need to highlight an image file (OCR). What we really need is to
request from Solr the word indexes of the matches; we then tie these
up to the OCR image and create HTML boxes to do the highlighting.
This isn't possible with Solr out-of-the-box. Also, the usual
methods for highlighting won't work because Solr typically
re-analyzes the raw text to find the appropriate highlighting points.
However, it shouldn't be too hard to come up with a custom solution.
You can tell Lucene to store token offsets using TermVectors
(configurable via schema.xml). Then you can customize the request
handler to return the token offsets (and/or positions) by retrieving
the term vectors.
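For example, term vector storage is a per-field setting in schema.xml.
A minimal sketch, assuming a field named ocr_text and a "text" field
type (both names are hypothetical; the termVector* attributes are the
part that matters):

<!-- hypothetical field; stores offsets/positions without storing the text -->
<field name="ocr_text" type="text" indexed="true" stored="false"
       termVectors="true" termPositions="true" termOffsets="true"/>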
The text is also multi-page, with each page separated by Ctrl-L page
breaks. Should we handle the paging ourselves, or can Solr tell us
which page the match happened on too?
Again, not automatically. However, if you wrote an analyzer that
bumped up the position increment of tokens every time a new page was
found (to, say, the next multiple of 1000), then you could infer the
matching page from the token position.
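As a rough sketch of that idea (the exact TokenFilter API depends on
your Lucene version; this uses the attribute-based TokenStream API),
the filter below swallows form-feed tokens and jumps the next real
token to the next multiple of 1000. The class name, the 1000-token
page stride, and the assumption that the tokenizer emits '\f' as its
own token are all mine:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Hypothetical filter: whenever a page-break token (form feed, Ctrl-L)
 * is seen, the next real token is pushed to the next multiple of 1000,
 * so page = token position / 1000 (zero-based).
 */
public final class PageBreakPositionFilter extends TokenFilter {
  private static final int PAGE_STRIDE = 1000;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private int position = 0;          // running token position
  private boolean pageBreakSeen = false;

  public PageBreakPositionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      // Assumes the upstream tokenizer emits '\f' as its own token;
      // swallow it and remember to jump the next token's position.
      if (termAtt.length() == 1 && termAtt.charAt(0) == '\f') {
        pageBreakSeen = true;
        continue;
      }
      int incr = posIncrAtt.getPositionIncrement();
      if (pageBreakSeen) {
        incr = ((position / PAGE_STRIDE) + 1) * PAGE_STRIDE - position;
        pageBreakSeen = false;
      }
      posIncrAtt.setPositionIncrement(incr);
      position += incr;
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    position = 0;
    pageBreakSeen = false;
  }
}

On the query side, the page of a hit would then be (token position /
1000), and the within-page offsets still come from the stored term
vectors.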
cheers,
-Mike