On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:


Hello everyone,

We're working to replace the old Linux version of dtSearch with Lucene/Solr, using HTTP requests on our Perl side and Java for the indexing.

The functionality that is causing the most problems is the highlighting, since we're not storing the text in Solr (only indexing it) and we need to highlight an image file (OCR). What we really need is to request from Solr the word offsets of the matches; we then tie these up to the OCR image and create HTML boxes to do the highlighting.

This isn't possible with Solr out-of-the-box. Also, the usual methods for highlighting won't work because Solr typically re-analyzes the raw text to find the appropriate highlighting points. However, it shouldn't be too hard to come up with a custom solution. You can tell Lucene to store token offsets using TermVectors (configurable via schema.xml). Then you can customize the request handler to return the token offsets (and/or positions) by retrieving the TVs.
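
For example, something along these lines (a rough, untested sketch against the Lucene 2.x API of the time; the class name and the overall wiring are my assumptions) would pull the stored character offsets for a term out of a document's term vector, which you could then map onto your OCR coordinates:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermPositionVector;
    import org.apache.lucene.index.TermVectorOffsetInfo;

    public class OffsetDump {
        // Print the start/end character offsets of every occurrence of
        // `term` in `field` for the given document. Requires the field to
        // be indexed with termVectors="true" termPositions="true"
        // termOffsets="true" in schema.xml.
        public static void printOffsets(IndexReader reader, int docId,
                                        String field, String term)
                throws IOException {
            TermPositionVector tpv =
                (TermPositionVector) reader.getTermFreqVector(docId, field);
            if (tpv == null) return;             // no term vector stored
            int idx = tpv.indexOf(term);
            if (idx < 0) return;                 // term not in this document
            TermVectorOffsetInfo[] offsets = tpv.getOffsets(idx);
            if (offsets == null) return;         // offsets not recorded
            for (TermVectorOffsetInfo info : offsets) {
                System.out.println(term + ": " + info.getStartOffset()
                                   + "-" + info.getEndOffset());
            }
        }
    }

A custom request handler could do that lookup for the matched terms of each hit and return the offsets in the response, and your Perl side would translate those offsets into boxes over the OCR image.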

The text is also multi-page; each page is separated by a Ctrl-L page break. Should we handle the paging ourselves, or can Solr tell us which page the match happened on too?

Again, not automatically. However, if you wrote an analyzer that bumped up the position increment of tokens every time a new page was found (to, say, the next multiple of 1000), then you could infer the matching page from the token position.
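
A rough sketch of such a filter (again against the Lucene 2.x TokenStream API; it assumes the upstream tokenizer emits a literal "\f" token for each Ctrl-L rather than discarding it, which you'd have to arrange yourself):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class PageBreakPositionFilter extends TokenFilter {
        private int position = 0;            // running token position
        private boolean pageBoundary = false;

        public PageBreakPositionFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            Token t = input.next();
            if (t == null) return null;
            if ("\f".equals(t.termText())) { // Ctrl-L marker from tokenizer
                pageBoundary = true;
                return next();               // swallow the marker itself
            }
            if (pageBoundary) {
                // Jump this token up to the next multiple of 1000, so
                // page number is roughly position / 1000.
                int nextPageStart = ((position / 1000) + 1) * 1000;
                t.setPositionIncrement(nextPageStart - position);
                pageBoundary = false;
            }
            position += t.getPositionIncrement();
            return t;
        }
    }

The token positions you get back alongside the offsets above would then tell you the page as well as the word.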

cheers,
-Mike
