On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:
Hello everyone,
We're working to replace the old Linux version of dtSearch with
Lucene/Solr, using HTTP requests from our Perl side and Java for
the indexing.
The functionality that is causing the most problems is highlighting,
since we're not storing the text in Solr (only indexing it) and we
need to highlight an image file (OCR). What we really need is to
request from Solr the word indexes of the matches; we then tie these
up to the OCR image and create HTML boxes to do the highlighting.
This isn't possible with Solr out-of-the-box. Also, the usual
methods for highlighting won't work because Solr typically
re-analyzes the raw text to find the appropriate highlighting points.
However, it shouldn't be too hard to come up with a custom solution.
You can tell Lucene to store token offsets using TermVectors
(configurable via schema.xml). Then you can customize the request
handler to return the token offsets (and/or positions) by retrieving
the term vectors.
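For example, term vector storage is a per-field setting in schema.xml.
A minimal sketch, assuming a field named ocr_text and a "text" field
type (both names are hypothetical; the termVector* attributes are the
part that matters):

<!-- hypothetical field; stores offsets/positions without storing the text -->
<field name="ocr_text" type="text" indexed="true" stored="false"
       termVectors="true" termPositions="true" termOffsets="true"/>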
The text is also multi-page, with each page separated by Ctrl-L page
breaks. Should we handle the paging ourselves, or can Solr tell us
which page the match happened on too?
Again, not automatically. However, if you wrote an analyzer that
bumped up the position increment of tokens every time a new page was
found (to, say, the next multiple of 1000), then you could infer the
matching page from the token position.
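As a rough sketch of that idea (the exact TokenFilter API depends on
your Lucene version; this uses the attribute-based TokenStream API),
the filter below swallows form-feed tokens and jumps the next real
token to the next multiple of 1000. The class name, the 1000-token
page stride, and the assumption that the tokenizer emits '\f' as its
own token are all mine:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/**
 * Hypothetical filter: whenever a page-break token (form feed, Ctrl-L)
 * is seen, the next real token is pushed to the next multiple of 1000,
 * so page = token position / 1000 (zero-based).
 */
public final class PageBreakPositionFilter extends TokenFilter {
  private static final int PAGE_STRIDE = 1000;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      addAttribute(PositionIncrementAttribute.class);
  private int position = 0;          // running token position
  private boolean pageBreakSeen = false;

  public PageBreakPositionFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      // Assumes the upstream tokenizer emits '\f' as its own token;
      // swallow it and remember to jump the next token's position.
      if (termAtt.length() == 1 && termAtt.charAt(0) == '\f') {
        pageBreakSeen = true;
        continue;
      }
      int incr = posIncrAtt.getPositionIncrement();
      if (pageBreakSeen) {
        incr = ((position / PAGE_STRIDE) + 1) * PAGE_STRIDE - position;
        pageBreakSeen = false;
      }
      posIncrAtt.setPositionIncrement(incr);
      position += incr;
      return true;
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    position = 0;
    pageBreakSeen = false;
  }
}

On the query side, the page of a hit would then be (token position /
1000), and the within-page offsets still come from the stored term
vectors.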
cheers,
-Mike