It's good you already have the data. If you had somehow derived it from some sort of calculation, I'd have to tell my product manager that the feature he wanted, which I told him couldn't be done with our data, was possible after all <G>...

About page breaks:

Another approach to paging is to index a special page token with an
increment of 0 from the last word of the page. Say you have the
following:

last ctrl-l first

Then index last, then $$$$$$$ at an increment of 0, then first. You can
then quite quickly calculate the pages by using termdocs/termenum on
your special token and counting.

Which approach you use depends upon whether you want span and/or phrase
queries to match across page boundaries. If you use an increment as Mike
suggests, matching "last first"~3 won't work. It just depends upon how
you want to match across the page break.

Best
Erick

On Nov 30, 2007 4:37 PM, Mike Klaas <[EMAIL PROTECTED]> wrote:

> On 30-Nov-07, at 1:02 PM, Owens, Martin wrote:
>
> >
> > Hello everyone,
> >
> > We're working to replace the old Linux version of dtSearch with
> > Lucene/Solr, using the http requests for our perl side and java for
> > the indexing.
> >
> > The functionality that is causing the most problems is the
> > highlighting since we're not storing the text in solr (only
> > indexing) and we need to highlight an image file (ocr) so what we
> > really need is to request from solr the word indexes of the
> > matches, we then tie this up to the ocr image and create html boxes
> > to do the highlighting.
>
> This isn't possible with Solr out-of-the-box.  Also, the usual
> methods for highlighting won't work because Solr typically
> re-analyzes the raw text to find the appropriate highlighting points.
> However, it shouldn't be too hard to come up with a custom solution.
> You can tell lucene to store token offsets using TermVectors
> (configurable via schema.xml).  Then you can customize the request
> handler to return the token offsets (and/or positions) by retrieving
> the TVs.
>
> > The text is also multi page, each page is separated by Ctrl-L page
> > breaks, should we handle the paging ourselves or can Solr tell us
> > which page the match happened on too?
>
> Again, not automatically.  However, if you wrote an analyzer that
> bumped up the position increment of tokens every time a new page was
> found (to, say, the next multiple of 1000), then you could infer the
> matching page by the token position.
>
> cheers,
> -Mike
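
To make the two paging schemes in this thread concrete, here is a rough
sketch in plain Python. It is not Lucene or Solr code and uses no Lucene
API; it only simulates how token positions would be assigned and how a
page number could be recovered from a match position. The function names,
the whitespace tokenizer, and the assumption that every page holds fewer
than 1000 tokens are all made up for illustration.

```python
from bisect import bisect_left

PAGE_BREAK = "\f"        # Ctrl-L (form feed) separates pages
PAGE_TOKEN = "$$$$$$$"   # special marker token, as in the example above
PAGE_GAP = 1000          # assumed: no page has this many tokens


def positions_with_gap(text, gap=PAGE_GAP):
    """Gap scheme: each page starts at the next multiple of `gap`."""
    postings = []  # list of (token, position)
    for page_no, page in enumerate(text.split(PAGE_BREAK)):
        for offset, token in enumerate(page.split()):
            postings.append((token, page_no * gap + offset))
    return postings


def page_from_gap(position, gap=PAGE_GAP):
    """Zero-based page number inferred directly from a token position."""
    return position // gap


def positions_with_marker(text):
    """Marker scheme: a page token is indexed with increment 0, so it
    shares the position of the last word on the page it closes."""
    postings = []
    position = -1
    pages = text.split(PAGE_BREAK)
    for page_no, page in enumerate(pages):
        for token in page.split():
            position += 1
            postings.append((token, position))
        if page_no < len(pages) - 1:
            postings.append((PAGE_TOKEN, position))  # increment of 0
    return postings


def page_from_markers(position, marker_positions):
    """Zero-based page number: count markers strictly before `position`.
    `marker_positions` must be sorted in ascending order."""
    return bisect_left(marker_positions, position)


if __name__ == "__main__":
    text = "alpha last\ffirst beta"
    print(positions_with_gap(text))
    marked = positions_with_marker(text)
    print(marked)
    markers = [pos for tok, pos in marked if tok == PAGE_TOKEN]
    print([(tok, page_from_markers(pos, markers)) for tok, pos in marked])
```

With the sample text, "last" and "first" end up 999 positions apart under
the gap scheme, so a sloppy phrase such as "last first"~3 cannot match
across the page break. Under the marker scheme they stay adjacent, and the
page is found by counting marker positions below the match position.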