Hi,

On Wed, Dec 16, 2009 at 3:21 PM, Ian Boston <[email protected]> wrote:
> On 16 Dec 2009, at 10:25, Jukka Zitting wrote:
>> Instead of reaching down to the underlying Lucene index, I would
>> recommend reading the original document data stored in the JCR node
>> and passing it through the Jackrabbit text extractors and the
>> configured Lucene Analyzer to get the terms stored in the index.
>
> That can be quite expensive, especially for poor quality PDFs, and some
> docx word docs. I am expecting to want to do this for between 25 and 100
> nodes at a time, aggregating the results.
You might also consider implementing a rep:fulltext() function that works like rep:excerpt() but returns the text content of the specified field as stored in the underlying index. You'd still need to pass the text through the analyzer to get the term vector, but that's quite a bit faster than extracting the text from the original binaries.

A mechanism that returns the TermPositionVector (or some string representation of it) from the index is likely more complex than returning just the stored text.

BR,

Jukka Zitting
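To illustrate the second step of that idea, here is a rough, self-contained sketch of re-analyzing stored field text into a term vector. Note the assumptions: rep:fulltext() is a hypothetical function proposed above, not an existing Jackrabbit feature, and a trivial lowercase/non-word-character tokenizer stands in for whatever Lucene Analyzer is actually configured for the workspace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the client-side step: once a rep:fulltext()-style function
// (hypothetical, see above) has returned the stored text of a field,
// the caller re-analyzes that text to build a term -> frequency map.
// A trivial lowercase/split tokenizer is used here in place of the
// configured Lucene Analyzer.
public class TermVectorSketch {

    /** Builds a term -> frequency map from the stored field text. */
    static Map<String, Integer> termVector(String storedText) {
        Map<String, Integer> terms = new LinkedHashMap<>();
        for (String token : storedText.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                terms.merge(token, 1, Integer::sum);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Text as it might come back from a rep:fulltext()-style call
        String stored = "Jackrabbit indexes text with Lucene; Lucene stores terms.";
        System.out.println(termVector(stored));
    }
}
```

Aggregating such maps over the 25 to 100 nodes Ian mentions is then a simple merge, and avoids re-running the text extractors on the original binaries.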
