Thanks for the responses, guys.

Grant: I had a brief look at LingPipe. It looks quite interesting, but I'm concerned that the licensing may prevent me from using it in my project.

Michael: I have used the Yahoo API in the past, but due to its generic nature I wasn't entirely happy with the results in my test cases.

Yonik: This is the approach I had in mind. Will it still work if I put the SynonymFilter after the word-delimiter filter in the schema config? Ideally I want to strip out the underscore character before it gets indexed. Is that possible by using a PatternReplaceFilterFactory after the SynonymFilter?
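
To make the question concrete, here is a rough, untested sketch of the analyzer chain being described. The field type name, the tokenizer choice, the synonyms file name, and the attribute values are all assumptions for illustration, not a confirmed working configuration. Whether this ordering behaves as hoped is exactly the open question.

```xml
<!-- Hypothetical field type; the name and tokenizer are placeholders -->
<fieldType name="text_keywords" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Word-delimiter filter placed first, so it runs before any
         underscore-joined tokens are produced -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
    <!-- synonyms.txt (assumed file name) would hold mappings such as:
         Microsoft Office => microsoft_office -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <!-- Assumed settings: replace every underscore with a space -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="_"
            replacement=" " replace="all"/>
  </analyzer>
</fieldType>
```
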
Cheers,
Piete

On 21/09/2007, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> On 9/19/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
> > However, I'd like to be able to analyze documents more intelligently to
> > recognize phrase keywords such as "open source", "Microsoft Office",
> > "Bill Gates" rather than splitting each word into separate tokens (the
> > field is never used in search queries so matching is not an issue).
> > I've been looking at SynonymFilterFactory as a possible solution to this
> > problem but haven't been able to work out the specifics of how to
> > configure it for phrase mappings.
>
> SynonymFilter works out-of-the-box with multi-token synonyms...
>
> Microsoft Office => microsoft_office
> Bill Gates, William Gates => bill_gates
>
> Just don't use a word-delimiter filter if you use underscore to join
> words.
>
> -Yonik