On Dec 26, 2012, at 11:00 , Aleksandar Radovanovic <[email protected]> wrote:
> However, if I, for example, search for chemistry related phrase: OF(+)
> search returns no result. On the other hand, the quoted phrase: "OF(+)"
> returns every single document containing the preposition "of". The
> highlighter clearly shows that "OF(+)" was still not found as the
> "(+)" part was not highlighted.
>
> Is there an easy solution, or must I analyze the user's input and decide
> what to use: IndexSearcher for non quoted queries and
> TermQuery/PhraseQuery for quoted, or must I create some special regex
> rules for words containing non-letters? There are many of these in
> biomedical field.

You can use the RegexTokenizer to define how your documents are split
into tokens:

    http://lucy.apache.org/docs/perl/Lucy/Analysis/RegexTokenizer.html

To handle the use case described above, you could for example add parens
and the plus sign to the list of word characters. So your pattern would
look something like '[\w()+]+'. But this would match parens everywhere,
which is probably not what you want.

Another approach is to split on parens and create tokens for sequences
of plus signs, resulting in a pattern like '\w+|\++'.

Nick
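To make the difference between the two patterns concrete, here is a rough sketch that applies them to a sample sentence. It uses Python's `re` module rather than Lucy itself, on the assumption that these simple patterns behave the same way under Perl's regex engine; the sample text and the `tokens` helper are made up for illustration.

```python
import re

# Pattern 1: parens and plus signs count as word characters.
WIDE = re.compile(r"[\w()+]+")

# Pattern 2: a token is either a run of word characters or a run of
# plus signs; parens are not matched, so they act as separators.
SPLIT = re.compile(r"\w+|\++")

# Hypothetical sample text, not taken from the original thread.
SAMPLE = "binding of OF(+) (see Fig. 2)"


def tokens(pattern, text):
    """Return every non-overlapping match, roughly as a regex tokenizer would."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(tokens(WIDE, SAMPLE))
    # ['binding', 'of', 'OF(+)', '(see', 'Fig', '2)']
    print(tokens(SPLIT, SAMPLE))
    # ['binding', 'of', 'OF', '+', 'see', 'Fig', '2']
```

With the first pattern "OF(+)" survives as one token, but ordinary parenthesised text yields tokens like "(see" and "2)". With the second, "OF(+)" becomes the two tokens "OF" and "+", so a phrase search has a "+" token to match on and plain prose is left clean.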
