I'm hesitant to change Tokenizers at the moment because what we have is working so nicely - or so I thought.
What I'm looking for is case-insensitive search for words and numbers, with none of the stemming features turned on. The new requirement is that we take punctuation out of the mix. Right now, when I search for "Obama", I'm not getting any hits on "Obama." (with the trailing period). So I'm basically looking to strip punctuation. The consequence would be that "nation's", "nations" and "nations," would all be represented the same way.

Would the StandardTokenizerFactory accomplish this? Does it have any language-specific functionality? Does it do anything with stemming?

Thanks for everyone's input!

-Dave

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Friday, January 15, 2010 12:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Stripping Punctuation in a fieldType

> I'm trying to find the best way to set up a fieldType that
> strips punctuation.

Use solr.StandardTokenizerFactory, which strips punctuation. Or, if you do not care about alphanumeric or numeric queries, use solr.LowerCaseTokenizerFactory, which uses LetterTokenizer.

> I think the right way to do this is using a CharacterFilter
> of some type, but I can't seem to find any examples of how
> to set this up in a schema.xml file.

If you want to use solr.MappingCharFilterFactory, you need to write all punctuation characters to a text file manually, e.g.

"," => ""
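
For reference, the char-filter approach described above might be wired into schema.xml roughly as follows. This is only a sketch: the fieldType name text_nopunct and the file name mapping-punctuation.txt are made up for illustration, and the mapping file has to be created by hand in the conf directory with one rule per punctuation character. Tokenizing on whitespace after the mapping step is an assumption chosen so that numbers survive and no stemming is applied.

```xml
<!-- Hypothetical fieldType: name and mapping file are illustrative only. -->
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs before tokenizing; deletes every character listed in the
         mapping file. Example lines for mapping-punctuation.txt:
           "," => ""
           "." => ""
           "'" => ""
    -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-punctuation.txt"/>
    <!-- Split on whitespace only, so words and numbers are both kept. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case-insensitive matching; no stemming filter is configured. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With rules like these, "nation's", "nations" and "nations," should all index as "nations", and "Obama." as "obama". Swapping the charFilter and tokenizer for a single solr.StandardTokenizerFactory is the other option mentioned above, but how it treats apostrophes depends on the Solr version, so it is worth checking the result on the analysis page of the admin UI.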