I'm hesitant to change Tokenizers at the moment because what we have is
working so nicely - or so I thought.

What I'm looking for is case-insensitive search for words and numbers
without any of the stemming features turned on. The new requirement is
that we take punctuation out of the mix. 

Right now, when I search for "Obama", I'm not getting any hits on
"Obama." (with the trailing period).

So I'm basically looking to strip punctuation. The consequence would be
that "nation's", "nations" and "nations," would all be represented the
same way. 

Would the StandardTokenizerFactory accomplish this? 
Does it have any language specific functionality? 
Does it do anything with stemming?

Thanks for everyone's input!

-Dave



-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Friday, January 15, 2010 12:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Stripping Punctuation in a fieldType

> I'm trying to find the best way to set up a fieldType that
> strips punctuation. 

Use solr.StandardTokenizerFactory, which strips punctuation.

Or, if you do not care about alphanumeric or numeric queries, use
solr.LowerCaseTokenizerFactory, which is based on LetterTokenizer and
so keeps only letters.
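
As a rough sketch only, a fieldType along these lines should give
case-insensitive matching with no stemming. The name "text_nostem" is
made up for illustration. It is worth checking the output on the
analysis page of the admin UI, because StandardTokenizer can keep some
punctuation inside a token (an apostrophe in the middle of a word, for
instance).

```xml
<!-- Sketch only: "text_nostem" is a hypothetical name -->
<fieldType name="text_nostem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on whitespace and most punctuation, so "Obama." becomes "Obama" -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercases tokens for case-insensitive search; no stemming filter is configured -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```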

> I think the right way to do this is using a CharacterFilter of some
> type, but I can't seem to find any examples of how to set this up in
> a schema.xml file.

If you want to use solr.MappingCharFilterFactory, you need to write all
punctuation characters to a text file manually, e.g. "," => ""
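
A minimal sketch of how that could look in schema.xml, assuming a
mapping file called "mapping-punctuation.txt" in the conf directory.
Both that file name and the fieldType name "text_nopunct" are
hypothetical, and the tokenizer choice is only one possibility.

```xml
<!-- Sketch only: "text_nopunct" and "mapping-punctuation.txt" are hypothetical names -->
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Deletes the mapped characters before tokenizing. The mapping file
         holds one rule per line, for example:
           "," => ""
           "." => ""
           "'" => ""
    -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-punctuation.txt"/>
    <!-- Splits on whitespace only; punctuation is already gone at this point -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Lowercases tokens for case-insensitive search -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With rules like these, "nations," would be indexed as "nations". Note
that deleting "." this way would also turn a number such as 3.14 into
314, so the list of characters needs some care.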