Re: Stripping Punctuation in a fieldType

Erick Erickson Fri, 15 Jan 2010 11:32:02 -0800

Ah, ok, your approach makes sense. Mostly I was trying
to ensure that you weren't flying blind.


Perhaps you would find some joy with
PatternReplaceCharFilterFactory, replacing
all non-alphanum with empty string?
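
A minimal, untested sketch of what that could look like in schema.xml.
Assumptions: the fieldType name "text_nopunct" is made up, the tokenizer
and lowercase filter are guesses at your existing chain, and you should
check that your Solr build actually includes this char filter. The pattern
deliberately leaves whitespace alone (and combining marks, for scripts
such as Arabic); stripping literally every non-alphanumeric character
would also remove the spaces the tokenizer splits on.

```xml
<!-- Hypothetical fieldType; the name and the rest of the chain are assumptions. -->
<fieldType name="text_nopunct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Delete every character that is not a letter, combining mark,
         digit, or whitespace, before the tokenizer sees the text. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^\p{L}\p{M}\p{N}\s]"
                replacement=""/>
    <!-- Split on whitespace only; no stemming is applied anywhere. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Case-insensitive matching. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "nation's", "nations" and "nations," should all
end up as the single token "nations", and "Obama." as "obama", at both
index and query time.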

HTH
Erick

On Fri, Jan 15, 2010 at 2:07 PM, David Seltzer <dselt...@tveyes.com> wrote:

> Hi Erick,
>
> Thanks for your thoughtful reply!
>
> > It's actually quite rare for simple tokenizers like these to be satisfactory
> > unless it's a field you can guarantee is indexed/searched in a very
> > controlled manner, say part numbers or words from a list. In your
> > example above, none of the three variants would get a hit if the
> > user searched for "nation". Is that what you want?
>
> Yes, this is what I want. The reason for this behavior is that the
> output of Solr needs to closely match the search results provided by a
> different legacy system. Our users have rigidly defined queries. A user
> who was interested in "nation's" is required either to search for
> "nations" or "nation*".
>
> > But no, Standard* don't have any stemming built in. And
> > what do you mean by "language specific functionality"?
> > They do NOT fold accents for instance if that's what
> > you're getting at.
>
> I asked that because I'm not entirely sure I know what's going on
> under the hood inside these tokenizers. Do they work the same on
> right-to-left languages (such as Arabic) as they do on left-to-right
> languages? (My assumption regarding the WhitespaceTokenizer is that it
> would be very language/direction neutral.)
>
> > Could you explain a bit about *why* you want this behavior?
> In short we have to support multiple languages and match the behavior of
> an existing non-Solr system.
>
> -Dave
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Friday, January 15, 2010 1:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Stripping Punctuation in a fieldType
>
> If you haven't seen it, this page is invaluable for this kind of
> question:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LetterTokenizerFactory
>
> LetterTokenizerFactory may well be your friend here, followed by
> LowerCaseFilterFactory. One problem is that it would
> split "nation's" into "nation" and "s", so searching on "nations"
> wouldn't get a hit.
>
> But you have equally ugly stuff with WhitespaceTokenizerFactory,
> as you're finding out.
>
> It's actually quite rare for simple tokenizers like these to be satisfactory
> unless it's a field you can guarantee is indexed/searched in a very
> controlled manner, say part numbers or words from a list. In your
> example above, none of the three variants would get a hit if the
> user searched for "nation". Is that what you want?
>
> But no, Standard* don't have any stemming built in. And
> what do you mean by "language specific functionality"?
> They do NOT fold accents for instance if that's what
> you're getting at.
>
> Could you explain a bit about *why* you want this behavior?
>
> HTH
> Erick
>
> On Fri, Jan 15, 2010 at 1:17 PM, David Seltzer <dselt...@tveyes.com> wrote:
>
> > I'm hesitant to change Tokenizers at the moment because what we have is
> > working so nicely - or so I thought.
> >
> > What I'm looking for is case-insensitive search for words and numbers
> > without any of the stemming features turned on. The new requirement is
> > that we take punctuation out of the mix.
> >
> > Right now when I search for "Obama" I'm not getting any hits on "Obama."
> >
> > So I'm basically looking to strip punctuation. The consequence would be
> > that "nation's", "nations" and "nations," would all be represented the
> > same way.
> >
> > Would the StandardTokenizerFactory accomplish this?
> > Does it have any language specific functionality?
> > Does it do anything with stemming?
> >
> > Thanks for everyone's input!
> >
> > -Dave
> >
> >
> >
> > -----Original Message-----
> > From: Ahmet Arslan [mailto:iori...@yahoo.com]
> > Sent: Friday, January 15, 2010 12:42 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Stripping Punctuation in a fieldType
> >
> > > I'm trying to find the best way to set up a fieldType that
> > > strips punctuation.
> >
> > Use solr.StandardTokenizerFactory, which strips punctuation.
> >
> > Or if you do not care about alphanumeric or numeric queries use
> > solr.LowerCaseTokenizerFactory that uses LetterTokenizer.
> >
> > > I think the right way to do this is using a CharacterFilter
> > > of some type, but I can't seem to find any examples of how to set this
> > > up in a schema.xml file.
> >
> > If you want to use solr.MappingCharFilterFactory you need to write all
> > punctuation characters to a text file manually, e.g. "," => ""
> >
> >
> >
> >
>
