In short: use stemming.

Try the SnowballPorterFilterFactory with language="German2" first, and use synonyms for compound words, e.g. "Herrenhose" => "Herren", "Hose".
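
As a rough sketch, the schema.xml part could look something like the following. The type name "text_de_stem" and the file name "synonyms.txt" are only placeholders I made up, and the tokenizer and filter order are one reasonable choice rather than the only one. The field names are the ones from your mail.

```xml
<!-- Sketch for schema.xml. The type name "text_de_stem" and the file name
     "synonyms.txt" are placeholders, not taken from your setup. -->
<fieldType name="text_de_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- synonyms.txt would contain lines such as:
         herrenhose => herren, hose -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <!-- Stem after the synonym step so the compound parts are stemmed too. -->
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>

<!-- Only the fields with the singular/plural issue use the stemmed type. -->
<field name="name" type="text_de_stem" indexed="true" stored="true"/>
<field name="category" type="text_de_stem" indexed="true" stored="true"/>
<field name="producttype" type="text_de_stem" indexed="true" stored="true"/>
```

Keep in mind that after changing the analyzer of a field you have to reindex your documents, otherwise the terms in the index will not match the stemmed query terms.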

Stemming will probably give you some "interesting" results, but living with them is much better than getting no results or far fewer results ;o)

You can find more information on the Snowball stemming algorithms here:

http://snowball.tartarus.org/

Also have a look at the StopFilterFactory. Here is a sample stopword list for German:

http://snowball.tartarus.org/algorithms/german/stop.txt
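
A minimal sketch of the stop filter, assuming you saved the list above in your conf directory under the made-up name "stopwords_de.txt":

```xml
<!-- Goes inside the <analyzer> of the stemmed field type, before the
     SnowballPorterFilterFactory. "stopwords_de.txt" is a placeholder name. -->
<filter class="solr.StopFilterFactory" words="stopwords_de.txt" ignoreCase="true"/>
```

Note that the Snowball file carries comments after a "|" character. As far as I know, newer Solr versions can read that layout directly if you add format="snowball" to the filter; if your version does not support it, reduce the file to one plain word per line first.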

Good luck,

Tom


Martin Grotzke wrote:
Hello,

In our application we get different results for singular and
plural searches (German language).

E.g. for "hose" we get 1,000 documents back, but for "hosen"
we get 10,000 docs. The same applies to "t-shirt" and "t-shirts",
or e.g. "hut" and "hüte" - lots of cases :)

This is absolutely correct according to the schema.xml, as right
now we do not have any stemming or synonyms included.

Now we want similar search results for these singular/plural
searches. I'm thinking about a solution for this and would like to ask
what your experiences with this are.

Basically I see two options: stemming and the usage of synonyms. Are
there others?

My concern with stemming is that it might produce unexpected results,
so that docs are found that do not match the query from the user's point
of view. I assume that this needs a lot of testing with different data.

The issue with synonyms is that we would have to create a file
containing all synonyms, so we would have to figure out all cases, in
contrast to a solution that is based on an algorithm.
The advantage of this approach is IMHO that it is very predictable
which results will be returned for a certain query.

Some background information:
Our documents contain products (id, name, brand, category, producttype,
description, color etc). The singular/plural issue basically applies to
the fields name, category and producttype, so we would like to restrict
the solution to these fields.

Do you have any suggestions on how to handle this?

Thanks in advance for sharing your experiences,
cheers,
Martin

