Hello,

In our application we get different results for singular and plural
searches (German language).

E.g. for "hose" we get 1,000 documents back, but for "hosen"
we get 10,000 docs. The same applies to "t-shirt" and "t-shirts",
or e.g. "hut" and "hüte" - lots of cases :)

This is exactly what our schema.xml specifies, as right now we do not
have any stemming or synonyms configured.

Now we want singular and plural searches to return similar results.
I'm thinking about a solution for this and would like to ask about
your experiences.

Basically I see two options: stemming and the usage of synonyms. Are
there others?

My concern with stemming is that it might produce unexpected results,
so that docs are found that do not match the query from the user's point
of view. I assume that this needs a lot of testing with different data.
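
To make the stemming option concrete, here is a rough, untested sketch
of what I have in mind. The field type name "text_de_stemmed" is made up
for illustration, and I am assuming the SnowballPorterFilterFactory with
language="German" is the stemmer to try first; other German stemmers may
behave differently.

```xml
<!-- Sketch only: "text_de_stemmed" is a hypothetical name, not part of our schema. -->
<fieldType name="text_de_stemmed" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer without a type attribute is used for both indexing
       and querying, so both sides are reduced to the same stems. -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: the Snowball German stemmer; to be verified with our data. -->
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

The analysis page of the Solr admin interface should show which stems
terms like "hose" and "hosen" actually end up with, which would be a
starting point for the testing mentioned above.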

The issue with synonyms is that we would have to create a file
containing all synonyms, so we would have to figure out all cases, in
contrast to a solution that is based on an algorithm.
The advantage of this approach is, IMHO, that it is very predictable
which results will be returned for a given query.
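
For the synonym option, a rough, untested sketch could look like the
following. The field type name "text_de_synonyms" and the file name
"synonyms_de.txt" are made up for illustration. Applying the filter at
index time only is just one possible choice here; query-time expansion
would be an alternative.

```xml
<!-- Sketch only: "text_de_synonyms" and "synonyms_de.txt" are hypothetical names. -->
<fieldType name="text_de_synonyms" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- expand="true" indexes every term of a synonym group, so a document
         containing "hose" is also indexed under "hosen". -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed content of conf/synonyms_de.txt (UTF-8), one group per line:
       hose, hosen
       hut, hüte
-->
```

Every singular/plural pair has to be listed explicitly in that file,
which is exactly the maintenance effort described above.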

Some background information:
Our documents contain products (id, name, brand, category, producttype,
description, color etc). The singular/plural issue basically applies to
the fields name, category and producttype, so we would like to restrict
the solution to these fields.
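
One way I could imagine restricting this to the three fields is a
separate search field that only receives copies of them. This is an
untested sketch: the field name "text_sp" and the field type name
"text_de_sp" are made up, and the field type is assumed to be defined
under <types> with whatever stemming or synonym analysis we end up
choosing.

```xml
<!-- Sketch only: "text_sp" and "text_de_sp" are hypothetical names.
     "text_de_sp" is assumed to exist under <types> with the chosen analysis. -->
<field name="text_sp" type="text_de_sp" indexed="true" stored="false"
       multiValued="true"/>

<!-- Only name, cat and type are copied; brand, tag etc. are left out,
     so they are not affected by the singular/plural handling. -->
<copyField source="name" dest="text_sp"/>
<copyField source="cat" dest="text_sp"/>
<copyField source="type" dest="text_sp"/>
```

Queries would then have to search this field in addition to the default
one, e.g. via the qf parameter if the dismax handler is used.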

Do you have any suggestions on how to handle this?

Thanks in advance for sharing your experiences,
cheers,
Martin

-------------------------------------------------------------
Extracts of our schema.xml:

  <types>
    <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>

    <fieldType name="trimmedString" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
      <!-- we should also configure lcasing for index and query analyzer -->
    </fieldType>
  </types>

  <fields>
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="cat" type="trimmedString" indexed="true" stored="true"
           multiValued="true" omitNorms="true"/>
    <field name="type" type="trimmedString" indexed="true" stored="true"
           multiValued="false" omitNorms="true"/>
  </fields>

  <defaultSearchField>text</defaultSearchField>

  <copyField source="tag" dest="text"/>
  <copyField source="cat" dest="text"/>
  <copyField source="name" dest="text"/>
  <copyField source="type" dest="text" />
  <copyField source="brand" dest="text" />
-------------------------------------------------------------

