Hi,

I am having a problem with the fact that no text analysis is performed on 
wildcard queries.  I have the following field type (a bit simplified):
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.ASCIIFoldingFilterFactory" />
      </analyzer>
    </fieldType>

My problem has to do with Icelandic characters.  When I index a document with a 
text field including the word "sjálfsögðu", it gets indexed as "sjalfsogdu" 
(because of the ASCIIFoldingFilterFactory, which replaces the Icelandic 
characters with their ASCII equivalents).  Then, when I search (without a 
wildcard) for "sjálfsögðu" or "sjalfsogdu", I get that document as a result.  
This is convenient since it enables people to search without using accented 
characters and yet get the results they want (e.g. if they are working on 
computers with English keyboards).

However, this all falls apart with wildcard searches: the search string isn't 
passed through the filters, so if I search for "sjálf*" I don't get any 
results, because the index doesn't contain the original words (I do get 
results if I search for "sjalf*").  I know people have been having a similar 
problem with the case sensitivity of wildcard queries, and most often the 
solution seems to be to lowercase the string before passing it on to Solr, 
which is not exactly an optimal solution (yet a simple one in that case).  The 
Icelandic characters complicate things a bit, and applying the same solution 
(doing the lowercasing and character mapping) in my application seems like 
unnecessary duplication of code that is already part of Solr, not to mention 
that it complicates my application and adds maintenance down the road.
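
To make the duplication concrete, here is a rough sketch of what the 
client-side workaround would look like.  It is written in Python purely for 
illustration; normalize_wildcard_term and the SPECIAL table are hypothetical 
names, and the mappings are my assumption of what ASCIIFoldingFilterFactory 
does for these characters, so they would need to be checked against the Solr 
version in use:

```python
import unicodedata

# Letters with no Unicode decomposition need explicit replacements.
# Assumption: these mirror the output of ASCIIFoldingFilterFactory.
SPECIAL = {"ð": "d", "þ": "th", "æ": "ae", "ø": "o", "ß": "ss"}


def normalize_wildcard_term(term: str) -> str:
    """Lowercase and ASCII-fold a query term, leaving '*' and '?' untouched."""
    lowered = term.strip().lower()
    # Replace the letters that NFD cannot split into base letter + accent.
    replaced = "".join(SPECIAL.get(ch, ch) for ch in lowered)
    # Split accented letters into base letter + combining mark, then drop
    # the combining marks (e.g. "á" becomes "a", "ö" becomes "o").
    decomposed = unicodedata.normalize("NFD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(normalize_wildcard_term("sjálf*"))
```

This only covers the three filters in my simplified field type; every change 
to the analyzer chain in the schema would have to be mirrored here by hand, 
which is exactly the maintenance burden I would like to avoid.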

Is there any way around this?  How are people solving this?  Is there a way to 
apply the filters to wildcard queries?  I guess removing the 
ASCIIFoldingFilterFactory is the simplest "solution" but this "normalization" 
(of the text done by the filter) is often very useful.

I hope I'm not overlooking some obvious explanation. :/

Thanks in advance,
Kári Hreinsson
