ICUFoldingFilter with swedish characters, and tokens with the keyword attribute?

jimi.hullegard Mon, 09 Jan 2017 22:03:09 -0800

Hi,

I wasn't happy with how our current Solr configuration handled diacritics (like
'é') in the text and in search queries, since it simply treated a letter with a
diacritic as a distinct letter. I.e. 'é' didn't match 'e', and vice versa.
Except for a handful of rare words where the diacritical mark in 'é' actually
changes the meaning of the word, it is mostly used in names of people and
places, and the expected behavior when searching is to not have to type it and
still get the expected results (like searching for 'Penelope Cruz' and getting
hits for 'Penélope Cruz').


When reading online about how to handle diacritics in Solr, it seems that the
general recommendation, when no language-specific solution exists that handles
this, is to use the ICUFoldingFilter. However, this filter doesn't really come
with a lot of documentation, and doesn't seem to have any configuration options
at all (at least none that are documented).

So what I ended up doing was simply to add the ICUFoldingFilterFactory in the
middle of the existing analyzer chain, like this:

<fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" " />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.KeywordRepeatFilterFactory" />
    <filter class="solr.ICUFoldingFilterFactory" />
    <filter class="solr.SwedishLightStemFilterFactory" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
  </analyzer>
</fieldType>


But that didn't really give me the results I wanted. For example, using the
analysis debug tool I see that the text 'café åäö' becomes 'cafe caf aao'. And
there are two problems with that result:

1. It doesn't respect the keyword attribute (the keyword-marked copy of 'café'
   is folded to 'cafe' as well)
2. It folds the Swedish characters 'åäö' into 'aao'

The disregard of the keyword attribute is bad enough, but the mangling of the
Swedish language is really a showstopper for us. The Swedish language doesn't
consider 'ö', for example, to be the letter 'o' with two diacritical dots above
it, just as 'Q' isn't considered to be the letter 'O' with a diacritical
"squiggly line" at the bottom. So when handling Swedish text, these characters
('åäöÅÄÖ') shouldn't be folded, because then there will be too many "collisions".

For example, when searching for 'påstå' ('claim'), one doesn't want hits about
'pasta' (you guessed it, it means 'pasta'), just as one doesn't want to get
hits about 'aga' ('corporal punishment, usually against children') when
searching for 'äga' ('to own'). Or even worse, when searching for 'höra' ('to
hear'), one most likely doesn't want hits about 'hora' ('prostitute'). And I
could go on... :)

So, is there a way for us to make the ICUFoldingFilter work in a better way?
I.e. configure it to respect the keyword attribute and to skip the 'åäö'
characters when folding, but otherwise fold all characters with diacritics into
their non-diacritical form. Or how would you recommend that we configure our
analyzer chain to accomplish this?
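
To make the desired folding behaviour concrete, here is a rough sketch in
Python. It is not Solr code and only an approximation: it just strips
combining marks after Unicode decomposition, which is far narrower than what
ICUFoldingFilter does. The names PROTECTED and fold_except_swedish are my own,
made up for this illustration.

```python
import unicodedata

# Letters that Swedish treats as distinct; these must survive folding.
# (Illustrative name, not part of any Solr or Lucene API.)
PROTECTED = set("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6")  # åäöÅÄÖ


def fold_except_swedish(text: str) -> str:
    """Strip diacritics from every character except the protected letters."""
    out = []
    # Compose first so that each protected letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        if ch in PROTECTED:
            out.append(ch)
            continue
        # Decompose the character, then drop combining marks (category Mn).
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if unicodedata.category(c) != "Mn"))
    return "".join(out)


if __name__ == "__main__":
    print(fold_except_swedish("caf\u00e9 \u00e5\u00e4\u00f6"))  # café åäö
    print(fold_except_swedish("Pen\u00e9lope Cruz"))            # Penélope Cruz
```

With a rule like this, 'café' would match 'cafe' and 'Penélope' would match
'Penelope', while 'påstå', 'äga' and 'höra' would stay distinct from 'pasta',
'aga' and 'hora'.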

Regards
/Jimi
