Hi,

I wasn't happy with how our current Solr configuration handled diacritics (like 'é') in the text and in search queries, since it simply treated a letter with a diacritic as a distinct letter. I.e. 'é' didn't match 'e', and vice versa. Except for a handful of rare words where the diacritical sign in 'é' actually changes the meaning of the word, it is mostly used in names of people and places, and the expected behavior when searching is that one doesn't have to type the diacritics and still gets the expected results (like searching for 'Penelope Cruz' and getting hits for 'Penélope Cruz').
When reading online about how to handle diacritics in Solr, it seems that the general recommendation, when no language-specific solution exists that handles this, is to use the ICUFoldingFilter. However, this filter doesn't come with a lot of documentation, and doesn't seem to have any configuration options at all (at least none that are documented). So what I ended up doing was simply to add the ICUFoldingFilterFactory in the middle of the existing analyzer chain, like this:

    <fieldType name="text_sv" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory" />
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([.])" replacement=" " />
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.KeywordRepeatFilterFactory" />
        <filter class="solr.ICUFoldingFilterFactory"/>
        <filter class="solr.SwedishLightStemFilterFactory" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
      </analyzer>
    </fieldType>

But that didn't give me the results I want. For example, using the analysis debug tool I see that the text 'café åäö' becomes 'cafe caf aao'. There are two problems with that result:

1. It doesn't respect the keyword attribute (the token marked as a keyword is folded as well)
2. It folds the Swedish characters 'åäö' into 'aao'

The disregard of the keyword attribute is bad enough, but the mangling of the Swedish language is really a show stopper for us. The Swedish language doesn't consider 'ö', for example, to be the letter 'o' with two diacritical dots above it, just as 'Q' isn't considered to be the letter 'O' with a diacritical "squiggly line" at the bottom. So when handling Swedish text, these characters ('åäöÅÄÖ') shouldn't be folded, because then there will be too many "collisions".
For example, when searching for 'påstå' ('claim'), one doesn't want hits about 'pasta' (you guessed it, it means 'pasta'), just as one doesn't want hits about 'aga' ('corporal punishment, usually against children') when searching for 'äga' ('to own'). Or even worse, when searching for 'höra' ('to hear'), one most likely doesn't want hits about 'hora' ('prostitute'). And I could go on... :)

So, is there a way for us to make the ICUFoldingFilter work better? I.e. configure it to respect the keyword attribute and to leave the 'åäö' characters alone when folding, but otherwise fold all characters with diacritics into their non-diacritical form. Or how would you recommend that we configure our analyzer chain to accomplish this?

Regards
/Jimi
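
PS. To make the desired folding behaviour concrete, here is a minimal Python sketch of what I have in mind. It is not Solr code and not how ICUFoldingFilter works internally: it only strips combining marks via Unicode decomposition, which is far less than full ICU folding does, and it ignores the keyword attribute entirely. The names `PROTECTED` and `fold_diacritics` are made up for this illustration, and the set of protected characters is an assumption based on the Swedish alphabet.

```python
import unicodedata

# Assumption: only these six characters need protecting for Swedish text.
PROTECTED = set("åäöÅÄÖ")


def fold_diacritics(text, protected=PROTECTED):
    """Strip combining marks from every character except the protected ones."""
    out = []
    # Compose first, so that e.g. 'a' + combining ring is seen as a single 'å'.
    for ch in unicodedata.normalize("NFC", text):
        if ch in protected:
            out.append(ch)
            continue
        # Decompose the character and drop the combining marks, keeping the base.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    for sample in ("Penélope Cruz", "café åäö", "påstå", "höra"):
        print(sample, "->", fold_diacritics(sample))
```

With this behaviour 'Penélope Cruz' would be indexed as 'Penelope Cruz', while 'påstå', 'äga' and 'höra' stay as they are and no longer collide with 'pasta', 'aga' and 'hora'.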