We are using Solr as the search engine for our public access library
catalog. In testing, I searched for a French movie that I know is in
the catalog, Kirikou et la sorcière, and nothing was returned.
If I search for just the word Kirikou, several results are returned,
and the problem becomes apparent: the records contain Kirikou et la
sorcie?re, where the accent is a Unicode combining character following
the e.
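To make the mismatch concrete, here is a minimal Python sketch (standard
library only; the variable names are just for illustration and are not part
of our Solr setup). It shows that the precomposed and decomposed spellings
look identical on screen but compare as different strings until they are
normalized:

```python
import unicodedata

# Precomposed form: "è" is the single code point U+00E8.
precomposed = "sorci\u00e8re"
# Decomposed form: plain "e" followed by U+0300 COMBINING GRAVE ACCENT,
# which is what the catalog records appear to contain.
decomposed = "sorcie\u0300re"

# The two strings render the same but are not equal code point by code point.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 8 9

# NFC normalization composes "e" + U+0300 into U+00E8, so the forms then match.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)            # True
```

The decomposed length of 9 agrees with the 14,23 offsets reported for this
token in the analysis output further down.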
After some research into Unicode normalization, I found and installed a
Unicode normalization filter that converts letters followed by combining
characters into their precomposed forms. I also added
solr.ISOLatin1AccentFilterFactory, which should then convert these
precomposed forms into their unaccented Latin equivalents. The
following is the fieldType definition taken from the schema.xml file:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="schema.UnicodeNormalizationFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
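Roughly, what I expect the two accent-related filters to do together can be
sketched in Python as follows. This is only an approximation under my own
assumptions: the function names are made up for illustration, and
strip_accents is a generic stand-in that drops combining marks, not the
actual mapping solr.ISOLatin1AccentFilterFactory applies.

```python
import unicodedata

def compose(text):
    # Step 1: turn "letter + combining mark" into the precomposed letter (NFC),
    # which is what the normalization filter is meant to do.
    return unicodedata.normalize("NFC", text)

def strip_accents(text):
    # Step 2 (approximation): decompose, then drop every combining mark,
    # leaving only the unaccented base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold(text):
    # Intended net effect of running both steps in order.
    return strip_accents(compose(text))

# Precomposed, decomposed, and plain ASCII spellings of the same word.
for raw in ("sorci\u00e8re", "sorcie\u0300re", "sorciere"):
    print(fold(raw))  # each line prints: sorciere
```

If the filters behaved like this, all three spellings would reduce to the
same token at both index and query time.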
So it seems like this should work.
However, searching for Kirikou et la sorcière, sorcière,
sorcie?re, or just sorciere still doesn't return the document in question.
I've tried looking at the output of solr/admin/analysis.jsp, entering
text from the record as the Field value (Index) and sorciere as the
Field value (Query). I get the following results, which seem to
indicate that there should be a match between the stemmed entry
sorcier in the record and the stemmed word sorcier from the query.
So clearly I am either doing something wrong or misinterpreting the
analyzers, but I am at a loss as to how to figure out what is wrong.
Any suggestions?
org.apache.solr.analysis.WhitespaceTokenizerFactory {}
term position  term text       term type  source start,end
1              Kirikou         word       0,7
2              et              word       8,10
3              la              word       11,13
4              sorcie?re       word       14,23
5              France          word       25,31
6              3               word       32,33
7              Cinema          word       34,40
8              /               word       41,42
9              RTBF            word       43,47
10             (Te?le?vision   word       48,61
11             belge).         word       62,69
12             Grand           word       72,77
13             Prix            word       78,82
14             du              word       83,85
15             festival        word       86,94
16             d'Annecy        word       95,103
17             1999            word       104,108
18             France          word       110,116
19             French          word       117,123
20             VHS             word       124,127
21             VIDEO           word       129,134
22             .VHS10969       word       135,144
23             1               word       147,148
24             vide?ocassette  word       149,163
25             (1h10           word       164,169
26             min.)           word       170,175
27             (VHS)           word       176,181
28             Ocelot,         word       183,190
29             Michel          word       191,197
schema.UnicodeNormalizationFilterFactory {}
term position  term text                term type  source start,end
1              (Kirikou,0,7)            word       0,7
2              (et,8,10)                word       8,10
3              (la,11,13)               word       11,13
4              (sorcière,14,23)         word       14,23
5              (France,25,31)           word       25,31
6              (3,32,33)                word       32,33
7              (Cinema,34,40)           word       34,40
8              (/,41,42)                word       41,42
9              (RTBF,43,47)             word       43,47
10             ((Télévision,48,61)      word       48,61
11             (belge).,62,69)          word       62,69
12             (Grand,72,77)            word       72,77
13             (Prix,78,82)             word       78,82
14             (du,83,85)               word       83,85
15             (festival,86,94)         word       86,94
16             (d'Annecy,95,103)        word       95,103
17             (1999,104,108)           word       104,108
18             (France,110,116)         word       110,116
19             (French,117,123)         word       117,123
20             (VHS,124,127)            word       124,127
21             (VIDEO,129,134)          word       129,134
22             (.VHS10969,135,144)      word       135,144
23             (1,147,148)              word       147,148
24             (vidéocassette,149,163)  word       149,163
25             ((1h10,164,169)          word       164,169
26             (min.),170,175)          word       170,175
27             ((VHS),176,181)          word       176,181
28             (Ocelot,,183,190)        word       183,190
29             (Michel,191,197)         word       191,197
org.apache.solr.analysis.ISOLatin1AccentFilterFactory {}
term position 1 2