diacritics on query string

Andrea Gazzarini Fri, 13 Aug 2010 00:21:30 -0700

Hi,
I have a problem with a diacritic character in my query string:


q=intertestualità

which is encoded as

q=intertestualit%E0

What I don't understand are the following fragments of the query response:

<lst  name="responseHeader">
 <int  name="status">0</int>
 <int  name="QTime">23</int>
 <lst  name="params">
  <str  name="sort">score desc</str>
  <str  name="fl">score,title</str>

  <str  name="debugQuery">on</str>
  <str  name="indent">on</str>
  <str  name="start">0</str>
  <str  name="q">intertestualit</str>
  <str  name="version">2.2</str>

  <str  name="rows">3</str>
 </lst>

and

<lst  name="debug">
 <str  name="rawquerystring">intertestualit</str>
 <str  name="querystring">intertestualit</str>

I saw that my index contains the token "intertestualita" (with the 'à' character replaced by 'a'). Indeed, if I query for "intertestualita" I get my results.
The queried field is configured with the same diacritic-removing filter in both the index and the query analyzer:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
                        composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
                <filter class="solr.StopFilterFactory" ignoreCase="true"
                        words="stopwords.txt" enablePositionIncrements="true" />
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                        catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        </analyzer>
        <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory" />
                <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j"
                        composed="false" remove_diacritics="true" remove_modifiers="true" fold="true" />
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"
                        expand="true" />
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"
                        enablePositionIncrements="true" />
                <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
                        catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        </analyzer>
</fieldtype>
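
To make the indexing behaviour concrete, here is a rough Python sketch of what a diacritic-removing filter followed by lowercasing presumably does to a token. It is only an approximation: `schema.UnicodeNormalizationFilterFactory` is a custom class whose exact behaviour is not shown here, and the helper name `fold_diacritics` is hypothetical.

```python
import unicodedata


def fold_diacritics(token: str) -> str:
    """Approximate a decompose-and-strip filter (assumed behaviour, not the real class)."""
    # Decompose, so that 'à' becomes 'a' followed by a combining grave accent.
    decomposed = unicodedata.normalize("NFD", token)
    # Drop the combining marks (Unicode category Mn), keeping the base letters.
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Lowercase, as the LowerCaseFilterFactory step would.
    return stripped.lower()


folded = fold_diacritics("intertestualità")    # the term as typed by the user
truncated = fold_diacritics("intertestualit")  # the term as it appears in the response
```

Under these assumptions the full term folds to the indexed token "intertestualita", while the truncated term stays "intertestualit" and so cannot match it.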

So my question is: what is removing the "à" (%E0) character from the input query? It seems that the query arrives at Solr already without that character...
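
One thing worth checking is the character encoding used on each side of the request. %E0 is the single-byte ISO-8859-1 encoding of 'à'; in UTF-8 the same character is percent-encoded as %C3%A0. The sketch below uses only Python's standard `urllib.parse` to show both encodings and what happens when the ISO-8859-1 form is decoded as UTF-8. Whether the servlet container or Solr actually decodes the query string this way is an assumption, not something the response above proves.

```python
from urllib.parse import quote, unquote

term = "intertestualità"

# Percent-encode the same term with two different charsets.
latin1_encoded = quote(term, encoding="iso-8859-1")
utf8_encoded = quote(term, encoding="utf-8")

# Decode the ISO-8859-1 form as if it were UTF-8. The lone byte 0xE0 is not
# a valid UTF-8 sequence, so the decoder has to replace it or drop it.
decoded_replace = unquote(latin1_encoded, encoding="utf-8", errors="replace")
decoded_ignore = unquote(latin1_encoded, encoding="utf-8", errors="ignore")

# The UTF-8 form survives the round trip unchanged.
decoded_utf8 = unquote(utf8_encoded, encoding="utf-8")
```

With a lenient decoder that silently drops invalid bytes, the result is "intertestualit", which matches the `q` value echoed in the response header. If that is the cause, sending q=intertestualit%C3%A0 should be a quick way to confirm it.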


Regards,
Andrea
