Eric,
Your solution performs Spanish-language Porter stemming, which will have
other effects that you may or may not want, depending on your use case.
I think you can accomplish what you want by using ICUFoldingFilterFactory
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
which should simply perform ICU-based (cf. http://site.icu-project.org/)
character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html).
In schema.xml I generally have in both index and query:
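As a minimal sketch, the filter line might sit in a field type like the one below. The field type name and the tokenizer are assumptions for illustration only, and the ICU analysis jars need to be on Solr's classpath. An analyzer element without a type attribute is applied at both index and query time.

```xml
<!-- Illustrative only: the name "text_folded" and the tokenizer choice are hypothetical -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute, so this analyzer is used for both index and query -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Folds case, accents, and diacritics to a common base form -->
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```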
That should take care of many search issues relating to diacritics and
accents. For example, I wanted to have Łódź and Lodz index and search
identically, and this does that.
If you are using Tomcat, you might also want to set up the URIEncoding. See
https://wiki.apache.org/solr/SolrTomcat and the line on that page that
includes the URIEncoding attribute.
For example it might be like:
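A sketch of a Connector element in Tomcat's conf/server.xml with the attribute set. The port, protocol, and timeout values are typical defaults and are assumptions here, so keep whatever your installation already uses and only add the URIEncoding attribute.

```xml
<!-- conf/server.xml: port, protocol, and timeout values are illustrative assumptions -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```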
By the way, I also wanted to have ö, ä, and ü index and query the same as
oe, ae, and ue because those are very common variants in German terms
rendered in English texts. The only way I could figure out how to
accomplish that was to use a charFilter by creating a file named
mapping-GermanUmlauts.txt
containing
"ae" => "a"
"oe" => "o"
"ue" => "u"
and then I added this after the filter class=solr.ICUFoldingFilterFactory:
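A minimal sketch of such a charFilter line, assuming the stock solr.MappingCharFilterFactory and assuming that mapping-GermanUmlauts.txt sits in the core's conf directory next to schema.xml. Char filters run on the raw text before the tokenizer, wherever the line appears inside the analyzer element.

```xml
<!-- Assumed class and file location; applies the mapping file before tokenization -->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-GermanUmlauts.txt"/>
```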
I hope this is helpful.
-- Forwarded message --
From: Eric Lease Morgan
Date: Mon, Feb 16, 2015 at 4:58 PM
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics,
resolved (i think) ]
To: CODE4LIB@listserv.nd.edu
I know the documents I’m indexing are written in Spanish, and by adding the
following filters to my field definition, I believe I have resolved my
problem:
In other words, my searchable content is defined thus:
And “text_general” is defined to include the filters in both the index and
query sections: