Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]
On Feb 16, 2015, at 4:54 PM, Levy, Michael ml...@ushmm.org wrote:

> I think you can accomplish what you want by using ICUFoldingFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
> which should simply perform ICU (cf. http://site.icu-project.org/) based
> character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html).
>
> In schema.xml I generally have in both index and query:
>
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.ICUFoldingFilterFactory"/>

For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but my interface nonetheless works as expected. It took a combination of things to get there.

First, I needed to tell the indexer my content was Spanish; after doing so, Solr parses things correctly.

Second, I needed to explicitly tell my Web browser that the search form and returned content were using UTF-8. This was done in the HTTP content-type header, in the HTML meta tag, and even in the HTML form. Geesh!

Through this whole process I also learned about Solr’s edismax (extended dismax) handler. Edismax supports free-form queries as well as Boolean logic.

solr++ But also solr+- because Solr is getting more and more and more complicated.

--
Eric “Lost In Chicago” Morgan
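For anyone following along, here is a minimal Python sketch of what an edismax request with a UTF-8 encoded query might look like. The host, port, and core name (`collection1`) are assumptions for illustration only, as is the helper name `build_edismax_url`; the field name `text` is the one used elsewhere in this thread. The `defType`, `q`, `qf`, `rows`, and `wt` parameters are standard Solr request parameters.

```python
from urllib.parse import urlencode

def build_edismax_url(base_url, user_query, fields="text", rows=10):
    """Build a Solr select URL that uses the edismax query parser.

    urlencode() percent-encodes non-ASCII characters as UTF-8 by default,
    so accented characters reach Solr as UTF-8 bytes.
    """
    params = {
        "defType": "edismax",  # use the extended dismax parser
        "q": user_query,       # free-form or Boolean query text
        "qf": fields,          # field(s) to search
        "rows": rows,
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)

# Hypothetical host and core name; adjust for your own installation.
url = build_edismax_url(
    "http://localhost:8983/solr/collection1/select",
    "educación AND historia",
)
print(url)
```

The accented "ó" ends up in the URL as `%C3%B3`, its UTF-8 byte sequence, which is the same encoding the form and response headers need to declare.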
Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]
I know the documents I’m indexing are written in Spanish, and by adding the following filters to my field definition, I believe I have resolved my problem:

  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>

In other words, my searchable content is defined thus:

  <field name="text" type="text_general" indexed="true" stored="true" multiValued="false"/>

And “text_general” is defined to include the filters in both the index and query sections:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
  </fieldType>
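As a rough illustration of why the index and query sides have to normalize text the same way, here is a small Python sketch of lowercasing plus accent folding. This is not what the Snowball or ICU filters do internally; it is only a stand-in (the function name `fold` is made up for this example) that shows how accented and unaccented spellings can be reduced to one token.

```python
import unicodedata

def fold(text):
    """Lowercase the text and strip combining accent marks.

    NFD normalization splits a character such as 'ó' into 'o' plus a
    combining acute accent; dropping the combining marks leaves 'o'.
    """
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The indexed form and the typed query end up as the same string.
print(fold("Educación") == fold("educacion"))
```

Note that this simple approach also turns "ñ" into "n", which may or may not be what you want for Spanish, where the two are distinct letters.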