Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]
On Feb 16, 2015, at 4:54 PM, Levy, Michael ml...@ushmm.org wrote:

> I think you can accomplish what you want by using ICUFoldingFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
> which should simply perform ICU (cf. http://site.icu-project.org/) based
> character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html).
>
> In schema.xml I generally have in both index and query:
>
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>

For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but nonetheless, my interface now works as expected. It took a combination of things.

First, I needed to tell the indexer my content was Spanish, and after doing so, Solr parses things correctly.

Second, I needed to explicitly tell my Web browser that the search form and returned content were using UTF-8. This was done in the HTTP Content-Type header, the HTML meta tag, and even in the HTML form. Geesh!

Through this whole process I also learned about Solr’s edismax (extended dismax) handler. Edismax supports free-form queries as well as Boolean logic.

solr++ But also solr+- because Solr is getting more and more and more complicated.

-- Eric “Lost In Chicago” Morgan
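For anyone following along, the UTF-8 and edismax points can be seen from the client side as well. What follows is only a minimal sketch, not the setup described in this message: the host, port, and core name in `SOLR_SELECT` are assumptions, and the query text is made up for illustration. The `q`, `defType`, and `wt` request parameters are standard Solr ones. The sketch shows the query being percent-encoded as UTF-8 before it is sent.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical endpoint: host, port, and core name are assumptions,
# not taken from this thread.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def build_edismax_url(user_query, base=SOLR_SELECT):
    """Return a select URL asking Solr to parse user_query with edismax."""
    params = {
        "q": user_query,        # free-form text, Boolean operators allowed
        "defType": "edismax",   # select the extended dismax query parser
        "wt": "json",           # ask for a JSON response
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default.
    return base + "?" + urlencode(params)


url = build_edismax_url("ejecución AND derechos")
print(url)

# Decoding the query string again should give back the original text.
print(parse_qs(url.split("?", 1)[1])["q"][0])
```

If the ó arrives at Solr as anything other than the two bytes %C3%B3, something between the form and the servlet container is not using UTF-8.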
Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]
I know the documents I’m indexing are written in Spanish, and by adding the following filters to my field definition, I believe I have resolved my problem:

    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>

In other words, my searchable content is defined thus:

    <field name="text" type="text_general" indexed="true" stored="true" multiValued="false"/>

And “text_general” is defined to include the filters in both the index and query sections:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
      </analyzer>
    </fieldType>
Re: [CODE4LIB] indexing word documents using solr [diacritics]
Ah, the wonderful world of character encoding... To quote the Solr wiki:

    There are no known bugs with Solr's character handling, but there have
    been some reported issues with the way different application servers (and
    different versions of the same application server) treat incoming and
    outgoing multibyte characters. In particular, people have reported better
    success with Tomcat than with Jetty...
    (https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F)

I'd probably start by enabling UTF-8 in Tomcat/Jetty and see if that resolves the issue. If not, I'd check the original files to see what their character encoding is, and then check each application that handles the documents to make sure it's using that encoding. It might be that the originals aren't in UTF-8, or if they are, that somewhere along the way the parser, the Perl interface, or some other unknown culprit is attempting to change it.

Regards,
Karl Holten
Systems Integration Specialist
SWITCH Inc
414-382-6711

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric Lease Morgan
Sent: Thursday, February 12, 2015 2:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics]

How do I retain diacritics in a Solr index, and how do I search for words containing them? I have extracted the plain text out of a set of Word documents.
I have then used a Perl interface (WebService::Solr) to add the plain text to a Solr index using a field type called text_general:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

It seems as if I am unable to search for words like ejecución because the diacritic gets in the way. What am I doing wrong?

-- Eric
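A quick way to tell an encoding problem from an analysis problem is to look at the extracted text outside Solr. The sketch below is only an illustration and not part of the original setup. `fold_diacritics` is a hypothetical helper that roughly approximates accent folding with Python's standard `unicodedata` module; it is not what any Solr filter actually runs, and ICU folding covers far more cases. `as_misread_latin1` shows what UTF-8 text looks like after some step in the pipeline has decoded it as Latin-1.

```python
import unicodedata


def fold_diacritics(text):
    """Rough stand-in for accent folding: decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def as_misread_latin1(text):
    """Return text as it appears when its UTF-8 bytes are read as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


word = "ejecución"

# With folding applied at both index and query time, accented and
# unaccented spellings reduce to the same token.
print(fold_diacritics(word))        # ejecucion

# If the indexed text looks like this, the problem is encoding, not analysis.
print(as_misread_latin1(word))      # ejecuciÃ³n
```

If the plain text extracted from the Word documents already contains sequences like "Ã³", no analyzer change will help until the encoding is fixed upstream.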