Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-20 Thread Eric Lease Morgan
On Feb 16, 2015, at 4:54 PM, Levy, Michael wrote:

> I think you can accomplish what you want by using ICUFoldingFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
> 
> which should simply perform ICU (cf http://site.icu-project.org/) based 
> character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)
> 
> In schema.xml I generally have in both index and query:
> 
> 
> 


For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but 
nonetheless, my interface works as expected. And I was able to do this after a 
combination of things. First, I needed to tell the indexer my content was 
Spanish, and after doing so, Solr parses things correctly. Second, I needed to 
explicitly tell my Web browser that the search form and returned content were 
using UTF-8. This was done in the HTTP content-type header, the HTML meta tag, and 
even in the HTML form. Geesh! Through this whole process I also learned about 
Solr’s edismax (extended dismax) handler. Edismax supports free form queries as 
well as Boolean logic.  solr++  But also solr+- because Solr is getting more 
and more and more complicated. —Eric “Lost In Chicago” Morgan
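
A minimal sketch of the browser-side declarations described above, written as
XHTML. The action URL and the parameter name q are hypothetical; the HTTP
header itself (Content-Type: text/html; charset=UTF-8) has to be sent by
whatever serves the page.

```xml
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <!-- Declare the page encoding; mirrors the HTTP header
         Content-Type: text/html; charset=UTF-8 -->
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Search</title>
  </head>
  <body>
    <!-- accept-charset asks the browser to submit the query as UTF-8;
         the action URL and the field name are hypothetical -->
    <form action="/search" method="get" accept-charset="UTF-8">
      <p>
        <input type="text" name="q" />
        <input type="submit" value="Search" />
      </p>
    </form>
  </body>
</html>
```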

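Edismax can be made the default query parser in solrconfig.xml. A hedged
sketch, assuming a handler named /select and a searchable field named text
(both names are assumptions, not taken from the message above):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- use the extended dismax query parser by default -->
    <str name="defType">edismax</str>
    <!-- qf lists the fields to search; "text" is a hypothetical field name -->
    <str name="qf">text</str>
  </lst>
</requestHandler>
```

With something like this in place, a free-form query is searched across the
qf fields, and a query such as q=amor AND NOT muerte is still parsed with
Boolean operators.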

Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-16 Thread Levy, Michael
Eric,

Your solution will have other effects because it also performs Spanish-language
Porter stemming, which you may or may not want depending on your use case.

I think you can accomplish what you want by using ICUFoldingFilterFactory
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory

which should simply perform ICU (cf http://site.icu-project.org/) based
character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)

In schema.xml I generally have in both index and query:

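A minimal sketch of that filter line, as it would sit inside both the index
and query analyzers of a field type. It assumes the ICU jars from Solr's
analysis-extras contrib are on the classpath; without them the factory class
cannot be loaded.

```xml
<!-- place inside both <analyzer type="index"> and <analyzer type="query"> -->
<filter class="solr.ICUFoldingFilterFactory"/>
```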



That should take care of many search issues relating to diacritics and
accents. For example, I wanted to have Łódź and Lodz index and search
identically, and this does that.

If you are using Tomcat, you might also want to set up the URIEncoding. See
https://wiki.apache.org/solr/SolrTomcat and the line on that page
that sets URIEncoding.

For example it might be like:

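A sketch of such a Connector element in Tomcat's conf/server.xml. The port,
timeout, and redirect values are Tomcat's stock defaults and are assumptions
here; the relevant part is the URIEncoding attribute.

```xml
<!-- URIEncoding makes Tomcat decode query-string parameters as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```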

By the way, I also wanted to have ö, ä, and ü index and query the same as
oe, ae, and ue because those are very common variants in German terms
rendered in English texts. The only way I could figure out how to
accomplish that was to use a charFilter by creating a file named
mapping-GermanUmlauts.txt
containing
"ae" => "a"
"oe" => "o"
"ue" => "u"
and then I added this after the filter class=solr.ICUFoldingFilterFactory:

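A sketch of the corresponding charFilter line, assuming
mapping-GermanUmlauts.txt sits in the core's conf directory next to
schema.xml:

```xml
<!-- char filters rewrite the raw text before it reaches the tokenizer -->
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-GermanUmlauts.txt"/>
```

With this mapping, "Mueller" is rewritten to "Muller" before tokenizing, and
ICU folding then reduces both it and "Müller" to the same term. The mapping
applies to every "ae", "oe", and "ue" in the text, including those in
ordinary English words.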

I hope this is helpful.

-- Forwarded message --
From: Eric Lease Morgan 
Date: Mon, Feb 16, 2015 at 4:58 PM
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics,
resolved (i think) ]
To: CODE4LIB@listserv.nd.edu


I know the documents I’m indexing are written in Spanish, and adding the
following filters to my field definition, I believe I have resolved my
problem:

  
  
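A sketch of what such filters can look like, assuming lowercasing plus the
Snowball (Porter) stemmer for Spanish. The exact classes are an assumption
based on the reply's mention of Spanish Porter stemming.

```xml
<filter class="solr.LowerCaseFilterFactory"/>
<!-- Snowball stemmer; the language attribute selects the Spanish rules -->
<filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
```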

In other words, my searchable content is defined thus:

  
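A sketch of such a field definition in schema.xml. The field name text and
the attribute values are hypothetical; only the type name text_general comes
from the message.

```xml
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
```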

And “text_general” is defined to include the filters in both the index and
query sections:

  

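A hedged sketch of a text_general field type with Spanish stemming in both
analyzers. It follows the stock text_general definition from the Solr example
schema with a stemmer appended; the tokenizer, the stopword and synonym file
names, and the filter order are assumptions.

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stem Spanish terms at index time -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- apply the same stemming at query time so terms match the index -->
    <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
  </analyzer>
</fieldType>
```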
  
  
  
  


  
  
  
  
  

  


Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-16 Thread Eric Lease Morgan
I know the documents I’m indexing are written in Spanish, and adding the 
following filters to my field definition, I believe I have resolved my problem:

  
  

In other words, my searchable content is defined thus:

  

And “text_general” is defined to include the filters in both the index and 
query sections: