
Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-20 Thread Eric Lease Morgan

On Feb 16, 2015, at 4:54 PM, Levy, Michael ml...@ushmm.org wrote:

 I think you can accomplish what you want by using ICUFoldingFilterFactory
 https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
 
 which should simply perform ICU (cf http://site.icu-project.org/) based 
 character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)
 
 In schema.xml I generally have in both index and query:
 
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.ICUFoldingFilterFactory"/>
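
To illustrate what this kind of character folding buys you at search time, here is a minimal Python sketch. It is only a rough stand-in built on the standard library's unicodedata module, not the ICU implementation, and the function name fold is made up for this example. It assumes that stripping combining marks and case-folding is close enough to show the idea.

```python
import unicodedata


def fold(text: str) -> str:
    """Strip diacritics and fold case (a rough approximation of ICU folding)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(fold("Corazón"))               # corazon
print(fold("niño") == fold("NINO"))  # True
```

When the same folding runs at both index and query time, a search for "corazon" can match documents containing "corazón", and vice versa.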


For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but 
nonetheless, my interface works as expected. And I was able to do this after a 
combination of things. First, I needed to tell the indexer my content was 
Spanish, and after doing so, Solr parses things correctly. Second, I needed to 
explicitly tell my Web browser that the search form and returned content were 
using UTF-8. This was done in the HTTP Content-Type header, the HTML meta tag, and 
even in the HTML form. Geesh! Through this whole process I also learned about 
Solr’s edismax (extended dismax) handler. Edismax supports free form queries as 
well as Boolean logic.  solr++  But also solr+- because Solr is getting more 
and more and more complicated. —Eric “Lost In Chicago” Morgan
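
For anybody following along, the UTF-8 and edismax pieces might be sketched in Python roughly as follows. This is not the actual interface described above. The Solr URL, the core name, the form action, and the helper name build_query_url are all hypothetical; defType, qf, and wt are standard Solr request parameters, and "text" is the field name used elsewhere in this thread.

```python
from urllib.parse import urlencode

# Hypothetical Solr location and core name; adjust for a real installation.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"

# 1. The HTTP header that tells the browser the response is UTF-8.
CONTENT_TYPE_HEADER = ("Content-Type", "text/html; charset=utf-8")

# 2. and 3. The HTML meta tag and the form's accept-charset attribute.
SEARCH_FORM = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>Search</title></head>
<body>
<form method="get" action="/search" accept-charset="utf-8">
<input type="text" name="q"><input type="submit" value="Search">
</form>
</body>
</html>"""


def build_query_url(user_query: str) -> str:
    """Build a Solr select URL that hands the query to the edismax parser."""
    params = {
        "q": user_query,       # free-form or Boolean query text
        "defType": "edismax",  # select the extended dismax query parser
        "qf": "text",          # field(s) to search
        "wt": "json",
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes.
    return SOLR_SELECT + "?" + urlencode(params, encoding="utf-8")


print(build_query_url("corazón AND niño"))
```

The point of the sketch is that the accented characters travel as UTF-8 at every hop: in the form, in the query string, and in the response.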

Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-16 Thread Eric Lease Morgan

I know the documents I’m indexing are written in Spanish, and by adding the 
following filters to my field definition, I believe I have resolved my problem:

  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>

In other words, my searchable content is defined thus:

  <field name="text" type="text_general" indexed="true" stored="true"
         multiValued="false"/>

And “text_general” is defined to include the filters in both the index and 
query sections:

  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
  </fieldType>
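
Why the same filters belong in both the index and query sections can be shown with a toy Python sketch. It is not Solr code. The regular expression is a crude stand-in for StandardTokenizerFactory, the stopword set is invented for illustration (it is not the contents of stopwords.txt), and the synonym and Snowball stemming steps are left out because the standard library has no equivalent.

```python
import re

# Illustrative Spanish stopwords only; a real stopwords.txt will differ.
STOPWORDS = {"el", "la", "los", "las", "de", "y"}


def analyze(text):
    """Toy analysis chain: tokenize, drop stopwords, lowercase."""
    tokens = re.findall(r"\w+", text)  # crude stand-in for the tokenizer
    # Stop filter with ignoreCase=true: compare lowercased tokens.
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return [t.lower() for t in tokens]  # lowercase filter


# Index time and query time run the same chain, so the terms line up.
index_terms = set(analyze("La Casa de los Espíritus"))
query_terms = analyze("CASA espíritus")
print(all(term in index_terms for term in query_terms))  # True
```

If the query side skipped the lowercase step, "CASA" would never match the indexed term "casa", which is the kind of mismatch the paired analyzer sections avoid.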
