Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-20 Thread Eric Lease Morgan

On Feb 16, 2015, at 4:54 PM, Levy, Michael ml...@ushmm.org wrote:

 I think you can accomplish what you want by using ICUFoldingFilterFactory
 https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
 
 which should simply perform ICU (cf http://site.icu-project.org/) based 
 character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)
 
 In schema.xml I generally have in both index and query:
 
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.ICUFoldingFilterFactory"/>
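
To illustrate what character folding means for a word like "ejecución", here is a minimal Python sketch. It is only a rough stand-in that strips combining marks after Unicode decomposition; it is not the ICU folding algorithm Solr applies, and the function name fold_diacritics is made up for this example.

```python
import unicodedata


def fold_diacritics(text):
    """Lowercase text and strip combining marks (a rough stand-in for ICU folding)."""
    # NFD splits a precomposed character such as "ó" into "o" plus U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    # Precomposed, decomposed, and unaccented spellings of the same word.
    for word in ("ejecución", "Ejecucio\u0301n", "ejecucion"):
        print(repr(word), "->", fold_diacritics(word))
```

If the same folding runs at both index and query time, all three spellings reduce to the same token, so a search with or without the accent can match. Note that this sketch also folds "ñ" to "n", which may or may not be wanted for Spanish text.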


For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but 
nonetheless, my interface works as expected. And I was able to do this after a 
combination of things. First, I needed to tell the indexer my content was 
Spanish, and after doing so, Solr parses things correctly. Second, I needed to 
explicitly tell my Web browser that the search form and returned content were 
using UTF-8. This was done in the HTTP Content-Type header, the HTML meta tag, and 
even in the HTML form. Geesh! Through this whole process I also learned about 
Solr’s edismax (extended dismax) handler. Edismax supports free form queries as 
well as Boolean logic.  solr++  But also solr+- because Solr is getting more 
and more and more complicated. —Eric “Lost In Chicago” Morgan
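
The three UTF-8 declarations and an edismax request might look roughly like the Python sketch below. The Solr URL, the core name "example", the form action "/search", and both function names are assumptions made for illustration; the thread does not give the actual values. The defType=edismax parameter is the standard way to select the extended dismax parser.

```python
from urllib.parse import urlencode

# Hypothetical Solr select URL; the thread does not name the host or core.
SOLR_SELECT_URL = "http://localhost:8983/solr/example/select"


def search_page():
    """Return (HTTP headers, HTML) that declare UTF-8 in all three places."""
    # 1. The HTTP Content-Type header.
    headers = {"Content-Type": "text/html; charset=utf-8"}
    html = (
        "<!DOCTYPE html>\n"
        # 2. The HTML meta tag.
        '<html><head><meta charset="utf-8"></head><body>\n'
        # 3. The form itself, via accept-charset.
        '<form method="get" action="/search" accept-charset="utf-8">\n'
        '<input type="text" name="q"><input type="submit" value="Search">\n'
        "</form></body></html>\n"
    )
    return headers, html


def edismax_url(query):
    """Build a select URL asking Solr to parse the query with edismax."""
    params = {"q": query, "defType": "edismax", "wt": "json"}
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes by default.
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(search_page()[0])
    print(edismax_url("ejecución AND contrato"))
```

The point of the second function is that the accented character reaches Solr as UTF-8 bytes in the URL, which only helps if the servlet container also decodes request URLs as UTF-8.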

Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-16 Thread Eric Lease Morgan

I know the documents I’m indexing are written in Spanish, and adding the 
following filters to my field definition, I believe I have resolved my problem:

  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>

In other words, my searchable content is defined thus:

  <field name="text" type="text_general" indexed="true" stored="true" multiValued="false"/>

And “text_general” is defined to include the filters in both the index and 
query sections:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
  </fieldType>

Re: [CODE4LIB] indexing word documents using solr [diacritics]

2015-02-12 Thread Karl Holten

Ah, the wonderful world of character encoding...

To quote the Solr wiki:
There are no known bugs with Solr's character handling, but there have been 
some reported issues with the way different application servers (and different 
versions of the same application server) treat incoming and outgoing multibyte 
characters. In particular, people have reported better success with Tomcat than 
with Jetty... 
(https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F )

I'd probably start by enabling UTF-8 in Tomcat/Jetty and see if that resolves 
the issue. 

If not, I'd check the original files to see what their character encoding is, and 
then check each application that handles the documents to make sure it's using 
that encoding. It might be that the original isn't in UTF-8, or if it is, that 
somewhere along the way the parser, the perl interface, or some other unknown 
culprit is attempting to change it.
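
One way to run that check on the extracted text is sketched below in Python. This is an assumption-laden illustration rather than part of anyone's actual pipeline: the function names are invented, and it only tests whether the bytes are valid UTF-8, not which other encoding they might be in.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_file(path):
    """Read a file as raw bytes and report whether it is valid UTF-8."""
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())


if __name__ == "__main__":
    word = "ejecución"
    # The same word encoded two ways: only the first is valid UTF-8.
    print(is_valid_utf8(word.encode("utf-8")))
    print(is_valid_utf8(word.encode("latin-1")))
    # UTF-8 bytes misread as Latin-1 produce the familiar two-character garbage.
    print(word.encode("utf-8").decode("latin-1"))
```

A pure-ASCII file passes this check no matter how it was saved, so it is only informative for files that actually contain accented characters. Seeing "Ã³" where "ó" should be usually means UTF-8 bytes were decoded as Latin-1 somewhere along the way.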

Regards,
Karl Holten
Systems Integration Specialist
SWITCH Inc
414-382-6711

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Lease Morgan
Sent: Thursday, February 12, 2015 2:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics]

How do I retain diacritics in a Solr index, and how do I search for words 
containing them?

I have extracted the plain text out of a set of Word documents. I have then used 
a Perl interface (WebService::Solr) to add the plain text to a Solr index using 
a field type called text_general:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

It seems as if I am unable to search for words like ejecución because the 
diacritic gets in the way. What am I doing wrong?

— 
Eric
