Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]
On Feb 16, 2015, at 4:54 PM, Levy, Michael ml...@ushmm.org wrote:

> I think you can accomplish what you want by using ICUFoldingFilterFactory
> https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
> which should simply perform ICU (cf. http://site.icu-project.org/) based
> character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html).
>
> In schema.xml I generally have in both index and query:
>
>   <tokenizer class="solr.StandardTokenizerFactory"/>
>   <filter class="solr.ICUFoldingFilterFactory"/>

For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but nonetheless, my interface works as expected. I got there through a combination of things.

First, I needed to tell the indexer my content was Spanish, and after doing so, Solr parses things correctly.

Second, I needed to explicitly tell my Web browser that the search form and returned content were using UTF-8. This was done in the HTTP Content-Type header, the HTML meta tag, and even in the HTML form. Geesh!

Through this whole process I also learned about Solr’s edismax (extended dismax) handler. Edismax supports free-form queries as well as Boolean logic.

solr++ But also solr+- because Solr is getting more and more and more complicated.

-- Eric “Lost In Chicago” Morgan
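To make "character folding" concrete, here is a minimal Python sketch of the idea. It is only a rough stand-in for what ICUFoldingFilterFactory does inside Solr, not its actual implementation; the function name fold_diacritics is hypothetical.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Lowercase text and strip combining marks.

    A rough approximation of character folding; ICU folding in Solr
    covers many more cases than this sketch does.
    """
    # NFD splits "ó" into "o" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


# The same folding is applied at index time and at query time,
# so the accented and unaccented spellings meet on one token.
indexed_token = fold_diacritics("Ejecución")
query_token = fold_diacritics("ejecucion")
print(indexed_token, query_token, indexed_token == query_token)
```

Because the same function runs on both sides, a search for either spelling matches. Note that this sketch also turns ñ into n, which may or may not be what a Spanish-language index wants.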
Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]
I know the documents I’m indexing are written in Spanish, and by adding the following filters to my field definition, I believe I have resolved my problem:

  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>

In other words, my searchable content is defined thus:

  <field name="text" type="text_general" indexed="true" stored="true" multiValued="false"/>

And “text_general” is defined to include the filters in both the index and query sections:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish"/>
    </analyzer>
  </fieldType>
Re: [CODE4LIB] indexing word documents using solr [diacritics]
Ah, the wonderful world of character encoding... To quote the Solr wiki:

  There are no known bugs with Solr's character handling, but there have been some reported issues with the way different application servers (and different versions of the same application server) treat incoming and outgoing multibyte characters. In particular, people have reported better success with Tomcat than with Jetty...
  (https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F)

I'd probably start by enabling UTF-8 in Tomcat/Jetty and see if that resolves the issue. If not, I'd check the original files to see what their character encoding is, and then check each application that handles the documents to make sure it's using that encoding. It might be that the original isn't in UTF-8, or if it is, that somewhere along the way the parser, the Perl interface, or some other unknown culprit is attempting to change it.

Regards,
Karl Holten
Systems Integration Specialist
SWITCH Inc
414-382-6711

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric Lease Morgan
Sent: Thursday, February 12, 2015 2:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics]

How do I retain diacritics in a Solr index, and how do I search for words containing them?

I have extracted the plain text out of a set of Word documents. I have then used a Perl interface (WebService::Solr) to add the plain text to a Solr index using a field type called text_general:

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

It seems as if I am unable to search for words like ejecución because the diacritic gets in the way. What am I doing wrong?

-- Eric
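The advice above to check the encoding of the extracted text before it reaches Solr can be sketched in Python as follows. This is a minimal check under stated assumptions, not a full encoding detector; the helper name looks_like_utf8 is hypothetical, and the sample bytes stand in for the contents of an extracted plain-text file.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same word, as it would be stored under two different encodings.
utf8_bytes = "ejecución".encode("utf-8")
latin1_bytes = "ejecución".encode("latin-1")

print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False
```

A clean decode does not prove the file was written as UTF-8 (pure ASCII passes as well), but a failure is a strong hint that some step in the pipeline is using a different encoding.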
Re: [CODE4LIB] indexing word documents using solr
On Feb 10, 2015, at 11:46 AM, Erik Hatcher erikhatc...@mac.com wrote:

> bin/post -c collection_name /path/to/file.doc

The almost trivial command to index a Word document in Solr, above, is most certainly appealing, but I’m wondering about the underlying index’s schema. Tika makes every effort to extract as much metadata from Word documents as possible. This metadata includes dates, titles, authors, names of applications, last edit, etc. Some of this data can be very useful. The metadata can be packaged up as an XML file/stream and then sent to Solr for indexing. “Tastes great. Less filling.”

But my question is, “To what degree does Solr know what to do with the metadata when the (kewl) command, above, is seemingly so generic? Does one need to create a Solr schema to specifically accommodate the Tika-created metadata, or do such things also come for ‘free’?”

-- Eric Morgan
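One way to look at the schema question is through the request parameters of Solr's extracting request handler (Solr Cell), which decide where Tika's output lands. The Python sketch below only builds such a request URL and sends nothing. The host, port, document id, the /update/extract handler path, the "attr_" prefix, and the "text" target field are all assumptions for illustration; the prefix is only useful if the schema defines a matching dynamic field.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, doc_id: str) -> str:
    """Build a Solr Cell request URL (hypothetical helper, sends nothing)."""
    params = {
        "literal.id": doc_id,    # supply the unique key ourselves
        "uprefix": "attr_",      # prefix Tika fields the schema does not know
        "fmap.content": "text",  # map the extracted body text to "text"
        "commit": "true",
    }
    return f"{base_url}/update/extract?{urlencode(params)}"


# Assumed local Solr instance and the collection name from the command above.
print(build_extract_url("http://localhost:8983/solr/collection_name", "doc1"))
```

Under these assumptions, metadata fields that match the schema are indexed as-is, and the rest are caught by the prefixed dynamic field rather than each needing its own field definition.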
[CODE4LIB] indexing word documents using solr
Can somebody point me to a good tutorial on how to index Word documents using Solr? I have a few hundred Microsoft Word documents I want to search. Through the use of the Tika library it seems as if I ought to be able to index my Word documents directly into Solr, but none of the tutorials I have found on the Web are complete. Missing directories. Missing files. Documentation for versions unreleased. Etc. Put another way, Tika can create a (nice) XHTML file complete with some useful metadata that can all be fed to Solr for indexing, but I can barely get out of the starting gate. Have you indexed Word documents using Solr, and if so, then how? — Eric Morgan
Re: [CODE4LIB] indexing word documents using solr
I found this book helped me get my head around Solr:

  https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-beginner%E2%80%99s-guide

Chapter 8 explains indexing rich text formats including MS Word.

Chris Gray
Systems Analyst
519-888-4567, ext. 35764
cpg...@uwaterloo.ca
University of Waterloo

On 15-02-10 11:12 AM, Eric Lease Morgan wrote:

> Can somebody point me to a good tutorial on how to index Word documents using Solr?
>
> I have a few hundred Microsoft Word documents I want to search. Through the use of the Tika library it seems as if I ought to be able to index my Word documents directly into Solr, but none of the tutorials I have found on the Web are complete. Missing directories. Missing files. Documentation for versions unreleased. Etc.
>
> Put another way, Tika can create a (nice) XHTML file complete with some useful metadata that can all be fed to Solr for indexing, but I can barely get out of the starting gate.
>
> Have you indexed Word documents using Solr, and if so, then how?
>
> -- Eric Morgan
Re: [CODE4LIB] indexing word documents using solr
On Feb 10, 2015, at 12:43, Eric Lease Morgan emor...@nd.edu wrote:

> On Feb 10, 2015, at 11:46 AM, Erik Hatcher erikhatc...@mac.com wrote:
>
>> First, with Solr 5, it’s this easy:
>
> Where can I download Solr 5 because none of the other versions seem to be complete. --ELM

It's not yet released but will be in a matter of days. RC2 was generated last night here:

  http://people.apache.org/~anshum/staging_area/lucene-solr-5.0.0-RC2-rev1658469/solr/

Sorry for the tease on Solr 5, that's just where I've been living lately :)

Erik