Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-20 Thread Eric Lease Morgan
On Feb 16, 2015, at 4:54 PM, Levy, Michael ml...@ushmm.org wrote:

 I think you can accomplish what you want by using ICUFoldingFilterFactory
 https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUFoldingFilterFactory
 
 which should simply perform ICU (cf http://site.icu-project.org/) based 
 character folding (cf. http://www.unicode.org/reports/tr30/tr30-4.html)
 
 In schema.xml I generally have in both index and query:
 
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.ICUFoldingFilterFactory" />
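
To see what character folding buys you, here is a minimal Python sketch. It is 
only a rough stand-in for ICU folding, not the algorithm Solr runs: it 
decomposes each character, drops the combining marks, and case-folds the rest. 
The function name fold is made up for illustration.

```python
import unicodedata


def fold(text: str) -> str:
    """Approximate accent and case folding (NOT the real ICU folding rules)."""
    # NFKD splits a character such as "ó" into "o" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep the base characters and drop the combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold so that upper- and lower-case forms compare equal.
    return stripped.casefold()


if __name__ == "__main__":
    for word in ("ejecución", "EJECUCIÓN", "ejecucion"):
        print(word, "->", fold(word))
```

When the same folding is applied at index time and at query time, a search 
typed with or without the accent reduces to the same token.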


For unknown reasons, I was unable to load the ICUFoldingFilterFactory, but 
nonetheless, my interface works as expected. And I was able to do this after a 
combination of things. First, I needed to tell the indexer my content was 
Spanish, and after doing so, Solr parses things correctly. Second, I needed to 
explicitly tell my Web browser that the search form and returned content were 
using UTF-8. This was done in the HTTP Content-Type header, the HTML meta tag, and 
even in the HTML form. Geesh! Through this whole process I also learned about 
Solr’s edismax (extended dismax) handler. Edismax supports free form queries as 
well as Boolean logic.  solr++  But also solr+- because Solr is getting more 
and more and more complicated. —Eric “Lost In Chicago” Morgan
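
For the request side of this, a minimal Python sketch of building an edismax 
query URL with the query text percent-encoded as UTF-8. The host, port, and 
collection name in the demo are assumptions, as is searching the "text" field 
through qf; defType and qf are standard Solr query parameters. The helper name 
build_query_url is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(base_url: str, user_query: str) -> str:
    """Return a Solr select URL that runs user_query through edismax."""
    params = {
        "q": user_query,       # free-form or Boolean query text
        "defType": "edismax",  # select the extended dismax query parser
        "qf": "text",          # assumed: search the "text" field
        "wt": "json",
    }
    # Encode non-ASCII characters as UTF-8 bytes before percent-escaping.
    return base_url + "?" + urlencode(params, encoding="utf-8")


if __name__ == "__main__":
    # Assumed local Solr URL; adjust to the real installation.
    url = build_query_url(
        "http://localhost:8983/solr/collection_name/select",
        "ejecución AND ley",
    )
    print(url)
    # Round-trip check: the server should see the original query text.
    print(parse_qs(urlsplit(url).query)["q"])
```

This covers only the outgoing query. The response headers, the meta tag, and 
the form still have to declare UTF-8 separately, as described above.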


Re: [CODE4LIB] indexing word documents using solr [diacritics, resolved (i think) ]

2015-02-16 Thread Eric Lease Morgan
I know the documents I’m indexing are written in Spanish, and by adding the 
following filters to my field definition, I believe I have resolved my problem:

  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />

In other words, my searchable content is defined thus:

  <field name="text" type="text_general" indexed="true" stored="true" 
         multiValued="false" />

And “text_general” is defined to include the filters in both the index and 
query sections:

  <fieldType name="text_general" class="solr.TextField" 
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" 
              words="stopwords.txt" />
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.StopFilterFactory" ignoreCase="true" 
              words="stopwords.txt" />
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
              ignoreCase="true" expand="true" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.SnowballPorterFilterFactory" language="Spanish" />
    </analyzer>
  </fieldType>


Re: [CODE4LIB] indexing word documents using solr [diacritics]

2015-02-12 Thread Karl Holten
Ah, the wonderful world of character encoding...

To quote the Solr wiki:
There are no known bugs with Solr's character handling, but there have been 
some reported issues with the way different application servers (and different 
versions of the same application server) treat incoming and outgoing multibyte 
characters. In particular, people have reported better success with Tomcat than 
with Jetty... 
(https://wiki.apache.org/solr/FAQ#Why_don.27t_International_Characters_Work.3F )

I'd probably start by enabling UTF-8 in Tomcat/Jetty and see if that resolves 
the issue. 

If not, I'd check the original files to see what their character encoding is, and 
then check each application that handles the documents to make sure it's using 
that encoding. It might be that the original isn't in UTF-8, or if it is, that 
somewhere along the way the parser, the perl interface, or some other unknown 
culprit is attempting to change it.
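
One way to run that first check is the Python sketch below, which reports 
which extracted text files are not valid UTF-8. It assumes the plain text has 
been saved as .txt files in a single directory; the function names and the 
default file pattern are made up for illustration. Note that a file that 
decodes cleanly as UTF-8 is only probably UTF-8, while a file that fails is 
definitely something else.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8(directory: str, pattern: str = "*.txt") -> list[str]:
    """List files under directory (assumed layout) that are not valid UTF-8."""
    return [
        str(path)
        for path in sorted(Path(directory).glob(pattern))
        if not is_valid_utf8(path.read_bytes())
    ]


if __name__ == "__main__":
    # Check the current directory; pass the real location of the text files.
    for name in find_non_utf8("."):
        print("not UTF-8:", name)
```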

Regards,
Karl Holten
Systems Integration Specialist
SWITCH Inc
414-382-6711

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Lease Morgan
Sent: Thursday, February 12, 2015 2:38 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] indexing word documents using solr [diacritics]

How do I retain diacritics in a Solr index, and how do I search for words 
containing them?

I have extracted the plain text out of a set of Word documents. I have then used 
a Perl interface (WebService::Solr) to add the plain text to a Solr index using 
a field type called text_general:

<fieldType name="text_general" class="solr.TextField" 
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
            ignoreCase="true" expand="true" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

It seems as if I am unable to search for words like ejecución because the 
diacritic gets in the way. What am I doing wrong?

— 
Eric


Re: [CODE4LIB] indexing word documents using solr

2015-02-11 Thread Eric Lease Morgan
On Feb 10, 2015, at 11:46 AM, Erik Hatcher erikhatc...@mac.com wrote:

 bin/post -c collection_name /path/to/file.doc

The almost trivial command to index a Word document in Solr, above, is most 
certainly appealing, but I’m wondering about the underlying index’s schema.

Tika makes every effort to extract as much metadata from Word documents as 
possible. This metadata includes dates, titles, authors, names of applications, 
last edit, etc. Some of this data can be very useful. The metadata can be 
packaged up as an XML file/stream and then sent to Solr for indexing. “Tastes 
great. Less filling.” But my question is, “To what degree does Solr know what 
to do with the metadata when the (kewl) command, above, is seemingly so 
generic? Does one need to create a Solr schema to specifically accommodate the 
Tika-created metadata, or do such things also come for ‘free’?”
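
One possible way to handle the metadata, sketched in Python under the 
assumption that Solr's ExtractingRequestHandler (Solr Cell) is enabled at 
/update/extract. The parameters literal.id, uprefix, and fmap.content belong 
to that handler; the host, collection name, document id, the attr_ prefix 
(which needs a matching dynamic field in the schema), and the "text" target 
field are all assumptions. The helper name build_extract_url is hypothetical.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, doc_id: str) -> str:
    """Return an /update/extract URL that maps Tika output onto schema fields."""
    params = {
        "literal.id": doc_id,    # supply the unique key for this document
        "uprefix": "attr_",      # prefix metadata fields the schema does not define
        "fmap.content": "text",  # route the extracted body into the "text" field
        "commit": "true",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr URL and made-up document id.
    print(build_extract_url(
        "http://localhost:8983/solr/collection_name/update/extract",
        "doc-1",
    ))
```

The Word file itself would then be sent as the body of a POST to that URL. 
With a prefix like this, metadata the schema does not name can still be kept, 
provided a dynamic field such as attr_* exists to receive it.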

— 
Eric Morgan


[CODE4LIB] indexing word documents using solr

2015-02-10 Thread Eric Lease Morgan
Can somebody point me to a good tutorial on how to index Word documents using 
Solr?

I have a few hundred Microsoft Word documents I want to search. Through the use 
of the Tika library it seems as if I ought to be able to index my Word 
documents directly into Solr, but none of the tutorials I have found on the Web 
are complete. Missing directories. Missing files. Documentation for versions 
unreleased. Etc.

Put another way, Tika can create a (nice) XHTML file complete with some useful 
metadata that can all be fed to Solr for indexing, but I can barely get out of 
the starting gate. Have you indexed Word documents using Solr, and if so, then 
how? 

—
Eric Morgan


Re: [CODE4LIB] indexing word documents using solr

2015-02-10 Thread Chris Gray
I found this book helped me get my head around Solr: 
https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-beginner%E2%80%99s-guide.


Chapter 8 explains indexing rich text formats including MS Word.

Chris Gray
Systems Analyst
519-888-4567, ext. 35764
cpg...@uwaterloo.ca
University of Waterloo



Re: [CODE4LIB] indexing word documents using solr

2015-02-10 Thread Erik Hatcher
 On Feb 10, 2015, at 12:43, Eric Lease Morgan emor...@nd.edu wrote:
 
 On Feb 10, 2015, at 11:46 AM, Erik Hatcher erikhatc...@mac.com wrote:
 
 First, with Solr 5, it’s this easy:
 
  Where can I download Solr 5 because none of the other versions seem to be 
 complete. —ELM

It's not yet released but will be in a matter of days. RC2 was generated last 
night here: 
http://people.apache.org/~anshum/staging_area/lucene-solr-5.0.0-RC2-rev1658469/solr/

Sorry for the tease on Solr 5, that's just where I've been living lately :)

Erik