I think the issue was related to the "-Dfile.encoding=UTF8" parameter passed as a VM option when Tomcat started.
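
For anyone who wants to compare what the JVM resolves with and without that option, a small check along these lines might help. It is only a sketch: the class name EncodingCheck and the helper canonicalName are made up for illustration, and it does not reproduce the Tika failure, it just prints the encoding settings the JVM ends up with.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of Jackrabbit, Tika or Tomcat.
public class EncodingCheck {

    /** Resolves a charset name or alias (for example "UTF8") to its canonical name. */
    public static String canonicalName(String name) {
        return Charset.forName(name).name();
    }

    public static void main(String[] args) {
        // System property that -Dfile.encoding sets; the JVM picks a value itself when the flag is absent.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        // Charset used by APIs that are not given an explicit charset.
        System.out.println("default charset = " + Charset.defaultCharset().name());
        // "UTF8" is an alias that the JDK maps to the canonical name "UTF-8".
        System.out.println("UTF8 resolves to  " + canonicalName("UTF8"));
    }
}
```

Running it once as "java EncodingCheck" and once as "java -Dfile.encoding=UTF8 EncodingCheck" should show whether the flag changes anything on a given machine.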
On Sat, Nov 5, 2011 at 5:24 PM, Luca Fagioli <[email protected]> wrote:
> Hi,
> I'm new to Jackrabbit, so I've started from the 2.2.9 release.
>
> It is a fresh installation, with a clean Apache Tomcat (6.0.29).
>
> The problem is that an excessive number of documents cannot get indexed.
> The logs report (for a .doc file):
>
> WARN [jackrabbit-pool-3] LazyTextExtractorField.java:180 Failed to
> extract text from a binary property
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
> from org.apache.tika.parser.ParserDecorator$1@34faaa93
>
> In fact, it seems to work with plain text files, but not with binary
> files, like pdf, doc or odt.
>
> So, it seems to be a Tika issue, but it seems strange to me that so many
> documents generate an error. I've tried to build Jackrabbit with
> version 0.10 of Tika, but I got no improvement.
>
>
> The files below are the jar packages available to Tomcat, in the
> $tomcat_install/lib directory:
>
> annotations-api.jar
> asm-3.1.jar
> bcmail-jdk15-1.45.jar
> bcprov-jdk15-1.45.jar
> boilerpipe-1.1.0.jar
> catalina-ant.jar
> catalina-ha.jar
> catalina-tribes.jar
> catalina.jar
> commons-codec-1.4.jar
> commons-collections-3.2.1.jar
> commons-compress-1.1.jar
> commons-dbcp-1.2.2.jar
> commons-fileupload-1.2.1.jar
> commons-httpclient-3.0.jar
> commons-io-1.4.jar
> commons-logging-1.1.1.jar
> commons-pool-1.3.jar
> concurrent-1.3.4.jar
> derby-10.5.3.0_1.jar
> dom4j-1.6.1.jar
> el-api.jar
> fontbox-1.3.1.jar
> geronimo-stax-api_1.0_spec-1.0.1.jar
> jackrabbit-api-2.2.9.jar
> jackrabbit-core-2.2.9.jar
> jackrabbit-jcr-commons-2.2.9.jar
> jackrabbit-jcr-rmi-2.2.9.jar
> jackrabbit-jcr-server-2.2.9.jar
> jackrabbit-jcr-servlet-2.2.9.jar
> jackrabbit-spi-2.2.9.jar
> jackrabbit-spi-commons-2.2.9.jar
> jackrabbit-text-extractors-1.6.5.jar
> jackrabbit-webdav-2.2.9.jar
> jasper-el.jar
> jasper-jdt.jar
> jasper.jar
> jcr-2.0.jar
> jdom-1.0.jar
> jempbox-1.3.1.jar
> jsp-api.jar
> log4j-over-slf4j-1.5.11.jar
> logback-classic-0.9.20.jar
> logback-core-0.9.20.jar
> lucene-core-2.4.1.jar
> metadata-extractor-2.4.0-beta-1.jar
> mysql-connector-java-5.1.18-bin.jar
> netcdf-4.2-min.jar
> pdfbox-1.3.1.jar
> poi-3.7.jar
> poi-ooxml-3.7.jar
> poi-ooxml-schemas-3.7.jar
> poi-scratchpad-3.7.jar
> rome-0.9.jar
> servlet-api.jar
> slf4j-api-1.5.11.jar
> tagsoup-1.2.jar
> tika-app-0.10.jar (after the 0.10 substitution, I got rid of the 0.8
> runtime library)
> tomcat-coyote.jar
> tomcat-dbcp.jar
> tomcat-i18n-es.jar
> tomcat-i18n-fr.jar
> tomcat-i18n-ja.jar
> xmlbeans-2.3.0.jar
>
> And this is my repository.xml configuration file, stripped of the
> non-relevant (I think) parts:
>
> <?xml version="1.0"?>
>
> <!DOCTYPE Repository
>     PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 2.0//EN"
>     "http://jackrabbit.apache.org/dtd/repository-2.0.dtd">
>
> <Repository>
>     <FileSystem class="org.apache.jackrabbit.core.fs.db.DbFileSystem">
>         <!-- hidden configuration -->
>     </FileSystem>
>
>     <DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
>         <!-- hidden configuration -->
>     </DataStore>
>
>     <Security appName="Jackrabbit">
>         <!-- hidden configuration -->
>     </Security>
>
>     <Workspaces rootPath="${rep.home}/workspaces"
>         defaultWorkspace="default"/>
>
>     <Workspace name="${wsp.name}">
>
>         <!-- hidden configuration -->
>
>         <SearchIndex
>             class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>             <param name="path" value="${wsp.home}/index"/>
>             <param name="textFilterClasses"
>                 value="org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
>             <param name="extractorPoolSize" value="2"/>
>             <param name="supportHighlighting" value="true"/>
>         </SearchIndex>
>
>     </Workspace>
>
>     <Versioning rootPath="${rep.home}/version">
>         <!-- hidden configuration -->
>     </Versioning>
>
>     <SearchIndex
>         class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>         <param name="path" value="${rep.home}/repository/index"/>
>         <param name="textFilterClasses"
>             value="org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
>         <param name="extractorPoolSize" value="2"/>
>         <param name="supportHighlighting" value="true"/>
>     </SearchIndex>
> </Repository>
>
>
>
> What am I doing wrong?
>
>
> Thanks,
> Luca
>

--
Luca Fagioli
Montezuma Interactive
Tel: +39 340 340 69 72
http://www.montezuma.it
