Re: LazyTextExtractorField.java:180 Failed to extract text from a binary property

Luca Fagioli Wed, 09 Nov 2011 08:06:56 -0800

I think the issue was related to the "-Dfile.encoding=UTF8" parameter
passed as a VM option when Tomcat started.
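
To see which encoding a JVM actually picks up from that option, a minimal
sketch like the following could be run with the same VM options as Tomcat.
The class name EncodingCheck is hypothetical and not part of Jackrabbit or
Tika; it only reports the JVM's settings and does not confirm the cause of
the extraction failures.

```java
import java.nio.charset.Charset;

// Hypothetical helper: reports the encoding settings of the running JVM.
public class EncodingCheck {

    // Builds a one-line summary of the file.encoding system property
    // and the default charset the JVM resolved at startup.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with and once without -Dfile.encoding=UTF8 shows whether
the option changes what the JVM reports.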


On Sat, Nov 5, 2011 at 5:24 PM, Luca Fagioli <[email protected]> wrote:

> Hi,
> I'm new to Jackrabbit, so I've started from the 2.2.9 release.
>
> It is a fresh installation, with a clean Apache Tomcat (6.0.29).
>
> The problem is that an excessive number of documents cannot be indexed.
> The logs report (for a .doc file):
>
> WARN  [jackrabbit-pool-3] LazyTextExtractorField.java:180 Failed to
> extract text from a binary property
> org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException
> from org.apache.tika.parser.ParserDecorator$1@34faaa93
>
> In fact, it seems to work with plain text files, but not with binary
> files such as pdf, doc or odt.
>
> So it looks like a Tika issue, but it seems strange to me that so many
> documents generate an error. I've tried building Jackrabbit with
> version 0.10 of Tika, but I got no improvement.
>
>
> The files below are the jar packages available to Tomcat, in the
> $tomcat_install/lib directory:
>
> annotations-api.jar
> asm-3.1.jar
> bcmail-jdk15-1.45.jar
> bcprov-jdk15-1.45.jar
> boilerpipe-1.1.0.jar
> catalina-ant.jar
> catalina-ha.jar
> catalina-tribes.jar
> catalina.jar
> commons-codec-1.4.jar
> commons-collections-3.2.1.jar
> commons-compress-1.1.jar
> commons-dbcp-1.2.2.jar
> commons-fileupload-1.2.1.jar
> commons-httpclient-3.0.jar
> commons-io-1.4.jar
> commons-logging-1.1.1.jar
> commons-pool-1.3.jar
> concurrent-1.3.4.jar
> derby-10.5.3.0_1.jar
> dom4j-1.6.1.jar
> el-api.jar
> fontbox-1.3.1.jar
> geronimo-stax-api_1.0_spec-1.0.1.jar
> jackrabbit-api-2.2.9.jar
> jackrabbit-core-2.2.9.jar
> jackrabbit-jcr-commons-2.2.9.jar
> jackrabbit-jcr-rmi-2.2.9.jar
> jackrabbit-jcr-server-2.2.9.jar
> jackrabbit-jcr-servlet-2.2.9.jar
> jackrabbit-spi-2.2.9.jar
> jackrabbit-spi-commons-2.2.9.jar
> jackrabbit-text-extractors-1.6.5.jar
> jackrabbit-webdav-2.2.9.jar
> jasper-el.jar
> jasper-jdt.jar
> jasper.jar
> jcr-2.0.jar
> jdom-1.0.jar
> jempbox-1.3.1.jar
> jsp-api.jar
> log4j-over-slf4j-1.5.11.jar
> logback-classic-0.9.20.jar
> logback-core-0.9.20.jar
> lucene-core-2.4.1.jar
> metadata-extractor-2.4.0-beta-1.jar
> mysql-connector-java-5.1.18-bin.jar
> netcdf-4.2-min.jar
> pdfbox-1.3.1.jar
> poi-3.7.jar
> poi-ooxml-3.7.jar
> poi-ooxml-schemas-3.7.jar
> poi-scratchpad-3.7.jar
> rome-0.9.jar
> servlet-api.jar
> slf4j-api-1.5.11.jar
> tagsoup-1.2.jar
> tika-app-0.10.jar (after the 0.10 substitution, I got rid of the 0.8
> runtime library)
> tomcat-coyote.jar
> tomcat-dbcp.jar
> tomcat-i18n-es.jar
> tomcat-i18n-fr.jar
> tomcat-i18n-ja.jar
> xmlbeans-2.3.0.jar
>
> And this is my repository.xml configuration file, stripped of the parts
> that I think are not relevant:
>
> <?xml version="1.0"?>
>
> <!DOCTYPE Repository
>           PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 2.0//EN"
>           "http://jackrabbit.apache.org/dtd/repository-2.0.dtd">
>
> <Repository>
>     <FileSystem class="org.apache.jackrabbit.core.fs.db.DbFileSystem">
>             <!-- hidden configuration -->
>     </FileSystem>
>
>     <DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
>         <!-- hidden configuration -->
>     </DataStore>
>
>     <Security appName="Jackrabbit">
>         <!-- hidden configuration -->
>     </Security>
>
>     <Workspaces rootPath="${rep.home}/workspaces"
> defaultWorkspace="default"/>
>
>     <Workspace name="${wsp.name}">
>
>         <!-- hidden configuration -->
>
>         <SearchIndex
> class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>             <param name="path" value="${wsp.home}/index"/>
>             <param name="textFilterClasses"
> value="org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
>             <param name="extractorPoolSize" value="2"/>
>             <param name="supportHighlighting" value="true"/>
>         </SearchIndex>
>
>     </Workspace>
>
>     <Versioning rootPath="${rep.home}/version">
>         <!-- hidden configuration -->
>     </Versioning>
>
>     <SearchIndex
> class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>         <param name="path" value="${rep.home}/repository/index"/>
>         <param name="textFilterClasses"
> value="org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
>         <param name="extractorPoolSize" value="2"/>
>         <param name="supportHighlighting" value="true"/>
>     </SearchIndex>
> </Repository>
>
>
>
> What am I doing wrong?
>
>
> Thanks,
> Luca
>



-- 
Luca Fagioli
Montezuma Interactive
Tel: +39 340 340 69 72
http://www.montezuma.it
