Hi,
I'm new to Jackrabbit, so I started with the 2.2.9 release.
It is a fresh installation on a clean Apache Tomcat (6.0.29).
The problem is that a very large number of documents fail to be indexed.
The logs report (for a .doc file):
WARN [jackrabbit-pool-3] LazyTextExtractorField.java:180 Failed to extract text from a binary property
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.ParserDecorator$1@34faaa93
In fact, indexing seems to work with plain text files, but not with binary
files such as pdf, doc or odt.
So it looks like a Tika issue, but it seems strange to me that so many
documents generate an error. I tried building Jackrabbit against
Tika 0.10, but saw no improvement.
These are the jar files available to Tomcat in the
$tomcat_install/lib directory:
annotations-api.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
boilerpipe-1.1.0.jar
catalina-ant.jar
catalina-ha.jar
catalina-tribes.jar
catalina.jar
commons-codec-1.4.jar
commons-collections-3.2.1.jar
commons-compress-1.1.jar
commons-dbcp-1.2.2.jar
commons-fileupload-1.2.1.jar
commons-httpclient-3.0.jar
commons-io-1.4.jar
commons-logging-1.1.1.jar
commons-pool-1.3.jar
concurrent-1.3.4.jar
derby-10.5.3.0_1.jar
dom4j-1.6.1.jar
el-api.jar
fontbox-1.3.1.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
jackrabbit-api-2.2.9.jar
jackrabbit-core-2.2.9.jar
jackrabbit-jcr-commons-2.2.9.jar
jackrabbit-jcr-rmi-2.2.9.jar
jackrabbit-jcr-server-2.2.9.jar
jackrabbit-jcr-servlet-2.2.9.jar
jackrabbit-spi-2.2.9.jar
jackrabbit-spi-commons-2.2.9.jar
jackrabbit-text-extractors-1.6.5.jar
jackrabbit-webdav-2.2.9.jar
jasper-el.jar
jasper-jdt.jar
jasper.jar
jcr-2.0.jar
jdom-1.0.jar
jempbox-1.3.1.jar
jsp-api.jar
log4j-over-slf4j-1.5.11.jar
logback-classic-0.9.20.jar
logback-core-0.9.20.jar
lucene-core-2.4.1.jar
metadata-extractor-2.4.0-beta-1.jar
mysql-connector-java-5.1.18-bin.jar
netcdf-4.2-min.jar
pdfbox-1.3.1.jar
poi-3.7.jar
poi-ooxml-3.7.jar
poi-ooxml-schemas-3.7.jar
poi-scratchpad-3.7.jar
rome-0.9.jar
servlet-api.jar
slf4j-api-1.5.11.jar
tagsoup-1.2.jar
tika-app-0.10.jar (after switching to 0.10, I removed the 0.8
runtime library)
tomcat-coyote.jar
tomcat-dbcp.jar
tomcat-i18n-es.jar
tomcat-i18n-fr.jar
tomcat-i18n-ja.jar
xmlbeans-2.3.0.jar
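Since tika-app is a bundled jar, it may ship classes that are also present in the standalone jars above (POI, PDFBox and so on). I have not confirmed that this is the cause. A check along the following lines could show whether a given class is on the classpath more than once. This is only a sketch using the standard java.util.jar API; the class JarClassFinder and its method are made-up names for illustration, not part of Jackrabbit or Tika.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper (not part of Jackrabbit or Tika): lists the jars in a
// directory that contain a given class, to spot classes shipped more than once.
class JarClassFinder {

    // Returns the sorted file names of all *.jar files in libDir that contain className.
    static List<String> findJarsContaining(Path libDir, String className) throws IOException {
        // A class a.b.C is stored inside a jar as the entry a/b/C.class
        String entryName = className.replace('.', '/') + ".class";
        List<String> matches = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getEntry(entryName) != null) {
                        matches.add(jar.getFileName().toString());
                    }
                }
            }
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java JarClassFinder <lib-dir> <fully.qualified.ClassName>");
            return;
        }
        for (String jar : findJarsContaining(Paths.get(args[0]), args[1])) {
            System.out.println(jar);
        }
    }
}
```

For example, it could be run with $tomcat_install/lib as the first argument and org.apache.tika.parser.ParserDecorator (the class named in the log above) as the second. More than one line of output would mean that class is provided by several jars.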
And this is my repository.xml configuration file, with the parts I believe
are not relevant stripped out:
<?xml version="1.0"?>
<!DOCTYPE Repository
    PUBLIC "-//The Apache Software Foundation//DTD Jackrabbit 2.0//EN"
    "http://jackrabbit.apache.org/dtd/repository-2.0.dtd">
<Repository>
  <FileSystem class="org.apache.jackrabbit.core.fs.db.DbFileSystem">
    <!-- hidden configuration -->
  </FileSystem>
  <DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
    <!-- hidden configuration -->
  </DataStore>
  <Security appName="Jackrabbit">
    <!-- hidden configuration -->
  </Security>
  <Workspaces rootPath="${rep.home}/workspaces"
      defaultWorkspace="default"/>
  <Workspace name="${wsp.name}">
    <!-- hidden configuration -->
    <SearchIndex
        class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
      <param name="path" value="${wsp.home}/index"/>
      <param name="textFilterClasses"
          value="org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
      <param name="extractorPoolSize" value="2"/>
      <param name="supportHighlighting" value="true"/>
    </SearchIndex>
  </Workspace>
  <Versioning rootPath="${rep.home}/version">
    <!-- hidden configuration -->
  </Versioning>
  <SearchIndex
      class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
    <param name="path" value="${rep.home}/repository/index"/>
    <param name="textFilterClasses"
        value="org.apache.jackrabbit.extractor.PlainTextExtractor,org.apache.jackrabbit.extractor.MsWordTextExtractor,org.apache.jackrabbit.extractor.MsExcelTextExtractor,org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,org.apache.jackrabbit.extractor.PdfTextExtractor,org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,org.apache.jackrabbit.extractor.RTFTextExtractor,org.apache.jackrabbit.extractor.HTMLTextExtractor,org.apache.jackrabbit.extractor.XMLTextExtractor"/>
    <param name="extractorPoolSize" value="2"/>
    <param name="supportHighlighting" value="true"/>
  </SearchIndex>
</Repository>
What am I doing wrong?
Thanks,
Luca