Hi Claus,
I switched off PDF parsing following your advice:
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="tikaConfigPath" value="${wsp.home}/tika-config.xml"/>
where tika-config.xml contains:
<parsers>
  <parser class="org.apache.tika.parser.DefaultParser"/>
  <parser class="org.apache.tika.parser.EmptyParser">
    <mime>application/pdf</mime>
  </parser>
</parsers>
Does this mean I still did something wrong?
Regards,
Anton
Hi Anton,
It seems that you are indexing the PDF file as full text?
org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:530)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:172)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:878)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:843)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:74)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
I think you have disabled it?
Indexing huge PDF files takes a lot of time and memory :-)
greets
claus