Wasn't it OK for you to add an indexing_configuration.xml and exclude the binary data from indexing?
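For anyone who has not used this file, here is a rough sketch of what such an indexing_configuration.xml might look like. The node types chosen and the property names `title` and `uniqueKey` are assumptions for illustration only, not taken from either setup in this thread. As I understand it, once a rule matches a node type only the listed properties are indexed, so leaving jcr:data out of the nt:resource rule should keep the binary content out of the index.

```xml
<?xml version="1.0"?>
<!DOCTYPE configuration SYSTEM "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
<!-- Every namespace prefix used in the rules must be declared here. -->
<configuration xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
               xmlns:jcr="http://www.jcp.org/jcr/1.0">

  <!-- Binary holders: index only the MIME type, not jcr:data. -->
  <index-rule nodeType="nt:resource">
    <property>jcr:mimeType</property>
  </index-rule>

  <!-- Hypothetical content nodes: keep the few searchable attributes. -->
  <index-rule nodeType="nt:unstructured">
    <property>title</property>
    <property>uniqueKey</property>
  </index-rule>

</configuration>
```

The file is normally referenced from the SearchIndex element of the workspace configuration through the `indexingConfiguration` parameter, and existing content has to be re-indexed before the change takes effect.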
I tried that and cut the re-indexing time to around 25% of what it was, and presumably got much smaller index files too. I left in the attributes that I may sometimes want to search on, e.g. unique keys and document title.

Ross.

From: Kevin Jansz <[email protected]>
To: [email protected]
Date: 09/03/2011 02:52 PM
Subject: jackrabbit, lucene, tika ... and pdfbox

It's been discussed on this list before, but I'm summarising my latest issues/findings ...

Our use of jackrabbit is for content storage, without the built-in search/querying mechanism. It's possible to leave out the "SearchIndex" definition in the configuration, but you're effectively "breaking" the weak reference handling (used by user-management). That is non-critical, and the repository *seems* to work without it despite logging warnings. But I feel it's better to leave the SearchIndex, and therefore querying, in ... so:

Weak-references
  -> requires SearchIndex / querying
  -> requires lucene (for now, there's no simple alternative)
  -> requires tika (core)
  -> requires various other format handling libraries for different parser implementations

In jackrabbit 2.1.x, if you want custom parsers - or in my case no parsers, and none of the associated overhead and library dependencies - you can't easily do this, as the jackrabbit-core jar includes a tika-config.xml and loads it explicitly (from org\apache\jackrabbit\core\query\lucene\tika-config.xml). The only work-around is to replace this file in the jar file - not ideal.

It's raised in JIRA issues JCR-2642 (& then TIKA-317) that making (very sensible) use of the jar file "Service Provider" mechanism could simplify things: drop a jar file that defines parsers into the classpath and it gets used ... My reading of this was that to get no parsers we'd simply leave tika-parsers-0.8.jar out of the classpath. It also made sense that jackrabbit-core may still include a tika-config.xml to a) use DefaultParser and b) explicitly disable zip and image extraction.
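To make points a) and b) concrete, a minimal tika-config.xml along those lines might look roughly like the sketch below. This is a guess at the shape, not the file shipped in jackrabbit-core: the `<properties>`/`<parsers>` wrapper and the list of MIME types are assumptions, and only the `<parser class="...">` / `<mime>` layout is taken from the file quoted further down.

```xml
<properties>
  <parsers>
    <!-- a) Use whatever parsers are discovered on the classpath;
         with no parser jar present, nothing gets extracted. -->
    <parser class="org.apache.tika.parser.DefaultParser"/>

    <!-- b) Map archive and image types to EmptyParser so they are
         never unpacked or scanned. The MIME list is illustrative. -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>application/zip</mime>
      <mime>image/gif</mime>
      <mime>image/jpeg</mime>
      <mime>image/png</mime>
    </parser>
  </parsers>
</properties>
```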
Unfortunately, on upgrading to 2.2.4, errors about missing pdfbox libraries (when storing PDF content) led me to this in tika-config.xml (in the jackrabbit-core jar file):

    <parser class="org.apache.jackrabbit.core.query.pdf.PDFParser">
      <!-- JCR-2838: Override the faulty PDF parser in Tika 0.8 -->
      <mime>application/pdf</mime>
    </parser>

Looking at JIRA issues JCR-2838 (& then TIKA-548), it's clear there's a problem. I'm not entirely sure why the work-around is in jackrabbit-core. I would have thought that putting this in a xxxxx-parsers-2.2.4.jar with a META-INF/services/... definition would have been the correct way to handle this, to avoid issues of parser/service-provider precedence? Perhaps a separate jar build for this issue would be overkill for a point release?

It's not a huge issue I guess, as it seems that with tika 0.9 (or 0.8.1?) the PDF parser issue will be resolved, in which case I expect the code in org.apache.jackrabbit.core.query.pdf.* will disappear along with the reference to it in tika-config.xml.

In the meantime we're back to having to replace org\apache\jackrabbit\core\query\lucene\tika-config.xml in the jackrabbit-core jar to avoid custom parsers (and errors about their dependencies).

I'm taking the time to mention it here in case it saves someone time, and also to gauge whether our view of lucene, tika and the parsers is incorrect - that is, whether future releases of jackrabbit may still include parsers other than DefaultParser and EmptyParser in their tika-config.xml.

Regards,
Kevin

--
Kevin Jansz
[email protected]
Level 7, 10-16 Queen Street, Melbourne 3000 Australia
Tel +61 3 9621 2773 | Fax +61 3 9621 2776
Exari Systems
Boston | London | Melbourne | Munich
www.exari.com
Test drive our software online - www.exari.com/demo-trial.html
Read our blog on document assembly - blog.exari.com
