Re: jackrabbit, lucene, tika ... and pdfbox

Jukka Zitting Thu, 10 Mar 2011 01:28:03 -0800

Hi,

On 03/09/2011 04:51 AM, Kevin Jansz wrote:

It's not a huge issue I guess as it seems with tika 0.9 (or 0.8.1?)
the PDF parser issue will be resolved in which case I expect the
code in org.apache.jackrabbit.core.query.pdf.* will disappear along
with reference to it from the tika-config.xml.


Yes, that's what we've already done in trunk.

I'm taking the time to mention it here in case it saves someone time
and also to gauge if our view of lucene, tika and the parsers is
incorrect - that future releases of jackrabbit may still include
parsers other than DefaultParser and EmptyParser in its
tika-config.xml.

Your view is correct. The idea is to avoid direct parser class
references in jackrabbit-core and just rely on the service provider
loader mechanism in Tika to pick up all the available parsers.
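
To illustrate the general idea, here is a minimal sketch of service provider discovery using the standard java.util.ServiceLoader. This is not Tika's or Jackrabbit's actual code: the DocumentParser interface and the class name ParserDiscoverySketch are hypothetical stand-ins for illustration only. The point is that the calling code names no concrete parser class and simply uses whatever implementations are registered on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ParserDiscoverySketch {

    /** Hypothetical stand-in for a parser interface; not a Tika type. */
    public interface DocumentParser {
        String name();
    }

    /**
     * Collects the class names of all DocumentParser implementations that
     * are registered through META-INF/services provider files on the
     * classpath. No concrete parser class is referenced here.
     */
    static List<String> discoverParserNames() {
        List<String> names = new ArrayList<>();
        for (DocumentParser parser : ServiceLoader.load(DocumentParser.class)) {
            names.add(parser.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = discoverParserNames();
        System.out.println("Discovered " + names.size() + " parser(s): " + names);
    }
}
```

With no provider file on the classpath the list is simply empty, which corresponds to a deployment that ships no parser libraries. Adding a jar that registers an implementation makes it show up without any change to the calling code.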

We also decided to move the tika-parsers dependency from jackrabbit-core
to deployment packages like jackrabbit-webapp and jackrabbit-standalone.
This should make it even easier for people to set up custom deployments
with few or no parser libraries.


--
Jukka Zitting
