Hi,

On 03/09/2011 04:51 AM, Kevin Jansz wrote:
It's not a huge issue, I guess, as it seems that with Tika 0.9 (or 0.8.1?) the PDF parser issue will be resolved, in which case I expect the code in org.apache.jackrabbit.core.query.pdf.* will disappear along with the reference to it in tika-config.xml.
Yes, that's what we've already done in trunk.
I'm taking the time to mention it here in case it saves someone time, and also to gauge whether our view of Lucene, Tika and the parsers is incorrect - that is, whether future releases of Jackrabbit may still include parsers other than DefaultParser and EmptyParser in its tika-config.xml.
Your view is correct. The idea is to avoid direct parser class references in jackrabbit-core and just rely on the service provider loader mechanism in Tika to pick up all the available parsers.
We also decided to move the tika-parsers dependency from jackrabbit-core to deployment packages like jackrabbit-webapp and jackrabbit-standalone. This should make it even easier for people to set up custom deployments with few or no parser libraries.
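For anyone unfamiliar with the service provider idea, here is a rough sketch of the general pattern using the JDK's own java.util.ServiceLoader. This is not Tika's actual code: the DocumentParser interface and the ParserDiscovery class are made up for illustration, and Tika has its own loader and its own Parser interface. The point is only that implementations are found through META-INF/services entries in whatever jars are on the classpath, so the core code never names a concrete parser class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ParserDiscovery {

    /** Hypothetical stand-in for a parser interface; not Tika's real Parser API. */
    public interface DocumentParser {
        String name();
    }

    /**
     * Collects every DocumentParser implementation registered on the classpath
     * through a META-INF/services provider file. With no parser jars present,
     * the returned list is simply empty.
     */
    public static List<DocumentParser> discover() {
        List<DocumentParser> found = new ArrayList<>();
        for (DocumentParser parser : ServiceLoader.load(DocumentParser.class)) {
            found.add(parser);
        }
        return found;
    }

    public static void main(String[] args) {
        List<DocumentParser> parsers = discover();
        System.out.println("Discovered " + parsers.size() + " parser(s)");
        for (DocumentParser parser : parsers) {
            System.out.println(" - " + parser.name());
        }
    }
}
```

Under this pattern, a deployment that ships no parser jars just ends up with an empty parser list, and adding a jar that registers a provider makes it available without touching the core code.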
-- Jukka Zitting
