Re: jackrabbit, lucene, tika ... and pdfbox [SEC=UNCLASSIFIED]

Ross . Dyson Tue, 08 Mar 2011 20:05:41 -0800

Wasn't it an option for you to add an indexing_configuration.xml and
exclude the binary data from indexing?


I tried that and it cut the re-indexing time to around 25% of what it was,
presumably with much smaller index files too.  I left in the attributes
that I may sometimes want to search on, e.g. unique keys and document title.
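
Roughly along these lines (a sketch only, assuming the file content is
stored on nt:resource nodes; the two properties listed are illustrative
choices, not a recommendation). As far as I understand it, once an
index-rule matches a node type only the properties it lists are indexed,
so leaving jcr:data out of the rule keeps the binary content out of the
index:

    <?xml version="1.0"?>
    <!DOCTYPE configuration SYSTEM
        "http://jackrabbit.apache.org/dtd/indexing-configuration-1.0.dtd">
    <!-- every namespace prefix used below must be declared here -->
    <configuration xmlns:jcr="http://www.jcp.org/jcr/1.0"
                   xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
      <!-- jcr:data is deliberately not listed, so it is not indexed -->
      <index-rule nodeType="nt:resource">
        <property>jcr:mimeType</property>
        <property>jcr:lastModified</property>
      </index-rule>
    </configuration>

The file is picked up through the "indexingConfiguration" param of the
SearchIndex element in the workspace configuration, and existing content
needs a re-index before the change takes effect. Attributes such as
unique keys or titles would be listed the same way, under whichever node
type holds them.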

Ross.



From:   Kevin Jansz <[email protected]>
To:     [email protected]
Date:   09/03/2011 02:52 PM
Subject:        jackrabbit, lucene, tika ... and pdfbox



It's been discussed on this list before but I'm summarising my latest
issues/findings ...

Our use of jackrabbit is for content storage without the built-in
search/querying mechanism. It's possible to leave out the
"SearchIndex" definition in the configuration, but you're effectively
"breaking" the weak reference handling (used by user-management). That's
non-critical, and the repository *seems* to work without it despite
logging warnings. But I feel it's better to leave the SearchIndex, and
therefore querying, in ... so:

Weak-references
-> requires SearchIndex / querying
    -> requires lucene (for now, there's no simple alternative)
        -> requires tika (core)
            -> requires various other format handling libraries
               for different parser implementations

In jackrabbit 2.1.x if you want custom parsers - or in my case no
parsers and the associated overhead and library dependence - you can't
easily do this as the jackrabbit-core jar includes a tika-config.xml
and loads this explicitly (from
org\apache\jackrabbit\core\query\lucene\tika-config.xml). The only
work-around is to replace this file in the jar file - not ideal.

It's raised in jiras JCR-2642 (& then TIKA-317) that making (very
sensible) use of the jar file "Service Provider" mechanism could
simplify things. Drop a jar file into the classpath that defines
parsers and this gets used ... my reading of this was that to get no
parsers we'd simply leave out tika-parsers-0.8.jar from the classpath.
It also made sense that the jackrabbit-core may still include a
tika-config.xml to a) use DefaultParser b) explicitly disable zip and
image extraction. Unfortunately, on upgrading to 2.2.4 errors about
missing pdfbox libraries (when storing PDF content) led me to this in
tika-config.xml (in the jackrabbit-core jar file):
    <parser class="org.apache.jackrabbit.core.query.pdf.PDFParser">
      <!-- JCR-2838: Override the faulty PDF parser in Tika 0.8 -->
      <mime>application/pdf</mime>
    </parser>

Looking at jiras JCR-2838 (& then TIKA-548) it's clear there's a
problem. I'm not entirely sure why the work around is in
jackrabbit-core. I would have thought putting this in a
xxxxx-parsers-2.2.4.jar with a META-INF/services/... definition would
have been the correct way to handle this? To avoid issues of
parser/service-provider precedence? Perhaps a separate jar-build for
this issue would be overkill for a point release?

It's not a huge issue I guess as it seems with tika 0.9 (or 0.8.1?)
the PDF parser issue will be resolved in which case I expect the code
in org.apache.jackrabbit.core.query.pdf.* will disappear along with
reference to it from the tika-config.xml. In the meantime we're back
to having to replace
org\apache\jackrabbit\core\query\lucene\tika-config.xml in the
jackrabbit-core to avoid custom parsers (and errors about their
dependencies). I'm taking the time to mention it here in case it saves
someone time and also to gauge if our view of lucene, tika and the
parsers is incorrect - that future releases of jackrabbit may still
include parsers other than DefaultParser and EmptyParser in its
tika-config.xml.
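
For reference, a minimal sketch of the kind of replacement
tika-config.xml described above, assuming the usual Tika config layout
of a <parsers> list inside <properties>. The two mime types are only
illustrative stand-ins for the zip and image types; this is not the
file shipped in jackrabbit-core:

    <properties>
      <parsers>
        <!-- With no tika-parsers jar on the classpath, DefaultParser
             should find no service-provider parsers to delegate to -->
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <!-- Explicitly skip extraction for these types -->
        <parser class="org.apache.tika.parser.EmptyParser">
          <mime>application/zip</mime>
          <mime>image/jpeg</mime>
        </parser>
      </parsers>
    </properties>

Without the PDFParser entry nothing in this file refers to pdfbox, so
storing PDF content should not trigger the missing-library errors.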

Regards,
Kevin

--
Kevin Jansz
[email protected]
Level 7, 10-16 Queen Street, Melbourne 3000 Australia
Tel +61 3 9621 2773 | Fax +61 3 9621 2776
Exari Systems
Boston | London | Melbourne | Munich
www.exari.com

