At the moment I'm testing with a mix of documents, mostly PDF and OpenOffice,
that previously worked. The relevant code is essentially this:
AutoDetectParser parser = new AutoDetectParser();
Metadata md = new Metadata();
CustomContentHandler handler = new CustomContentHandler();
parser.parse(Files.newInputStream(Paths.get("myfile")), handler, md);
In the CustomContentHandler (which extends SAX's DefaultHandler) I implement
the startElement and endElement events and log each one. Previously, I would
get a number of "a", "div", and other HTML tag events. Since 2.1.0, I only get
"html", "head", a few "meta" tags, "title", and "body".
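
For reference, here is a minimal sketch of the kind of handler described above.
The class name LoggingContentHandler and its parse helper are hypothetical, not
the actual code, and the JDK's own SAX parser stands in for Tika here so that
the snippet runs without extra dependencies. It records every start/end event
and counts <div class="page"> elements:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the CustomContentHandler mentioned in this message.
class LoggingContentHandler extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private int pageCount = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware, so fall back to qName.
        String name = localName.isEmpty() ? qName : localName;
        events.add("start:" + name);
        // Page boundaries are expected to arrive as <div class="page">.
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pageCount++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        events.add("end:" + name);
    }

    List<String> getEvents() {
        return events;
    }

    int getPageCount() {
        return pageCount;
    }

    // Assumption: feeds an XHTML string through the JDK SAX parser instead of Tika,
    // purely to exercise the handler in isolation.
    static LoggingContentHandler parse(String xhtml) throws Exception {
        LoggingContentHandler handler = new LoggingContentHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }
}
```

Running LoggingContentHandler.parse(...) on the --xml output from the "app"
should show whether the handler logic itself still reacts to the "div" and "a"
events, independently of how the parser is wired up.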
Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
________________________________
From: Tim Allison <[email protected]>
Sent: Tuesday, September 7, 2021 2:57 PM
To: [email protected] <[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question
Can you share the code you're using? How are you injecting your
custom ContentHandler? Is this occurring on pdf documents or ppt/pptx
or something else?
On Tue, Sep 7, 2021 at 2:22 PM Ensor, Neal <[email protected]> wrote:
>
> I have been using Tika for quite some time to handle document text extraction
> for SOLR indexing, but attempting to update to 2.1.0 I am encountering an
> issue:
>
> I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I
> have previously relied on to trigger page-based calls around <div
> class="page"> events on startElement. These no longer appear to be generated
> for me. The same code with 1.27 had no issues here. I am in fact also no longer
> "seeing" any "<a>" tag events in my handler either. Is there some
> alternative way to access the content handler I am not employing?
>
> When using Tika 2.1.0 "app" via the command-line with --xml, the resulting
> document shows all the tags I expect, but my ContentHandler does not receive
> them. Any advice would be appreciated, and if more info or code snippets from
> me might help, I'd be happy to provide them.
>
> Thanks in advance!