At present I am testing with a mix of mostly PDF and OpenOffice documents that 
were previously processed successfully.  The relevant code is essentially this:

AutoDetectParser parser = new AutoDetectParser();
Metadata md = new Metadata();
CustomContentHandler handler = new CustomContentHandler();

parser.parse(Files.newInputStream(Paths.get("myfile")), handler, md);

In the CustomContentHandler (extending DefaultHandler from SAX) I implement the 
startElement and endElement events, and log each element name.  Previously, I 
would get a number of "a", "div", and other HTML tag events.  Since 2.1.0, I 
only get "html", "head", a few "meta" tags, a "title", and "body".
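
For reference, here is a minimal sketch of the kind of handler described above. It is not the actual CustomContentHandler; the class name PageCountingHandler and its helper methods are hypothetical. To keep it self-contained it is driven by the JDK's own SAX parser on a small XHTML string rather than by Tika, on the assumption that the same handler could be passed as the second argument to parser.parse.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every element name and counts <div class="page">.
public class PageCountingHandler extends DefaultHandler {
    private final List<String> elements = new ArrayList<>();
    private int pageCount = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to qName when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        elements.add(name);
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pageCount++;
        }
    }

    public int getPageCount() {
        return pageCount;
    }

    public List<String> getElements() {
        return elements;
    }

    // Test driver only: feeds an XHTML string through the JDK SAX parser.
    public static PageCountingHandler parseXhtml(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        PageCountingHandler handler = new PageCountingHandler();
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body>"
                + "<div class=\"page\"><p>one</p></div>"
                + "<div class=\"page\"><a href=\"#\">two</a></div>"
                + "</body></html>";
        PageCountingHandler h = parseXhtml(xhtml);
        System.out.println(h.getPageCount() + " pages; elements: " + h.getElements());
    }
}
```

Logging the full list of element names, as this sketch does, makes it easy to compare what the handler receives against the --xml output of the command-line app.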

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
________________________________
From: Tim Allison <[email protected]>
Sent: Tuesday, September 7, 2021 2:57 PM
To: [email protected] <[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Can you share the code you're using?  How are you injecting your
custom ContentHandler?  Is this occurring on pdf documents or ppt/pptx
or something else?

On Tue, Sep 7, 2021 at 2:22 PM Ensor, Neal <[email protected]> wrote:
>
> I have been using Tika for quite some time to handle document text extraction 
> for SOLR indexing, but attempting to update to 2.1.0 I am encountering an 
> issue:
>
> I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I 
> have previously relied on to trigger page-based calls around <div 
> class="page"> events on startElement.  These do not appear to be generated for 
> me now.  The same code with 1.27 had no issues here.  I am in fact also no longer 
> "seeing" any "<a>" tag events in my handler either.  Is there some 
> alternative way to access the content handler I am not employing?
>
> When using Tika 2.1.0 "app" via the command-line with --xml, the resulting 
> document shows all the tags I expect, but my ContentHandler does not receive them.  Any 
> advice would be appreciated, and if more info or code snippets from me might 
> help, I'd be happy to provide.
>
> Thanks in advance!
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> [email protected]
