Re: Tika 2.1 custom ContentHandler question

Ensor, Neal Tue, 07 Sep 2021 12:18:28 -0700

Also, and probably more importantly, NONE of the text is being captured by my 
handler.  The SAX "characters" method isn't being called at all and everything 
comes back blank, so something is clearly not hooked up.

Using the downloaded "app":

java -jar tika-app-2.1.0.jar --xml myfile.pdf

results in what I would expect, XML tags and all content intact.

The maven dependencies of my project:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>2.1.0</version>
</dependency>

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>2.1.0</version>
  <type>pom</type>
</dependency>

If that makes any difference.
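
Since the "app" jar bundles its parsers while the project relies on the two 
dependencies above, one thing worth ruling out is whether the format-specific 
parser classes are actually on the project's classpath.  The following is only 
a diagnostic sketch using plain JDK reflection.  The class name 
ParserClasspathCheck is hypothetical, and the listed Tika class names are my 
assumption of where the PDF and OpenDocument parsers live; adjust them as 
needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Diagnostic sketch (hypothetical helper, not part of Tika): reports which
// classes the current classpath can load.
class ParserClasspathCheck {
    // Assumed Tika class names; these are not taken from the thread.
    static final String[] CANDIDATES = {
        "org.apache.tika.parser.AutoDetectParser",
        "org.apache.tika.parser.pdf.PDFParser",
        "org.apache.tika.parser.odf.OpenDocumentParser"
    };

    // Returns true if the named class can be located without initializing it.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Checks each name in order and keeps that order in the result.
    static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        for (String name : classNames) {
            result.put(name, isLoadable(name));
        }
        return result;
    }

    public static void main(String[] args) {
        check(CANDIDATES).forEach((name, ok) ->
            System.out.println((ok ? "found    " : "MISSING  ") + name));
    }
}
```

If the parser classes show up as missing when this is run inside the project, 
the dependency list is the likely culprit.  As far as I know, Tika 2.x 
publishes the bundled parser implementations under the 
tika-parsers-standard-package artifact, but treat that as something to verify 
against the Tika 2.x migration notes.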

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
________________________________
From: Ensor, Neal <[email protected]>
Sent: Tuesday, September 7, 2021 3:03 PM
To: [email protected] <[email protected]>; [email protected] 
<[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Presently, I'm trying a mix of mostly PDF and OpenOffice documents that have 
parsed successfully before.  The relevant code is essentially this:

AutoDetectParser parser = new AutoDetectParser();
Metadata md = new Metadata();
CustomContentHandler handler = new CustomContentHandler();

// parse(InputStream, ContentHandler, Metadata)
parser.parse(Files.newInputStream(Paths.get("myfile")), handler, md);

In the CustomContentHandler (extending DefaultHandler from SAX) I implement the 
startElement and endElement events, and log each such item.  Previously, I 
would get a number of "a", "div", and other HTML tag events.  Since 2.1.0, I 
only get "html", "head", a few "meta" tags, a "title", and "body".
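
For reference, a minimal sketch of the kind of handler described above.  It is 
not the actual CustomContentHandler: the name LoggingContentHandler, its fields, 
and the run helper are hypothetical.  To stay self-contained it is driven by 
the JDK's own SAX parser over a literal XHTML string rather than by Tika, so it 
only shows which events a handler of this shape records when they do arrive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the custom handler discussed in this thread.
class LoggingContentHandler extends DefaultHandler {
    final List<String> elements = new ArrayList<>();  // element names, in order seen
    final StringBuilder text = new StringBuilder();   // all character data received
    int pages = 0;                                     // number of <div class="page"> starts

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        elements.add(name);
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pages++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Feeds an XHTML string through the JDK SAX parser into a fresh handler.
    static LoggingContentHandler run(String xhtml) {
        try {
            LoggingContentHandler handler = new LoggingContentHandler();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
    }
}
```

Saving the --xml output of the app to a file and passing its contents to run 
would be one way to confirm the handler itself behaves, which would leave the 
parser setup as the remaining suspect.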

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
________________________________
From: Tim Allison <[email protected]>
Sent: Tuesday, September 7, 2021 2:57 PM
To: [email protected] <[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Can you share the code you're using?  How are you injecting your
custom ContentHandler?  Is this occurring on pdf documents or ppt/pptx
or something else?

On Tue, Sep 7, 2021 at 2:22 PM Ensor, Neal <[email protected]> wrote:
>
> I have been using Tika for quite some time to handle document text extraction 
> for SOLR indexing, but attempting to update to 2.1.0 I am encountering an 
> issue:
>
> I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I 
> have previously relied on to trigger page-based calls around <div 
> class="page"> events on startElement.  These do not appear to be generated 
> for me now.  The same code with 1.27 had no issues here.  I am in fact also 
> no longer "seeing" any "<a>" tag events in my handler either.  Is there some 
> alternative way to access the content handler that I am not employing?
>
> When using Tika 2.1.0 "app" via the command-line with --xml, the resulting 
> document shows all the tags I expect, but my ContentHandler does not receive 
> them.  Any advice would be appreciated, and if more info or code snippets 
> from me might help, I'd be happy to provide them.
>
> Thanks in advance!
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> [email protected]
