Re: Tika 2.1 custom ContentHandler question

Ensor, Neal Tue, 07 Sep 2021 13:01:34 -0700

Well, I must apologize for taking this to the list:  it seems this is entirely 
my fault.

I had not properly followed the "Migrating to Tika 2.0.0" guide:
https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0

Specifically, it seems I should be importing "tika-parsers-standard-package" 
rather than simply "tika-parsers".  After fixing that dependency my tests ran 
fine.  So, if anyone can learn from my mistake, hopefully that will have been 
at least worth it!
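
For anyone hitting the same symptom, the corrected dependency would look roughly like the fragment below. This is a sketch based on the migration guide, not a copy of my actual pom: it assumes the artifact ID as published is "tika-parsers-standard-package" (plural "parsers") and keeps the same 2.1.0 version as tika-core.

```xml
<!-- Replaces the old "tika-parsers" pom dependency from 1.x-style builds. -->
<!-- Assumption: version kept in step with tika-core (2.1.0 here). -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers-standard-package</artifactId>
  <version>2.1.0</version>
</dependency>
```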

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
________________________________
From: Ensor, Neal <[email protected]>
Sent: Tuesday, September 7, 2021 3:18 PM
To: [email protected] <[email protected]>; [email protected] 
<[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Also, and probably more importantly, NONE of the text is being captured by my 
handler.  The "characters" method of SAX isn't being called at all, everything 
is coming back blank, so something is clearly not hooking up at all.

Using the downloaded "app":

java -jar tika-app-2.1.0.jar --xml myfile.pdf

results in what I would expect, XML tags and all content intact.

The maven dependencies of my project:

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>2.1.0</version>
</dependency>

<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>2.1.0</version>
  <type>pom</type>
</dependency>

If that makes any difference..

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
________________________________
From: Ensor, Neal <[email protected]>
Sent: Tuesday, September 7, 2021 3:03 PM
To: [email protected] <[email protected]>; [email protected] 
<[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Presently, I'm trying a mix of generally PDF and OpenOffice documents that have 
had previous success.  The relevant code is mainly something like this:

AutoDetectParser parser = new AutoDetectParser();
Metadata md = new Metadata();
CustomContentHandler handler = new CustomContentHandler();

parser.parse(Files.newInputStream(Paths.get("myfile")), handler, md);

In the CustomContentHandler (extending DefaultHandler from SAX) I implement the 
startElement and endElement events, and log each such item.  Previously, I 
would get a number of "a", "div", and other HTML tag events.  Since 2.1.0, I 
only get "html", "head", a few "meta" tags, a "title", and "body".
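
To rule out the handler itself, a handler of this kind can be exercised without Tika at all. The sketch below is not my production class: PageCountingHandler and its accessors are hypothetical names, and it is fed by the JDK's own SAX parser from a hand-written XHTML string that imitates the <div class="page"> structure described above. If a handler like this behaves here but sees only "html", "head", and "body" under Tika, the handler is probably fine and the parser side is the place to look.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a custom handler: counts page divs and collects text.
public class PageCountingHandler extends DefaultHandler {
    private int pages = 0;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the parser is not namespace-aware, so fall back to qName.
        String name = localName.isEmpty() ? qName : localName;
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pages++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // If this is never called, no text content is reaching the handler.
        text.append(ch, start, length);
    }

    public int getPages() { return pages; }

    public String getText() { return text.toString(); }

    public static void main(String[] args) throws Exception {
        // Hand-written sample imitating paged XHTML output; not real Tika output.
        String xhtml = "<html><body><div class=\"page\"><p>one</p></div>"
                     + "<div class=\"page\"><p>two</p></div></body></html>";
        PageCountingHandler handler = new PageCountingHandler();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        System.out.println(handler.getPages() + " pages, text=" + handler.getText());
    }
}
```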

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
________________________________
From: Tim Allison <[email protected]>
Sent: Tuesday, September 7, 2021 2:57 PM
To: [email protected] <[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Can you share the code you're using?  How are you injecting your
custom ContentHandler?  Is this occurring on pdf documents or ppt/pptx
or something else?

On Tue, Sep 7, 2021 at 2:22 PM Ensor, Neal <[email protected]> wrote:
>
> I have been using Tika for quite some time to handle document text extraction 
> for SOLR indexing, but attempting to update to 2.1.0 I am encountering an 
> issue:
>
> I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I 
> have previously relied on to trigger page-based calls around <div 
> class="page"> events on startElement.  These do not appear to be generated for 
> me now?  Same code with 1.27 had no issues here.  I am in fact also no longer 
> "seeing" any "<a>" tag events in my handler either.  Is there some 
> alternative way to access the content handler I am not employing?
>
> When using Tika 2.1.0 "app" via the command-line with --xml, the resulting 
> document shows all the tags I expect, but my ContentHandler does not receive 
> them.  Any advice would be appreciated, and if more info or code snippets 
> from me might help, I'd be happy to provide them.
>
> Thanks in advance!
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> [email protected]
