Re: Tika 2.1 custom ContentHandler question

Tim Allison Tue, 07 Sep 2021 13:14:52 -0700

No need to apologize at all.  There are some major differences in 2.x.
Many thanks for migrating and sharing your pain!


Please let us know what else you find.

Best,

     Tim

On Tue, Sep 7, 2021 at 4:01 PM Ensor, Neal <[email protected]> wrote:

> Well, I must apologize for taking this to the list:  it seems this is
> entirely my fault.
>
> I had not properly followed the migration guide:
> https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0
>
> Specifically, it seems I should be importing
> "tika-parsers-standard-package" rather than simply "tika-parsers".  After
> fixing that dependency my tests ran fine.  So, if anyone can learn from my
> mistake, hopefully that will have been at least worth it!
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> [email protected]
> ------------------------------
> *From:* Ensor, Neal <[email protected]>
> *Sent:* Tuesday, September 7, 2021 3:18 PM
> *To:* [email protected] <[email protected]>; [email protected] <
> [email protected]>
> *Subject:* Re: Tika 2.1 custom ContentHandler question
>
> Also, and probably more importantly, NONE of the text is being captured by
> my handler.  The "characters" method of SAX isn't being called at all,
> everything is coming back blank, so something is clearly not hooking up at
> all.
>
> Using the downloaded "app":
>
> java -jar tika-app-2.1.0.jar --xml myfile.pdf
>
> results in what I would expect, XML tags and all content intact.
>
> The maven dependencies of my project:
>
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-core</artifactId>
>   <version>2.1.0</version>
> </dependency>
>
> <dependency>
>   <groupId>org.apache.tika</groupId>
>   <artifactId>tika-parsers</artifactId>
>   <version>2.1.0</version>
>   <type>pom</type>
> </dependency>
>
> If that makes any difference..
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> [email protected]
> ------------------------------
> *From:* Ensor, Neal <[email protected]>
> *Sent:* Tuesday, September 7, 2021 3:03 PM
> *To:* [email protected] <[email protected]>; [email protected] <
> [email protected]>
> *Subject:* Re: Tika 2.1 custom ContentHandler question
>
> Presently, I'm trying a mix of generally PDF and OpenOffice documents that
> have had previous success.  The relevant code is mainly something like this:
>
> AutoDetectParser parser = new AutoDetectParser();
> Metadata md = new Metadata();
> CustomContentHandler handler = new CustomContentHandler();
>
> parser.parse(Files.newInputStream(Paths.get("myfile")), handler, md);
>
> In the CustomContentHandler (extending DefaultHandler from SAX) I
> implement the startElement and endElement events, and log each such item.
> Previously, I would get a number of "a", "div", and other HTML tag events.
>  Since 2.1.0, I only get "html", "head", a few "meta" tags, and a "title",
> and "body".
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> [email protected]
> ------------------------------
> *From:* Tim Allison <[email protected]>
> *Sent:* Tuesday, September 7, 2021 2:57 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: Tika 2.1 custom ContentHandler question
>
> Can you share the code you're using?  How are you injecting your
> custom ContentHandler?  Is this occurring on pdf documents or ppt/pptx
> or something else?
>
> On Tue, Sep 7, 2021 at 2:22 PM Ensor, Neal <[email protected]> wrote:
> >
> > I have been using Tika for quite some time to handle document text
> extraction for SOLR indexing, but attempting to update to 2.1.0 I am
> encountering an issue:
> >
> > I have a custom SAX ContentHandler wired in to AutoDetectParser calls
> that I have previously relied on to trigger page-based calls around <div
class="page"> events on startElement.  These do not appear to be generated
> for me now?  Same code with 1.27 had no issues here.  I am in fact also no
> longer "seeing" any "<a>" tag events in my handler either.  Is there some
> alternative way to access the content handler I am not employing?
> >
> > When using Tika 2.1.0 "app" via the command-line with --xml, the
resulting document shows all the tags I expect, but my ContentHandler does
not receive them.  Any advice would be appreciated, and if more info or code snippets
> from me might help, I'd be happy to provide.
> >
> > Thanks in advance!
> >
> > Neal Ensor
> > U.S. Department of Energy
> > Office of Scientific and Technical Information
> > Oak Ridge, TN
> > (865) 576-1295
> > [email protected]
>

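For anyone hitting the same symptom, here is a minimal sketch of the kind
of page-aware handler described in the thread.  It is an illustration, not
Neal's actual code: the class name PageTextHandler and its getPages()
accessor are hypothetical.  It only depends on the SAX classes shipped with
the JDK, so the main method drives it with synthetic events instead of a
real Tika parse.  With Tika on the classpath, the handler would be passed to
AutoDetectParser.parse(stream, handler, metadata) as in the snippet quoted
above.  If the parser modules are missing, the handler will likely see only
the outer html/head/body events and no characters() calls, which matches
what was reported.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text inside each <div class="page">. */
class PageTextHandler extends DefaultHandler {
    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page div
    private int depth;             // div nesting depth inside the current page

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        if (!"div".equals(name)) {
            return;
        }
        if (current == null) {
            // A new page starts only on <div class="page">.
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                depth = 1;
            }
        } else {
            depth++; // nested div inside the current page
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = localName.isEmpty() ? qName : localName;
        if (current != null && "div".equals(name) && --depth == 0) {
            pages.add(current.toString().trim());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    List<String> getPages() {
        return pages;
    }

    public static void main(String[] args) {
        // Synthetic SAX events standing in for a real parse.
        PageTextHandler handler = new PageTextHandler();
        AttributesImpl pageAttrs = new AttributesImpl();
        pageAttrs.addAttribute("", "class", "class", "CDATA", "page");
        for (String text : new String[] {"first page", "second page"}) {
            handler.startElement("", "div", "div", pageAttrs);
            handler.characters(text.toCharArray(), 0, text.length());
            handler.endElement("", "div", "div");
        }
        System.out.println(handler.getPages());
    }
}
```

If getPages() comes back empty on a document that tika-app renders
correctly with --xml, the project's dependencies are the first thing to
check, as the resolution of this thread suggests.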