Well, I must apologize for taking this to the list: it seems this is entirely my fault.
I had not properly followed the migration guide, "Migrating to Tika 2.0.0":

https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0

Specifically, it seems I should be importing "tika-parsers-standard-package" rather than simply "tika-parsers". After fixing that dependency, my tests ran fine.

So, if anyone can learn from my mistake, hopefully that will have been at least worth it!

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]

________________________________
From: Ensor, Neal <[email protected]>
Sent: Tuesday, September 7, 2021 3:18 PM
To: [email protected] <[email protected]>; [email protected] <[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Also, and probably more importantly, NONE of the text is being captured by my handler. The "characters" method of SAX isn't being called at all; everything is coming back blank, so something is clearly not hooking up.

Using the downloaded "app":

    java -jar tika-app-2.1.0.jar --xml myfile.pdf

results in what I would expect: XML tags and all content intact.

The Maven dependencies of my project:

    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>2.1.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>2.1.0</version>
      <type>pom</type>
    </dependency>

If that makes any difference.

________________________________
From: Ensor, Neal <[email protected]>
Sent: Tuesday, September 7, 2021 3:03 PM
To: [email protected] <[email protected]>; [email protected] <[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Presently, I'm trying a mix of mostly PDF and OpenOffice documents that have had previous success. The relevant code is mainly something like this:

    AutoDetectParser parser = new AutoDetectParser();
    Metadata md = new Metadata();
    CustomContentHandler handler = new CustomContentHandler();
    parser.parse(Files.newInputStream(Paths.get("myfile")), handler, md);

In the CustomContentHandler (extending DefaultHandler from SAX) I implement the startElement and endElement events, and log each such item. Previously, I would get a number of "a", "div", and other HTML tag events. Since 2.1.0, I only get "html", "head", a few "meta" tags, a "title", and "body".

________________________________
From: Tim Allison <[email protected]>
Sent: Tuesday, September 7, 2021 2:57 PM
To: [email protected] <[email protected]>
Subject: Re: Tika 2.1 custom ContentHandler question

Can you share the code you're using? How are you injecting your custom ContentHandler? Is this occurring on pdf documents or ppt/pptx or something else?

On Tue, Sep 7, 2021 at 2:22 PM Ensor, Neal <[email protected]> wrote:
>
> I have been using Tika for quite some time to handle document text extraction
> for SOLR indexing, but attempting to update to 2.1.0 I am encountering an
> issue:
>
> I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I
> have previously relied on to trigger page-based calls around <div
> class="page"> events on startElement. These do not appear to be generated for
> me now? Same code with 1.27 had no issues here. I am in fact also no longer
> "seeing" any "<a>" tag events in my handler either. Is there some
> alternative way to access the content handler I am not employing?
>
> When using Tika 2.1.0 "app" via the command-line with --xml, the resulting
> document shows all the tags I expect, but my ContentHandler is not receiving
> them. Any advice would be appreciated, and if more info or code snippets from
> me might help, I'd be happy to provide.
>
> Thanks in advance!
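
For anyone who lands on this thread later, here is a rough sketch of the kind of handler discussed above: a DefaultHandler subclass that counts <div class="page"> start events and collects text from the "characters" callback. The class name PageCountingHandler and the helper parseSample are made up for illustration and are not the handler from the original project. To keep the sketch runnable without Tika on the classpath, it is driven by the JDK's own SAX parser over a small hand-written XHTML string; in the setup described in the thread, the same handler object would presumably be passed to AutoDetectParser.parse(stream, handler, metadata) instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (name is an assumption): counts <div class="page">
// start events and accumulates all character data it receives.
public class PageCountingHandler extends DefaultHandler {
    private int pages = 0;
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A parser that is not namespace-aware leaves localName empty,
        // so fall back to the qualified name in that case.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pages++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // If this is never called, no body content is reaching the handler.
        text.append(ch, start, length);
    }

    public int getPages() { return pages; }

    public String getText() { return text.toString(); }

    // Illustration-only helper: feeds an XHTML string through the JDK SAX
    // parser (not Tika) so the handler can be exercised on its own.
    public static PageCountingHandler parseSample(String xhtml) throws Exception {
        PageCountingHandler handler = new PageCountingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<html><body>"
                + "<div class=\"page\"><p>one</p></div>"
                + "<div class=\"page\"><p>two</p></div>"
                + "</body></html>";
        PageCountingHandler h = parseSample(sample);
        System.out.println(h.getPages() + " page(s), text: " + h.getText());
    }
}
```

A handler like this that reports zero pages and empty text, while tika-app --xml shows full content for the same file, matches the symptom described above, where the parser dependency turned out to be the problem.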
