Can you share the code you're using? How are you injecting your custom ContentHandler? Is this occurring on pdf documents or ppt/pptx or something else?
On Tue, Sep 7, 2021 at 2:22 PM Ensor, Neal <[email protected]> wrote:
>
> I have been using Tika for quite some time to handle document text extraction
> for SOLR indexing, but attempting to update to 2.1.0 I am encountering an
> issue:
>
> I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I
> have previously relied on to trigger page-based calls around
> <div class="page"> events on startElement. These do not appear to be
> generated for me now? Same code with 1.27 had no issues here. I am in fact
> also no longer "seeing" any "<a>" tag events in my handler either. Is there
> some alternative way to access the content handler I am not employing?
>
> When using Tika 2.1.0 "app" via the command-line with --xml, the resulting
> document shows all the tags I expect, but my ContentHandler does not see
> them. Any advice would be appreciated, and if more info or code snippets
> from me might help, I'd be happy to provide.
>
> Thanks in advance!
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> [email protected]
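
For reference while the actual code is pending, here is a minimal sketch of the kind of handler the quoted message describes: one that reacts to <div class="page"> and <a> start events. It is not Neal's code. The class name PageCountingHandler and the run() helper are made up for illustration, and the JDK's own SAX parser stands in for Tika so the snippet runs without extra dependencies. With Tika, the same handler would presumably be passed as the ContentHandler argument to AutoDetectParser.parse(InputStream, ContentHandler, Metadata, ParseContext). The sketch does not reproduce or explain the 2.1.0 behaviour.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts <div class="page"> and <a> start events.
class PageCountingHandler extends DefaultHandler {
    private int pages = 0;
    private int links = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName can be empty if the producer is not namespace-aware; fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("div".equals(name) && "page".equals(classOf(atts))) {
            pages++;   // a real handler would trigger its per-page logic here
        } else if ("a".equals(name)) {
            links++;
        }
    }

    // Look up the class attribute by (namespace, local name) first, then by qualified name.
    private static String classOf(Attributes atts) {
        String value = atts.getValue("", "class");
        return value != null ? value : atts.getValue("class");
    }

    public int getPages() { return pages; }
    public int getLinks() { return links; }

    // Stand-in for a Tika parse (assumption): feed an XHTML string through the JDK SAX parser.
    static PageCountingHandler run(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        PageCountingHandler handler = new PageCountingHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }
}
```

Running the --xml output of the Tika app through run() and comparing the counts against what the handler sees when called from AutoDetectParser would show whether the events are lost before they reach the handler or inside it.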
