Can you share the code you're using?  How are you injecting your
custom ContentHandler?  Is this occurring with PDF documents, PPT/PPTX
files, or something else?
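
For reference, this is roughly the kind of handler I'm picturing from
your description.  It is only a minimal sketch: it uses the plain JDK
SAX parser on a hand-written XHTML string rather than Tika, and the
names PageCountingHandler and run are made up for illustration, not
taken from your code or from Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts <div class="page"> and <a> start events.
class PageCountingHandler extends DefaultHandler {
    int pages = 0;
    int links = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName can be empty when the parser is not namespace-aware,
        // so fall back to qName in that case.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pages++;
        } else if ("a".equals(name)) {
            links++;
        }
    }

    // Assumption for this sketch: the input is well-formed XHTML in a String.
    static PageCountingHandler run(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        PageCountingHandler handler = new PageCountingHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<div class=\"page\"><p>one</p></div>"
            + "<div class=\"page\"><p>two <a href=\"#x\">link</a></p></div>"
            + "</body></html>";
        PageCountingHandler h = run(xhtml);
        System.out.println(h.pages + " pages, " + h.links + " links");
    }
}
```

Knowing how your handler and the way you pass it to the parser differ
from this would help narrow things down.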

On Tue, Sep 7, 2021 at 2:22 PM Ensor, Neal <[email protected]> wrote:
>
> I have been using Tika for quite some time to handle document text extraction 
> for SOLR indexing, but attempting to update to 2.1.0 I am encountering an 
> issue:
>
> I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I 
> have previously relied on to trigger page-based calls around <div 
> class="page"> events on startElement.  These do not appear to be generated for
> me now.  The same code with 1.27 had no issues here.  I am in fact also no longer
> "seeing" any "<a>" tag events in my handler either.  Is there some 
> alternative way to access the content handler I am not employing?
>
> When using Tika 2.1.0 "app" via the command-line with --xml, the resulting 
> document shows all the tags I expect, but my ContentHandler does not receive
> them.  Any advice would be appreciated, and if more info or code snippets from
> me might help, I'd be happy to provide them.
>
> Thanks in advance!
>
> Neal Ensor
> U.S. Department of Energy
> Office of Scientific and Technical Information
> Oak Ridge, TN
> (865) 576-1295
> [email protected]
