I have been using Tika for quite some time to handle document text extraction for Solr indexing, but in attempting to update to 2.1.0 I have run into an issue:
I have a custom SAX ContentHandler wired into my AutoDetectParser calls, which I have previously relied on to trigger page-based processing on the startElement events for <div class="page">. These events no longer appear to be generated for me. The same code had no issues with 1.27. I am also no longer seeing any <a> tag events in my handler.

Is there some alternative way to access the content handler that I am not employing? When I run the Tika 2.1.0 "app" from the command line with --xml, the resulting document shows all the tags I expect, but my ContentHandler does not receive them.

Any advice would be appreciated, and if more info or code snippets from me would help, I'd be happy to provide them.

Thanks in advance!

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
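
For illustration only, here is a minimal sketch of the kind of handler described above. It is not the actual production code: the class name PageEventSketch, the handler PageCountingHandler, and its fields are hypothetical, and the events are fed from the JDK's own SAX parser on a hand-written XHTML string rather than from Tika, so that the snippet runs without any Tika dependency. It only shows which startElement events the handler is expected to see.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PageEventSketch {

    /** Hypothetical handler: counts page <div> and <a> start events. */
    static class PageCountingHandler extends DefaultHandler {
        int pages = 0;
        int links = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // A namespace-aware source sets localName; fall back to qName otherwise.
            String name = (localName == null || localName.isEmpty()) ? qName : localName;
            if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
                pages++;   // a page boundary, as in <div class="page">
            } else if ("a".equals(name)) {
                links++;   // an anchor element
            }
        }
    }

    /** Feeds the XHTML string through the JDK SAX parser into the handler. */
    static PageCountingHandler run(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        PageCountingHandler handler = new PageCountingHandler();
        // Assumption: with Tika, the same handler would instead be passed to
        // parser.parse(stream, handler, metadata, context) on an AutoDetectParser.
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>one <a href=\"#x\">link</a></p></div>"
                + "<div class=\"page\"><p>two</p></div>"
                + "</body></html>";
        PageCountingHandler h = run(xhtml);
        System.out.println(h.pages + " pages, " + h.links + " links");
    }
}
```

Run against the sample string, the handler counts two page divisions and one link. The behaviour reported above corresponds to both counters staying at zero when the same handler is driven by Tika 2.1.0.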
