I have been using Tika for quite some time to handle document text extraction 
for SOLR indexing, but in attempting to update to 2.1.0 I have run into an issue:

I have a custom SAX ContentHandler wired in to AutoDetectParser calls that I 
have previously relied on to trigger page-based calls around <div class="page"> 
events in startElement.  These no longer appear to be generated for me.  The 
same code with 1.27 had no issues here.  In fact, I am also no longer seeing 
any <a> tag events in my handler.  Is there some alternative way to attach 
the content handler that I am not using?
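
For illustration, a minimal sketch of a handler of this kind. The class name 
PageEventHandler, its counters, and the run() helper are hypothetical and not 
the actual code in question. To stay self-contained it is driven by the JDK's 
SAX parser over a hand-written XHTML string rather than by Tika; with Tika, the 
same handler would be passed as the ContentHandler argument of 
AutoDetectParser.parse(stream, handler, metadata, context).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts <div class="page"> and <a> start events.
class PageEventHandler extends DefaultHandler {
    int pages = 0;
    int anchors = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Fall back to qName when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("div".equals(name) && "page".equals(atts.getValue("class"))) {
            pages++;      // a page boundary; page-based logic would hook in here
        } else if ("a".equals(name)) {
            anchors++;    // a link element
        }
    }

    // Assumption: stands in for the Tika parse call by feeding XHTML text
    // through the JDK SAX parser into a fresh handler.
    static PageEventHandler run(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        PageEventHandler handler = new PageEventHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler;
    }
}
```

Running the XHTML that the command-line app prints with --xml through such a 
handler, and comparing the counts with what the same handler sees when attached 
to AutoDetectParser, would show whether the events are lost before reaching 
the handler.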

When using the Tika 2.1.0 "app" via the command line with --xml, the resulting 
document shows all the tags I expect, but my ContentHandler does not receive 
them.  Any advice would be appreciated, and if more info or code snippets from 
me might help, I'd be happy to provide them.

Thanks in advance!

Neal Ensor
U.S. Department of Energy
Office of Scientific and Technical Information
Oak Ridge, TN
(865) 576-1295
[email protected]
