Hi Jim,
On a second read, I don't _think_ there's a good way to do this currently,
although there are subtleties in how "underlying parsers" deal with different
types of errors.
For example, if PDFBox's parser logs an "I can't find the Unicode mapping
for Font X" message, you're right, Tika doesn't let you know about this
because Tika itself doesn't know about it.
If, however, the dependent parser throws an exception that can be recovered
from, Tika sometimes does know about this and will let you know. For example,
Tika's PDFParser might catch an IOException on page 3 and then try to parse
page 4; it will throw the page 3 exception after it has finished parsing the
document.
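
To make the "catch, keep going, rethrow at the end" idea concrete, here is a
minimal plain-Java sketch of the pattern. It is not Tika's actual PDFParser
code: PageReader and readAll are hypothetical names made up for this
illustration, and the real parser may differ in its details.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DeferredExceptionSketch {

    /** Hypothetical stand-in for a per-page extraction step that may fail. */
    public interface PageReader {
        String read(int page) throws IOException;
    }

    /**
     * Reads every page, remembering the first failure instead of aborting.
     * Text from the readable pages is appended to 'out' before the saved
     * exception is rethrown.
     */
    public static void readAll(int pageCount, PageReader reader, List<String> out)
            throws IOException {
        IOException first = null;
        for (int page = 1; page <= pageCount; page++) {
            try {
                out.add(reader.read(page));
            } catch (IOException e) {
                if (first == null) {
                    first = e; // keep going; report once the loop is done
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        List<String> text = new ArrayList<>();
        try {
            readAll(4, page -> {
                if (page == 3) {
                    throw new IOException("bad content stream on page 3");
                }
                return "page " + page;
            }, text);
        } catch (IOException e) {
            // The caller still has the text of the pages that could be read.
            System.out.println("Recovered " + text.size()
                    + " pages; first error: " + e.getMessage());
        }
    }
}
```

The practical consequence for a caller is that an exception from parse() does
not necessarily mean nothing was extracted; the handler may already hold
content from the pages that were readable.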
Generally speaking with embedded documents, Tika's AutoDetectParser's legacy
behavior has been to swallow exceptions. So, if you're trying to identify
exceptions in embedded files (e.g. macros), I'd strongly recommend using the
RecursiveParserWrapper (-J option in tika-app, /rmeta endpoint in tika-server).
Unlike the AutoDetectParser, the RecursiveParserWrapper catches exceptions and
records them in a field in the metadata [1].
That's the behavior if a parser throws an exception on an embedded document.
However, if a parent document (let's say a .doc file) has problems handling an
embedded InputStream (say with an embedded image), that exception will be
stored in the metadata of the .doc file[2].
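
Here is a rough sketch of how the two fields could be read from Java. It
assumes the Tika 1.15 API (the RecursiveParserWrapper constructor that takes a
ContentHandlerFactory, and getMetadata() returning one Metadata object per
document, container first); later releases may differ. The class name
EmbeddedExceptionReport and its methods are made up for the example, and
counting "documents with at least one recorded exception" is just one possible
way to summarize the results.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class EmbeddedExceptionReport {

    /** Parses the stream; assumes the Tika 1.15 RecursiveParserWrapper API. */
    public static List<Metadata> parse(InputStream stream) throws Exception {
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
        // As far as I can tell, a failure in the container document itself
        // still propagates from this call rather than being recorded.
        wrapper.parse(stream, new DefaultHandler(), new Metadata(), new ParseContext());
        return wrapper.getMetadata();
    }

    /** Counts documents that carry either of the two exception fields. */
    public static int countProblems(List<Metadata> docs) {
        int problems = 0;
        for (Metadata m : docs) {
            // [1]: a parser threw while handling this embedded document.
            String embedded = m.get(RecursiveParserWrapper.EMBEDDED_EXCEPTION);
            // [2]: this document had trouble handing off an embedded stream.
            String[] streams =
                    m.getValues(TikaCoreProperties.TIKA_META_EXCEPTION_EMBEDDED_STREAM);
            if (embedded != null || streams.length > 0) {
                problems++;
            }
        }
        return problems;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            List<Metadata> docs = parse(in);
            System.out.println(docs.size() + " document(s), "
                    + countProblems(docs) + " with recorded exceptions");
        }
    }
}
```

The values of those fields are the stack traces as strings, so you could log
them per document instead of just counting.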
In short, things are complicated. Please let us know if we can modify our code
or documentation to help your use cases.
Best,
Tim
[1] https://tika.apache.org/1.15/api/org/apache/tika/parser/RecursiveParserWrapper.html#EMBEDDED_EXCEPTION
[2] https://tika.apache.org/1.15/api/org/apache/tika/metadata/TikaCoreProperties.html#TIKA_META_EXCEPTION_EMBEDDED_STREAM
From: Jim Idle [mailto:[email protected]]
Sent: Sunday, June 4, 2017 4:34 AM
To: [email protected]
Subject: Detecting document format/parsing problems
When using Java direct calls and the AutoDetectParser I notice that if a
document is deliberately (malware) or accidentally (some bug say) corrupt or
badly formatted, then the underlying parsers will oft times log an error, but
this is not passed on by Tika.
Any examples out there on how I can be informed of parsing errors? Basically I
would like to know that the document has format problems and as much info as I
can about what is wrong (though in fact I could live with just counting the
number of errors if that's all that can be done), but I don't want to stop the
parse if the underlying parser can recover (good to know if it aborts before
finishing though).
Jim