RE: Detecting document format/parsing problems

Allison, Timothy B. Mon, 05 Jun 2017 10:32:09 -0700

Hi Jim,

  On a second read, I don't _think_ there's a good way to do this currently, 
although there are subtleties in how the "underlying parsers" deal with 
different types of errors.


For example, if PDFBox's parser logs an "I can't find the Unicode mapping 
for Font X" message, you're right, Tika doesn't let you know about this 
because Tika itself doesn't know about it.

If, however, the dependent parser throws an exception that can be recovered 
from, Tika sometimes does know about this and will let you know. For example, 
Tika's PDFParser might catch an IOException on page 3 and then try to parse 
page 4; it will throw the page 3 exception after it has finished parsing the 
document.
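
To make the "catch on page 3, keep going, rethrow at the end" behavior 
concrete, here is a minimal sketch of that pattern in plain Java. It is not 
Tika's actual PDFParser code; the names DeferredExceptionSketch, PageReader 
and parse are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, not Tika or PDFBox classes.
class DeferredExceptionSketch {

    /** Hypothetical stand-in for a per-page extraction step. */
    interface PageReader {
        String read(int pageNumber) throws IOException;
    }

    /**
     * Reads every page, appending recovered text to 'out'. A failure on one
     * page does not stop the loop; the first failure is rethrown only after
     * all pages have been attempted, and later failures are attached to it
     * as suppressed exceptions.
     */
    static void parse(int pageCount, PageReader reader, List<String> out)
            throws IOException {
        IOException first = null;
        for (int page = 1; page <= pageCount; page++) {
            try {
                out.add(reader.read(page));
            } catch (IOException e) {
                if (first == null) {
                    first = e;
                } else {
                    first.addSuppressed(e);
                }
            }
        }
        if (first != null) {
            throw first;
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        // Simulated document in which page 3 is unreadable.
        PageReader reader = page -> {
            if (page == 3) {
                throw new IOException("bad stream on page " + page);
            }
            return "text of page " + page;
        };
        try {
            parse(4, reader, out);
        } catch (IOException e) {
            System.out.println("Parse finished with error: " + e.getMessage());
        }
        System.out.println("Recovered pages: " + out.size());
    }
}
```

The point for a caller is that an exception from parse() does not necessarily 
mean nothing was extracted: whatever was collected before and after the bad 
page is still available, so it is worth keeping that content and recording 
the exception alongside it.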

Generally speaking with embedded documents, Tika's AutoDetectParser's legacy 
behavior has been to swallow exceptions.  So, if you're trying to identify 
exceptions in embedded files (e.g. macros), I'd strongly recommend using the 
RecursiveParserWrapper (-J option in tika-app, /rmeta endpoint in tika-server). 
Unlike the AutoDetectParser, the RecursiveParserWrapper catches exceptions and 
records them in a field in the metadata [1].

That's the behavior if a parser throws an exception on an embedded document.  
However, if a parent document (let's say a .doc file) has problems handling an 
embedded InputStream (say with an embedded image), that exception will be 
stored in the metadata of the .doc file [2].

In short, things are complicated.  Please let us know if we can modify our code 
or documentation to help your use cases.

Best,

             Tim


[1] 
https://tika.apache.org/1.15/api/org/apache/tika/parser/RecursiveParserWrapper.html#EMBEDDED_EXCEPTION

[2] 
https://tika.apache.org/1.15/api/org/apache/tika/metadata/TikaCoreProperties.html#TIKA_META_EXCEPTION_EMBEDDED_STREAM

From: Jim Idle [mailto:[email protected]]
Sent: Sunday, June 4, 2017 4:34 AM
To: [email protected]
Subject: Detecting document format/parsing problems

When using direct Java calls and the AutoDetectParser, I notice that if a 
document is deliberately (malware) or accidentally (some bug, say) corrupt or 
badly formatted, then the underlying parsers will often log an error, but 
this is not passed on by Tika.

Any examples out there on how I can be informed of parsing errors? Basically I 
would like to know that the document has format problems and as much info as I 
can about what is wrong (though in fact I could live with just counting the 
number of errors if that's all that can be done), but I don't want to stop the 
parse if the underlying parser can recover (good to know if it aborts before 
finishing though).

Jim
