I have a few PDF files that trigger Zip Bomb protections.

org.apache.tika.exception.TikaException: Zip bomb detected!
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException:
Suspected zip bomb: 100 levels of XML element nesting

One example PDF is: https://www.princeton.edu/~hmilner/milner_ipe_money.pdf
The zip bomb detection triggers because of a very unusual
outline/bookmarks structure, which causes deep recursion in
AbstractPDF2XHTML.extractBookmarkText.
It fails with the tika-app GUI and on the command line with XML, text,
and text main content only ( -T ) output.

AbstractPDF2XHTML already has a hard recursion limit for AcroForms (
MAX_ACROFORM_RECURSIONS = 10 ). Would it make sense to add a similar
limit for PDF bookmark extraction?
If this seems like a good idea, I can create an issue and have a go
at a patch in the next few days.
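
To make the idea concrete, here is a rough sketch of what I mean by a
depth limit. It is not Tika code: Bookmark is a stand-in class rather
than the PDFBox outline type, MAX_BOOKMARK_RECURSIONS is a hypothetical
constant named by analogy with MAX_ACROFORM_RECURSIONS, and the
"truncated" flag is just one possible way to record that something was
cut off.

```java
import java.util.ArrayList;
import java.util.List;

public class BookmarkDepthLimitSketch {

    // Hypothetical limit; the value 10 only mirrors MAX_ACROFORM_RECURSIONS.
    static final int MAX_BOOKMARK_RECURSIONS = 10;

    // Stand-in for an outline item; NOT the PDFBox class.
    static class Bookmark {
        final String title;
        final List<Bookmark> children = new ArrayList<>();

        Bookmark(String title) {
            this.title = title;
        }

        Bookmark add(Bookmark child) {
            children.add(child);
            return this;
        }
    }

    // Collected titles plus a flag saying whether anything was skipped.
    static class Result {
        final List<String> titles = new ArrayList<>();
        boolean truncated = false;
    }

    static Result extract(Bookmark root) {
        Result result = new Result();
        walk(root, 0, result);
        return result;
    }

    private static void walk(Bookmark node, int depth, Result out) {
        // Stop descending once the limit is reached and remember that we did.
        if (depth >= MAX_BOOKMARK_RECURSIONS) {
            out.truncated = true;
            return;
        }
        out.titles.add(node.title);
        for (Bookmark child : node.children) {
            walk(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        // Build a pathological chain: 150 bookmarks, each nested in the previous.
        Bookmark root = new Bookmark("level 0");
        Bookmark current = root;
        for (int i = 1; i < 150; i++) {
            Bookmark next = new Bookmark("level " + i);
            current.add(next);
            current = next;
        }
        Result result = extract(root);
        System.out.println(result.titles.size() + " titles, truncated=" + result.truncated);
    }
}
```

With a limit like this the nesting stays bounded no matter how the
outline is built, and a normal, shallow outline is extracted unchanged.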

Pros:
- no more errors, and the PDF main text would be fully available after
the parser runs
- for most PDFs nothing would change; for these strange ones you would
get *some* bookmarks extracted

Cons:
- hiding a possible issue with the document from the user? Is there a
preferred way in Tika to log "extracted but truncated, please check"?
That would also be good for the existing AcroForms behavior
- would be another magic number, maybe configurable in the future via
TIKA-2642?


-
Cristian Vat
