I have a few PDF files that trigger Zip Bomb protections:

org.apache.tika.exception.TikaException: Zip bomb detected!
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting

One example PDF is: https://www.princeton.edu/~hmilner/milner_ipe_money.pdf

The Zip Bomb detection triggers because of a very unusual outline/bookmarks structure, which causes deep recursion in AbstractPDF2XHTML.extractBookmarkText. It fails with the tika-app GUI and on the command line as XML, text, and text main content only (-T).

AbstractPDF2XHTML already has a hard recursion limit for AcroForms (MAX_ACROFORM_RECURSIONS = 10). Does it make sense to add a limit for PDF bookmark extraction as well? If it seems like a good idea, I can create an issue and have a go at a patch in the next few days.

Pros:
- No more errors, and the PDF main text would be fully available after the parser runs
- For most PDFs nothing would change; for these strange ones you would get *some* bookmarks extracted

Cons:
- Hiding a possible issue with the document from the user? Is there a preferred way in Tika to log "extracted but truncated, please check"? That would also be good for the existing AcroForms behavior
- It would be another magic number, maybe configurable in the future via TIKA-2642?

- Cristian Vat
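
For illustration, here is a minimal sketch of the kind of depth limit proposed above. It deliberately does not use the Tika or PDFBox classes: OutlineItem, Result, extract and MAX_BOOKMARK_DEPTH are hypothetical stand-ins, and the value 10 is only borrowed from the existing AcroForms limit. The truncated flag is one possible way to surface "extracted but truncated", not an existing Tika mechanism.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: models a bookmark tree with a plain class instead of PDFBox types.
class BookmarkDepthLimitSketch {

    // Hypothetical limit, mirroring MAX_ACROFORM_RECURSIONS = 10.
    static final int MAX_BOOKMARK_DEPTH = 10;

    // Hypothetical stand-in for a PDF outline (bookmark) entry.
    static final class OutlineItem {
        final String title;
        final List<OutlineItem> children = new ArrayList<>();

        OutlineItem(String title) {
            this.title = title;
        }

        OutlineItem add(OutlineItem child) {
            children.add(child);
            return this;
        }
    }

    // Collected titles plus a flag telling the caller that something was skipped.
    static final class Result {
        final List<String> titles = new ArrayList<>();
        boolean truncated;
    }

    static Result extract(OutlineItem root, int maxDepth) {
        Result result = new Result();
        walk(root, 0, maxDepth, result);
        return result;
    }

    // Depth-first walk; the root is at depth 0, so at most maxDepth levels are emitted.
    private static void walk(OutlineItem item, int depth, int maxDepth, Result out) {
        if (depth >= maxDepth) {
            // Stop descending instead of nesting further, and remember that we did.
            out.truncated = true;
            return;
        }
        out.titles.add(item.title);
        for (OutlineItem child : item.children) {
            walk(child, depth + 1, maxDepth, out);
        }
    }
}
```

With a limit like this, a pathological outline nested 200 levels deep would yield only its first 10 levels and set truncated, while an ordinary shallow outline would be extracted unchanged.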
