I'd recommend running Tika externally to Solr, which will allow you to
catch this kind of problem and prevent it from bringing down Solr.
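
A minimal sketch of what that could look like, in Python with only the
standard library. It shells out to the same tika-app.jar mentioned in the
quoted message, treats a non-zero exit code or a timeout as "this document
failed", and only then sends the extracted text to Solr's JSON update
handler. The jar path, the timeout, the field name content_txt, and the
update URL are all assumptions for illustration, not values from this
thread, and it assumes a failed parse shows up as a non-zero exit code.

```python
import json
import subprocess
import urllib.request


def extract_text(path, tika_jar="tika-app.jar", timeout=60, run=subprocess.run):
    """Run tika-app in a separate process; return its text output or None.

    A parse failure, a hang, or a missing java binary only affects this
    child process, so Solr never sees the problem document.
    """
    cmd = ["java", "-jar", tika_jar, "--text", path]
    try:
        result = run(cmd, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout


def build_solr_doc(doc_id, text):
    # "content_txt" is a hypothetical field name; use one from your schema.
    return {"id": doc_id, "content_txt": text}


def post_to_solr(docs, update_url):
    """POST a list of documents to a Solr JSON update URL (assumed to exist)."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def index_file(path, update_url):
    text = extract_text(path)
    if text is None:
        # Log and skip; the bad document is handled outside Solr.
        print("skipping %s: extraction failed" % path)
        return False
    post_to_solr([build_solr_doc(path, text)], update_url)
    return True
```

Calling index_file("report.docx", "http://localhost:8983/solr/mycollection/update?commit=true")
would then index one file, with both arguments being placeholders. The
point of the design is that extraction failures are caught and logged in
your own process instead of inside Solr.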



On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca> wrote:

> Hello!
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we
> have in our Sharepoint system. I have used the tika-app.jar directly to
> extract the document in question and it does _not_ throw an exception and
> extracts the contents just fine. So it would seem Solr is doing something
> different from a standalone Tika installation.
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the HTML
> document to Tika. As Tika limits nested elements to 100, this causes Tika
> to throw an exception: Suspected zip bomb: 100 levels of XML element
> nesting. This is mentioned in TIKA-2091
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> The "solution" is to use Tika's default parsing/mapping mechanism, but no
> details have been provided on how to configure this in Solr.
> I'm hoping some folks here know how to configure Solr to effectively
> bypass its built-in MostlyPassthroughHtmlMapper and use Tika's
> implementation.
> Thank you!
> Harinder
