We should add a chatbot to the list that posts Charlie's advice and the link
to Erick's blog post whenever Tika comes up. 😊
From: Charlie Hull [mailto:char...@flax.co.uk]
Sent: Monday, April 9, 2018 12:44 PM
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document
instead of Solr's MostlyPassthroughHtmlMapper ?
I'd recommend you run Tika externally to Solr, which will allow you to catch
this kind of problem and prevent it bringing down your Solr installation.
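
A minimal sketch of that setup, in Python using only the standard library. It
assumes a standalone Tika Server listening on its default port 9998 and a
hypothetical Solr collection named "mycollection" with "id" and "content"
fields; the function names are illustrative, not part of Tika or Solr. The
point is only that a parse failure is caught in the client and the document
is skipped, so Solr never sees it.

```python
import json
import urllib.request

# Assumptions: Tika Server on its default port, and a hypothetical collection name.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def extract_text(raw_bytes, tika_url=TIKA_URL, timeout=60):
    """Send raw document bytes to Tika Server; return plain text, or None on failure."""
    request = urllib.request.Request(
        tika_url,
        data=raw_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # URLError, HTTPError and timeouts are all OSError subclasses.
        # A failed parse (for example a "zip bomb" error) ends up here,
        # in the client, rather than inside Solr.
        print(f"Tika extraction failed: {exc}")
        return None


def build_solr_doc(doc_id, text):
    """Build a Solr document; the field names are hypothetical."""
    return {"id": doc_id, "content": text}


def encode_update(docs):
    """Encode a list of documents as a JSON body for Solr's update handler."""
    return json.dumps(docs).encode("utf-8")


def index_file(doc_id, path, solr_url=SOLR_UPDATE_URL):
    """Extract one file with Tika and index it; return True if it was sent to Solr."""
    with open(path, "rb") as handle:
        text = extract_text(handle.read())
    if text is None:
        return False  # skip the problem document and carry on with the rest
    request = urllib.request.Request(
        solr_url,
        data=encode_update([build_solr_doc(doc_id, text)]),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status == 200
```

Calling index_file("doc-1", "report.html") (both values are placeholders)
would then either index the extracted text or return False and leave Solr
untouched.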
On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca> wrote:
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
> we have in our Sharepoint system. I have used the tika-app.jar
> directly to extract the document in question and it does _not_ throw
> an exception and extracts the contents just fine. So it would seem Solr
> is doing something different than a Tika standalone installation.
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper) which passes through all elements in the
> HTML document to Tika. As Tika limits nested elements to 100, this
> causes Tika to throw an exception: Suspected zip bomb: 100 levels of
> XML element nesting. This is mentioned in TIKA-2091 (comment 15514131). The
> "solution" is to use Tika's default parsing/mapping mechanism but no
> details have been provided on how to configure this in Solr.
> I'm hoping some folks here have the knowledge on how to configure Solr
> to effectively bypass its built-in MostlyPassthroughHtmlMapper and
> use Tika's implementation.
> Thank you!