Thank you, Charlie and Tim.
I will integrate Tika into my Java app and use SolrJ to send the data to Solr.

-----Original Message-----
From: Allison, Timothy B. [] 
Sent: Monday, April 09, 2018 11:24 AM
Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from HTML 
document instead of Solr's MostlyPassthroughHtmlMapper ?


We should add a chatbot to the list that includes Charlie's advice and the link 
to Erick's blog post whenever Tika is used. 😊

-----Original Message-----
From: Charlie Hull [] 
Sent: Monday, April 9, 2018 12:44 PM
Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document 
instead of Solr's MostlyPassthroughHtmlMapper ?

I'd recommend you run Tika externally to Solr, which will allow you to catch 
this kind of problem and prevent it bringing down your Solr installation.
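
A minimal sketch of that idea, assuming extraction runs in the client application and each document is handled in isolation. TextExtractor and Indexer are hypothetical stand-ins for a Tika-backed extractor and a SolrJ-backed indexer; they are not part of either library. A failure such as the zip bomb exception is recorded for that one document and the rest of the batch still gets indexed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: extract text outside Solr and isolate per-document failures. */
final class ExternalExtractionSketch {

    /** Hypothetical stand-in for a Tika-backed extractor living in the client app. */
    interface TextExtractor {
        String extract(String docId, byte[] rawBytes) throws Exception;
    }

    /** Hypothetical stand-in for a SolrJ-backed indexer. */
    interface Indexer {
        void index(String docId, String text) throws Exception;
    }

    /**
     * Extracts and indexes every document in iteration order.
     * Returns a map of document id to failure description for the documents
     * that could not be extracted or indexed.
     */
    static Map<String, String> run(Map<String, byte[]> docs,
                                   TextExtractor extractor,
                                   Indexer indexer) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> doc : docs.entrySet()) {
            try {
                String text = extractor.extract(doc.getKey(), doc.getValue());
                indexer.index(doc.getKey(), text);
            } catch (Exception ex) {
                // One bad document is logged and skipped; Solr never sees it.
                failures.put(doc.getKey(), ex.toString());
            }
        }
        return failures;
    }
}
```

In a real application the extractor would wrap the Tika parser and the indexer would wrap a SolrJ client. This sketch only catches Exception; an OutOfMemoryError or similar would still take down the client JVM, but not Solr.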



On 9 April 2018 at 16:59, Hanjan, Harinder <>


> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
> we have in our SharePoint system. I have used the tika-app.jar
> directly to extract the document in question and it does _not_ throw
> an exception and extracts the contents just fine. So it would seem Solr
> is doing something different than a Tika standalone installation.
>
> After some Googling, I found out that Solr uses its custom HtmlMapper
> (MostlyPassthroughHtmlMapper), which passes through all elements in the
> HTML document to Tika. As Tika limits nested elements to 100, this
> causes Tika to throw an exception: Suspected zip bomb: 100 levels of
> XML element nesting. This is mentioned in TIKA-2091
> (jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> The "solution" is to use Tika's default parsing/mapping mechanism but no
> details have been provided on how to configure this in Solr.
>
> I'm hoping some folks here have the knowledge on how to configure Solr
> to effectively bypass its built-in MostlyPassthroughHtmlMapper and
> use Tika's implementation.
>
> Thank you!
> Harinder
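
To illustrate the kind of nesting limit described in the quoted message, here is a rough sketch using only the JDK's SAX parser. It is not Tika's implementation: the class names are made up for this example, the exception text is copied from the message above, and the exact boundary at which Tika rejects a document may differ from the "depth > maxDepth" check chosen here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: reject XML whose element nesting goes deeper than a limit. */
final class NestingLimitSketch {

    /** Hypothetical exception type so callers can tell this case from malformed XML. */
    static final class NestingLimitException extends SAXException {
        NestingLimitException(int depth) {
            super("Suspected zip bomb: " + depth + " levels of XML element nesting");
        }
    }

    /** Counts open elements and aborts the parse once the limit is exceeded. */
    static final class DepthLimitHandler extends DefaultHandler {
        private final int maxDepth;
        private int depth;

        DepthLimitHandler(int maxDepth) {
            this.maxDepth = maxDepth;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) throws SAXException {
            depth++;
            if (depth > maxDepth) {
                throw new NestingLimitException(depth);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
        }
    }

    /** Returns true if the XML parses without nesting deeper than maxDepth. */
    static boolean withinLimit(String xml, int maxDepth) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DepthLimitHandler(maxDepth));
            return true;
        } catch (NestingLimitException e) {
            return false;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Malformed input or parser setup trouble is a different problem.
            throw new IllegalStateException("Parse failed for another reason", e);
        }
    }

    /** Builds a document with the given number of nested <d> elements. */
    static String nested(int levels) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < levels; i++) sb.append("<d>");
        for (int i = 0; i < levels; i++) sb.append("</d>");
        return sb.toString();
    }
}
```

A mapper that passes every HTML element through, including deeply nested wrappers, pushes the depth counter up much faster than one that drops or flattens most elements, which may be why the same document behaves differently in the two setups.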




