RE: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

Hanjan, Harinder Mon, 09 Apr 2018 15:21:07 -0700

Oh this is great! Saves me a whole bunch of manual work.

Thanks!


-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Monday, April 09, 2018 2:15 PM
To: solr-user@lucene.apache.org
Subject: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML 
document instead of Solr's MostlyPassthroughHtmlMapper ?

As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service 
https://github.com/mattflax/dropwizard-tika-server
 written by a colleague of mine at Flax. Hope this is useful.

Cheers

Charlie

On 9 April 2018 at 19:26, Hanjan, Harinder <harinder.han...@calgary.ca>
wrote:

> Thank you Charlie, Tim.
> I will integrate Tika in my Java app and use SolrJ to send data to Solr.
>
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, April 09, 2018 11:24 AM
> To: solr-user@lucene.apache.org
> Subject: [EXT] RE: How to use Tika (Solr Cell) to extract content from 
> HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> +1
>
>
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
>
>
> We should add a chatbot to the list that includes Charlie's advice and 
> the link to Erick's blog post whenever Tika is used. 😊
>
> -----Original Message-----
> From: Charlie Hull [mailto:char...@flax.co.uk]
> Sent: Monday, April 9, 2018 12:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML
> document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> I'd recommend you run Tika externally to Solr, which will allow you to
> catch this kind of problem and prevent it bringing down your Solr
> installation.
>
> Cheers
>
> Charlie
>
> On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca>
> wrote:
>
> > Hello!
> >
> > Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
> > we have in our SharePoint system. I have used the tika-app.jar
> > directly to extract the document in question and it does _not_ throw
> > an exception and extracts the contents just fine. So it would seem Solr
> > is doing something different than a Tika standalone installation.
> >
> > After some Googling, I found out that Solr uses its custom HtmlMapper
> > (MostlyPassthroughHtmlMapper), which passes through all elements in the
> > HTML document to Tika. As Tika limits nested elements to 100, this
> > causes Tika to throw an exception: Suspected zip bomb: 100 levels of
> > XML element nesting. This is mentioned in TIKA-2091
> > (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> > The "solution" is to use Tika's default parsing/mapping mechanism but no
> > details have been provided on how to configure this in Solr.
> >
> > I'm hoping some folks here know how to configure Solr to effectively
> > bypass its built-in MostlyPassthroughHtmlMapper and use Tika's
> > implementation.
> >
> > Thank you!
> > Harinder
>
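
The nesting limit described in the question can be illustrated with a small SAX handler that counts open elements and aborts once a threshold is passed. This is only a sketch using the JDK's built-in parser, not Tika's actual code: the class name NestingDepthGuard, the exception TooDeepException, the helper methods, and the exact comparison used at the threshold are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a depth-counting SAX handler, not Tika's implementation.
public class NestingDepthGuard extends DefaultHandler {

    // Hypothetical exception type so callers can tell "too deep" from "malformed".
    static final class TooDeepException extends SAXException {
        TooDeepException(String message) {
            super(message);
        }
    }

    private final int maxDepth;
    private int depth = 0;

    public NestingDepthGuard(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) throws SAXException {
        depth++;
        // Abort parsing as soon as the nesting goes past the configured limit.
        if (depth > maxDepth) {
            throw new TooDeepException(
                "More than " + maxDepth + " levels of XML element nesting");
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    // Returns true if the document nests elements deeper than maxDepth.
    static boolean exceedsLimit(String xml, int maxDepth) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new NestingDepthGuard(maxDepth));
            return false;
        } catch (TooDeepException e) {
            return true;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    // Builds a test document with the given number of nested <d> elements.
    static String nested(int levels) {
        return "<d>".repeat(levels) + "</d>".repeat(levels);
    }
}
```

Calling NestingDepthGuard.exceedsLimit(NestingDepthGuard.nested(101), 100) trips the guard, while a document with 100 levels passes. A mapper that forwards every HTML element, as the question describes, would make a deeply nested page more likely to hit such a limit.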
>
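
The advice in this thread (run extraction outside Solr, then send the results with SolrJ) comes down to catching a failure per document so that one bad file does not stop everything else. The sketch below shows only that control flow, under stated assumptions: Extractor is a hypothetical stand-in for the real extraction call (for example Tika's parser), Result and extractAll are illustrative names, and the step that would send each extracted text to Solr is left as a comment rather than guessed at.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: per-document failure isolation in a standalone indexer.
public class ExternalExtractionSketch {

    // Hypothetical stand-in for the real extraction call.
    interface Extractor {
        String extract(byte[] raw) throws Exception;
    }

    // Outcome of a batch: extracted text by document id, plus ids that failed.
    static final class Result {
        final Map<String, String> extracted = new LinkedHashMap<>();
        final List<String> failed = new ArrayList<>();
    }

    static Result extractAll(Map<String, byte[]> docs, Extractor extractor) {
        Result result = new Result();
        for (Map.Entry<String, byte[]> doc : docs.entrySet()) {
            try {
                String text = extractor.extract(doc.getValue());
                // In a real application, this is where the text would be
                // added to a Solr document and sent with SolrJ.
                result.extracted.put(doc.getKey(), text);
            } catch (Exception e) {
                // A failure (such as a "zip bomb" exception) is recorded and
                // the loop moves on to the next document.
                result.failed.add(doc.getKey());
            }
        }
        return result;
    }
}
```

Catching Exception does not cover errors such as running out of memory. Guarding against those would need extraction to run in a separate process, which is what a standalone Tika web service provides.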
