RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-12 Thread Allison, Timothy B.
There's also, of course, tika-server. 

No matter the method, it is always best to isolate Tika in its own JVM, VM, or machine.
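For anyone who wants to try the tika-server route, here is a minimal client sketch. It assumes tika-server is already running on localhost at its default port (9998) and uses the `/tika` endpoint with a PUT request; the function names are made up for illustration. The explicit `application/octet-stream` content type is there so the server detects the document type itself rather than trusting a client-supplied hint.

```python
import urllib.request

# Assumption: tika-server is running locally on its default port, 9998.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking tika-server for plain-text output."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Generic type so the server detects the real format itself.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(data: bytes, url: str = TIKA_URL, timeout: float = 60.0) -> str:
    """Send raw document bytes to tika-server and return the extracted text."""
    request = build_tika_request(data, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Because the parsing happens in the server's process, a document that blows up Tika takes down (at worst) the tika-server process, not Solr or the indexing client.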





Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-10 Thread David Hastings
I actually used Solr 5.x, the MoreLikeThis feature, and a subset of
human-tagged data (about 10%) to apply subject coding to over 2 million
documents with around a 95% accuracy rate, so it is definitely doable.
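The message above does not say how the MoreLikeThis results were turned into subject codes, so the following is only one plausible sketch: run a MoreLikeThis query against the human-tagged subset, then let the returned neighbours vote, weighted by their similarity score. The function name and the (subject, score) input shape are assumptions for illustration, not the method actually used.

```python
from collections import defaultdict
from typing import Iterable, Optional, Tuple


def vote_subject(neighbors: Iterable[Tuple[str, float]]) -> Optional[str]:
    """Pick a subject code from (subject, similarity_score) pairs.

    Each pair is assumed to come from one human-tagged document returned
    by a MoreLikeThis query. Scores are summed per subject and the subject
    with the highest total wins. Returns None if there are no neighbours.
    """
    totals = defaultdict(float)
    for subject, score in neighbors:
        totals[subject] += score
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hypothetical neighbours for one untagged document.
print(vote_subject([("physics", 2.0), ("chemistry", 1.5), ("chemistry", 1.0)]))
```

Here "chemistry" wins with a summed score of 2.5 against 2.0 for "physics".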



Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-10 Thread Alexandre Rafalovitch
I know it was a joke, but I've been thinking of something like that.
Not a chatbot per se, but perhaps something that uses machine
learning/topic clustering on past discussions and matches them to
new questions. It would still need to be rechecked by a human for the
final response, but could be very helpful. I certainly wished for that
many times as I was answering newbies' questions (or my own).

And, I feel, the current version of Solr actually has all the pieces to
make such a thing happen. Could be a fun project/demo/service for
the next Lucene/Solr Revolution for somebody with time on their hands
:-)

Regards,
   Alex.



RE: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Oh this is great! Saves me a whole bunch of manual work.

Thanks!



Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web
service https://github.com/mattflax/dropwizard-tika-server written by a
colleague of mine at Flax. Hope this is useful.

Cheers

Charlie



RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Thank you Charlie, Tim.
I will integrate Tika in my Java app and use SolrJ to send data to Solr. 
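The plan above uses SolrJ from Java. As a language-neutral illustration of the same "extract outside Solr, then send plain fields" idea, the sketch below swaps SolrJ for Solr's JSON update endpoint over plain HTTP. The collection name `mycollection` and the field names `id` and `content` are hypothetical and would need to match the real schema.

```python
import json
import urllib.request
from typing import Dict, List

# Assumptions: Solr on its default port and a collection named "mycollection".
# commit=true keeps the sketch simple; bulk indexing would normally batch
# documents and commit far less often.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"


def build_solr_update(docs: List[Dict[str, str]],
                      url: str = SOLR_UPDATE_URL) -> urllib.request.Request:
    """Build a POST request that sends documents to Solr as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_docs(docs: List[Dict[str, str]], url: str = SOLR_UPDATE_URL,
               timeout: float = 30.0) -> dict:
    """Send already-extracted documents to Solr and return its JSON reply."""
    with urllib.request.urlopen(build_solr_update(docs, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With this split, Solr only ever receives text that has already been extracted, so a Tika failure on one document becomes an error in the indexing client rather than inside Solr.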




RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Allison, Timothy B.
+1

https://lucidworks.com/2012/02/14/indexing-with-solrj/

We should add a chatbot to the list that includes Charlie's advice and the link 
to Erick's blog post whenever Tika is used. 




Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
I'd recommend you run Tika externally to Solr, which will allow you to
catch this kind of problem and prevent it bringing down your Solr
installation.
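One way to picture "running Tika externally" is a child JVM per document, so a crash, hang, or out-of-memory error stays in that child process. The sketch below shells out to tika-app.jar with its `--text` option; the jar path, heap size, and timeout are placeholder assumptions, and starting a JVM per file is slow, so this is meant to show the isolation idea rather than a tuned setup.

```python
import subprocess
from typing import List, Optional

# Hypothetical location of the tika-app jar; adjust to the local install.
TIKA_APP_JAR = "tika-app.jar"


def tika_command(path: str, jar: str = TIKA_APP_JAR,
                 max_heap: str = "512m") -> List[str]:
    """Build the command line that extracts plain text in a separate JVM."""
    return ["java", f"-Xmx{max_heap}", "-jar", jar, "--text", path]


def extract_isolated(path: str, timeout: float = 120.0) -> Optional[str]:
    """Run Tika in a child process; return the text, or None on any failure.

    A timeout, a non-zero exit code (for example a parse exception or an
    OutOfMemoryError in the child), or a missing java binary is caught
    here and never reaches the caller's process.
    """
    try:
        done = subprocess.run(
            tika_command(path),
            capture_output=True,
            timeout=timeout,
            check=True,
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError):
        return None
    return done.stdout.decode("utf-8", errors="replace")
```

A caller can log the documents for which this returns None and carry on indexing the rest.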

Cheers

Charlie

On 9 April 2018 at 16:59, Hanjan, Harinder wrote:

> Hello!
>
> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we
> have in our SharePoint system. I have used the tika-app.jar directly to
> extract the document in question and it does _not_ throw an exception and
> extracts the contents just fine. So it would seem Solr is doing something
> different than a standalone Tika installation.
>
> After some Googling, I found out that Solr uses its own custom HtmlMapper
> (MostlyPassthroughHtmlMapper), which passes through all elements in the HTML
> document to Tika. As Tika limits nested elements to 100, this causes Tika
> to throw an exception: "Suspected zip bomb: 100 levels of XML element
> nesting". This is mentioned in TIKA-2091
> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> The "solution" is to use Tika's default parsing/mapping mechanism, but no
> details have been provided on how to configure this in Solr.
>
> I'm hoping some folks here know how to configure Solr to effectively
> bypass its built-in MostlyPassthroughHtmlMapper and use Tika's
> implementation.
>
> Thank you!
> Harinder
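To make the nesting limit in the question concrete, here is a small stand-alone sketch of a depth guard of the kind described. It is not Tika's code: the class and function names are made up, and the exact boundary condition Tika applies at 100 levels is not confirmed here. The point is only that if every element of a deeply nested HTML page is passed through to such a guard, the guard trips, whereas a mapper that drops most wrapper elements keeps the depth low.

```python
import xml.sax


class NestingLimitExceeded(Exception):
    """Raised when element nesting goes deeper than the allowed maximum."""


class DepthLimitHandler(xml.sax.ContentHandler):
    """SAX handler that tracks element depth and aborts past a limit."""

    def __init__(self, max_depth: int = 100) -> None:
        super().__init__()
        self.max_depth = max_depth
        self.depth = 0

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth > self.max_depth:
            raise NestingLimitExceeded(
                f"more than {self.max_depth} levels of element nesting"
            )

    def endElement(self, name):
        self.depth -= 1


def within_nesting_limit(xml_bytes: bytes, max_depth: int = 100) -> bool:
    """Return True if the document never nests deeper than max_depth."""
    try:
        xml.sax.parseString(xml_bytes, DepthLimitHandler(max_depth))
    except NestingLimitExceeded:
        return False
    return True
```

For example, a document made of 101 nested `<div>` elements fails this check with the default limit, while the same content nested 100 deep passes.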