Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

Alexandre Rafalovitch Tue, 10 Apr 2018 07:41:53 -0700

I know it was a joke, but I've been thinking of something like that.
Not a chatbot per se, but perhaps something that uses Machine
Learning/topic clustering on the past discussions and matches them to
new questions. The result would still need to be rechecked by a human
before the final response, but it could be very helpful. I certainly
wished for that many times as I was answering newbies' questions (or
my own).


And, I feel, the current version of Solr actually has all the pieces to
make such a thing happen. Could be a fun project/demo/service for
the next Lucene/Solr Revolution for somebody with time on their hands
:-)

Regards,
   Alex.

On 9 April 2018 at 13:24, Allison, Timothy B. <talli...@mitre.org> wrote:
> +1
>
> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>
> We should add a chatbot to the list that includes Charlie's advice and the 
> link to Erick's blog post whenever Tika is used. 😊
>
>
> -----Original Message-----
> From: Charlie Hull [mailto:char...@flax.co.uk]
> Sent: Monday, April 9, 2018 12:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to use Tika (Solr Cell) to extract content from HTML 
> document instead of Solr's MostlyPassthroughHtmlMapper ?
>
> I'd recommend you run Tika externally to Solr, which will allow you to catch 
> this kind of problem and prevent it bringing down your Solr installation.
>
> Cheers
>
> Charlie
>
> On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca>
> wrote:
>
>> Hello!
>>
>> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
>> we have in our SharePoint system. I have used the tika-app.jar
>> directly to extract the document in question and it does _not_ throw
>> an exception and extracts the contents just fine. So it would seem Solr
>> is doing something different than a Tika standalone installation.
>>
>> After some Googling, I found out that Solr uses its custom HtmlMapper
>> (MostlyPassthroughHtmlMapper) which passes through all elements in the
>> HTML document to Tika. As Tika limits nested elements to 100, this
>> causes Tika to throw an exception: Suspected zip bomb: 100 levels of
>> XML element nesting. This is mentioned in TIKA-2091
>> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
>> The "solution" is to use Tika's default parsing/mapping mechanism but no
>> details have been provided on how to configure this in Solr.
>>
>> I'm hoping some folks here have the knowledge on how to configure Solr
>> to effectively bypass its built-in MostlyPassthroughHtmlMapper and
>> use Tika's implementation.
>>
>> Thank you!
>> Harinder
>>
>>
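A note on the nesting limit described in the original question: it is possible to estimate how deeply a document's elements are nested before the document ever reaches Solr. What follows is a minimal Python sketch, not Tika's implementation. It uses only the standard library's html.parser, and the names DepthCounter, max_nesting_depth and NESTING_LIMIT are made up for illustration. The value 100 comes from the exception message quoted above; Tika may count levels differently (for example around unclosed tags), so treat the result as a rough pre-check only.

```python
from html.parser import HTMLParser

# Limit quoted in the thread's exception message; assumed here, not read from Tika.
NESTING_LIMIT = 100

# HTML void elements never take a closing tag, so they must not stay "open".
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class DepthCounter(HTMLParser):
    """Tracks the deepest element nesting seen while parsing."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        # Every element, void or not, sits one level below its parent.
        self.max_depth = max(self.max_depth, self.depth + 1)
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore stray end tags so the counter never goes negative.
        if tag not in VOID_ELEMENTS and self.depth > 0:
            self.depth -= 1

    def handle_startendtag(self, tag, attrs):
        # Self-closed tag such as <br/>: opens and closes immediately.
        self.max_depth = max(self.max_depth, self.depth + 1)


def max_nesting_depth(html):
    """Return the deepest element nesting level found in an HTML string."""
    parser = DepthCounter()
    parser.feed(html)
    parser.close()
    return parser.max_depth


if __name__ == "__main__":
    sample = "<div>" * 150 + "text" + "</div>" * 150
    depth = max_nesting_depth(sample)
    if depth >= NESTING_LIMIT:
        print(f"depth {depth}: likely to trip the nesting check")
```

Tags that are never closed (a bare <p> or <li>, for instance) stay open in this sketch and inflate the count, so it errs on the side of flagging documents. A flagged document could then be cleaned up or routed through an external Tika process, as suggested earlier in the thread.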
