I actually used Solr 5.x, its MoreLikeThis feature, and a subset of human-tagged data (about 10%) to apply subject coding to over 2 million documents with around 95% accuracy, so it is definitely doable.
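For anyone curious what that looks like in practice, here is a rough sketch of the kind of MoreLikeThis handler request involved: query by the untagged document's id, compare on a text field, and tally the subject codes of the human-tagged neighbours that come back. The core name (`docs`), field names (`body`, `subject`), and parameter values are my assumptions for illustration, not what was actually deployed:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a MoreLikeThis handler URL of the sort used to find tagged
// documents similar to an untagged one; the neighbours' human-assigned
// subject codes can then be tallied to pick a label.
public class MltQuery {

    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Hypothetical core ("docs") and fields ("body", "subject").
    static String buildMltUrl(String solrBase, String docId) {
        return solrBase + "/docs/mlt"
            + "?q=" + enc("id:" + docId)   // seed document
            + "&mlt.fl=" + enc("body")     // field(s) to compare on
            + "&mlt.mindf=2"               // ignore terms rare in the corpus
            + "&mlt.mintf=2"               // ignore terms rare in the doc
            + "&fl=" + enc("id,subject,score")
            + "&rows=10";                  // neighbours to vote on the label
    }

    public static void main(String[] args) {
        System.out.println(buildMltUrl("http://localhost:8983/solr", "42"));
    }
}
```

The mindf/mintf cut-offs matter a lot for accuracy on noisy text; they are worth tuning per corpus.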
On Tue, Apr 10, 2018 at 10:40 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> I know it was a joke, but I've been thinking of something like that.
> Not a chatbot per se, but perhaps something that uses Machine
> Learning/topic clustering on the past discussions and matches them to
> the new questions. It would still need to be rechecked by a human for the
> final response, but could be very helpful. I certainly wished for that
> many times as I was answering newbies' questions (or my own).
>
> And, I feel, the current version of Solr actually has all the pieces to
> make such a thing happen. Could be a fun project/demo/service for
> the next Lucene/Solr Revolution for somebody with time on their hands
> :-)
>
> Regards,
>    Alex.
>
> On 9 April 2018 at 13:24, Allison, Timothy B. <talli...@mitre.org> wrote:
> > +1
> >
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > We should add a chatbot to the list that includes Charlie's advice and
> > the link to Erick's blog post whenever Tika is used. 😊
> >
> > -----Original Message-----
> > From: Charlie Hull [mailto:char...@flax.co.uk]
> > Sent: Monday, April 9, 2018 12:44 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
> >
> > I'd recommend you run Tika externally to Solr, which will allow you to
> > catch this kind of problem and prevent it from bringing down your Solr
> > installation.
> >
> > Cheers
> >
> > Charlie
> >
> > On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca> wrote:
> >
> >> Hello!
> >>
> >> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
> >> we have in our SharePoint system. I have used the tika-app.jar
> >> directly to extract the document in question and it does _not_ throw
> >> an exception; it extracts the contents just fine. So it would seem Solr
> >> is doing something different from a standalone Tika installation.
> >>
> >> After some Googling, I found out that Solr uses its own custom HtmlMapper
> >> (MostlyPassthroughHtmlMapper), which passes through all elements in the
> >> HTML document to Tika. As Tika limits nested elements to 100, this
> >> causes Tika to throw an exception: "Suspected zip bomb: 100 levels of
> >> XML element nesting". This is mentioned in TIKA-2091
> >> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> >> The "solution" is to use Tika's default parsing/mapping mechanism, but no
> >> details have been provided on how to configure this in Solr.
> >>
> >> I'm hoping some folks here know how to configure Solr to effectively
> >> bypass its built-in MostlyPassthroughHtmlMapper and use Tika's
> >> implementation.
> >>
> >> Thank you!
> >> Harinder
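To make the failure mode in the quoted thread concrete: the limit fires as soon as the parser sees more than 100 elements open at once, and pass-through mapping makes that far more likely on messy real-world HTML, because the wrapper tags that default mapping would drop all count toward the depth. A small stdlib-only sketch of that depth check (the counting code is illustrative, not Tika's actual implementation; only the 100 cut-off comes from the error message):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Computes the maximum element nesting depth of a well-formed XML/XHTML
// string, mimicking the safety check behind "Suspected zip bomb: 100
// levels of XML element nesting". Illustrative only -- not Tika's code.
public class NestingDepth {

    public static int maxDepth(String xml) throws Exception {
        final int[] depth = {0, 0}; // [current depth, max depth seen]
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes attrs) {
                    depth[0]++;
                    depth[1] = Math.max(depth[1], depth[0]);
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    depth[0]--;
                }
            });
        return depth[1];
    }

    public static void main(String[] args) throws Exception {
        // 120 nested <div>s -- deep enough to trip a 100-level limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 120; i++) sb.append("<div>");
        sb.append("x");
        for (int i = 0; i < 120; i++) sb.append("</div>");
        System.out.println(maxDepth(sb.toString()));
    }
}
```

With default HTML mapping most of those nested wrappers never reach the depth counter, which is why the same file parses fine through standalone tika-app.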
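Charlie's "run Tika externally" advice (and Erick's SolrJ post linked above) usually comes down to: extract text in a separate process, e.g. with tika-app, then post the result to Solr's update endpoint yourself, so one bad document fails alone instead of taking a Solr node with it. A rough stdlib-only sketch of the posting half; the core name (`docs`), field name (`content`), and hand-rolled JSON escaping are my assumptions for illustration, and the extraction step is stubbed where tika-app output would be read:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// After extracting text out-of-process (e.g. `java -jar tika-app.jar --text file.html`),
// build a JSON payload for Solr's /update/json/docs endpoint and POST it.
public class ExternalTikaIndexer {

    // Minimal JSON string escaping for the fields we control.
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    static String buildDoc(String id, String extractedText) {
        return "{\"id\":\"" + esc(id) + "\",\"content\":\"" + esc(extractedText) + "\"}";
    }

    public static void main(String[] args) {
        // Stand-in for text produced by the external tika-app run.
        String payload = buildDoc("doc-1", "Extracted body text...");

        // Hypothetical core "docs"; commit=true is for demo purposes only.
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8983/solr/docs/update/json/docs?commit=true"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();
        // To actually send (needs a running Solr):
        // java.net.http.HttpClient.newHttpClient()
        //     .send(req, java.net.http.HttpResponse.BodyHandlers.ofString());
        System.out.println(payload);
    }
}
```

In a real pipeline you would use SolrJ (as in Erick's post) and a proper JSON library rather than string concatenation; the point is only that the Tika failure is caught in your process, not Solr's.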