I actually used Solr 5.x, its MoreLikeThis feature, and a subset of human-tagged data (about 10%) to apply subject coding to over 2 million documents with around 95% accuracy, so it is definitely doable.
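For anyone curious what that looks like in practice, here is a rough sketch of the kind of MoreLikeThis handler request involved: query by the untagged document's id, compare on a text field, and tally the subject codes of the human-tagged neighbours that come back. The core name (`docs`), field names (`body`, `subject`), and parameter values are my assumptions for illustration, not what was actually deployed:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a MoreLikeThis handler URL of the sort used to find tagged
// documents similar to an untagged one; the neighbours' human-assigned
// subject codes can then be tallied to pick a label.
public class MltQuery {

    static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    // Hypothetical core ("docs") and fields ("body", "subject").
    static String buildMltUrl(String solrBase, String docId) {
        return solrBase + "/docs/mlt"
            + "?q=" + enc("id:" + docId)   // seed document
            + "&mlt.fl=" + enc("body")     // field(s) to compare on
            + "&mlt.mindf=2"               // ignore terms rare in the corpus
            + "&mlt.mintf=2"               // ignore terms rare in the doc
            + "&fl=" + enc("id,subject,score")
            + "&rows=10";                  // neighbours to vote on the label
    }

    public static void main(String[] args) {
        System.out.println(buildMltUrl("http://localhost:8983/solr", "42"));
    }
}
```

The mindf/mintf cut-offs matter a lot for accuracy on noisy text; they are worth tuning per corpus.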
On Tue, Apr 10, 2018 at 10:40 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> I know it was a joke, but I've been thinking of something like that.
> Not a chatbot per se, but perhaps something that uses Machine
> Learning/topic clustering on the past discussions and matches them to
> the new questions. It would still need to be rechecked by a human for the
> final response, but could be very helpful. I certainly wished for that
> many times as I was answering newbies' questions (or my own).
>
> And, I feel, the current version of Solr actually has all the pieces to
> make such a thing happen. Could be a fun project/demo/service for
> the next Lucene/Solr Revolution for somebody with time on their hands
> :-)
>
> Regards,
>    Alex.
>
> On 9 April 2018 at 13:24, Allison, Timothy B. <talli...@mitre.org> wrote:
> > +1
> >
> > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> >
> > We should add a chatbot to the list that includes Charlie's advice and
> > the link to Erick's blog post whenever Tika is used. 😊
> >
> > -----Original Message-----
> > From: Charlie Hull [mailto:char...@flax.co.uk]
> > Sent: Monday, April 9, 2018 12:44 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?
> >
> > I'd recommend you run Tika externally to Solr, which will allow you to
> > catch this kind of problem and prevent it from bringing down your Solr
> > installation.
> >
> > Cheers
> >
> > Charlie
> >
> > On 9 April 2018 at 16:59, Hanjan, Harinder <harinder.han...@calgary.ca> wrote:
> >
> >> Hello!
> >>
> >> Solr (i.e. Tika) throws a "zip bomb" exception with certain documents
> >> we have in our SharePoint system. I have used the tika-app.jar
> >> directly to extract the document in question and it does _not_ throw
> >> an exception; it extracts the contents just fine. So it would seem Solr
> >> is doing something different from a standalone Tika installation.
> >>
> >> After some Googling, I found out that Solr uses its own custom HtmlMapper
> >> (MostlyPassthroughHtmlMapper), which passes through all elements in the
> >> HTML document to Tika. As Tika limits nested elements to 100, this
> >> causes Tika to throw an exception: "Suspected zip bomb: 100 levels of
> >> XML element nesting". This is mentioned in TIKA-2091
> >> (https://issues.apache.org/jira/browse/TIKA-2091?focusedCommentId=15514131&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15514131).
> >> The "solution" is to use Tika's default parsing/mapping mechanism, but no
> >> details have been provided on how to configure this in Solr.
> >>
> >> I'm hoping some folks here know how to configure Solr to effectively
> >> bypass its built-in MostlyPassthroughHtmlMapper and use Tika's
> >> implementation.
> >>
> >> Thank you!
> >> Harinder
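To make the failure mode in the quoted thread concrete: the limit fires as soon as the parser sees more than 100 elements open at once, and pass-through mapping makes that far more likely on messy real-world HTML, because the wrapper tags that default mapping would drop all count toward the depth. A small stdlib-only sketch of that depth check (the counting code is illustrative, not Tika's actual implementation; only the 100 cut-off comes from the error message):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Computes the maximum element nesting depth of a well-formed XML/XHTML
// string, mimicking the safety check behind "Suspected zip bomb: 100
// levels of XML element nesting". Illustrative only -- not Tika's code.
public class NestingDepth {

    public static int maxDepth(String xml) throws Exception {
        final int[] depth = {0, 0}; // [current depth, max depth seen]
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes attrs) {
                    depth[0]++;
                    depth[1] = Math.max(depth[1], depth[0]);
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    depth[0]--;
                }
            });
        return depth[1];
    }

    public static void main(String[] args) throws Exception {
        // 120 nested <div>s -- deep enough to trip a 100-level limit.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 120; i++) sb.append("<div>");
        sb.append("x");
        for (int i = 0; i < 120; i++) sb.append("</div>");
        System.out.println(maxDepth(sb.toString()));
    }
}
```

With default HTML mapping most of those nested wrappers never reach the depth counter, which is why the same file parses fine through standalone tika-app.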
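Charlie's "run Tika externally" advice (and Erick's SolrJ post linked above) usually comes down to: extract text in a separate process, e.g. with tika-app, then post the result to Solr's update endpoint yourself, so one bad document fails alone instead of taking a Solr node with it. A rough stdlib-only sketch of the posting half; the core name (`docs`), field name (`content`), and hand-rolled JSON escaping are my assumptions for illustration, and the extraction step is stubbed where tika-app output would be read:

```java
import java.net.URI;
import java.net.http.HttpRequest;

// After extracting text out-of-process (e.g. `java -jar tika-app.jar --text file.html`),
// build a JSON payload for Solr's /update/json/docs endpoint and POST it.
public class ExternalTikaIndexer {

    // Minimal JSON string escaping for the fields we control.
    static String esc(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    static String buildDoc(String id, String extractedText) {
        return "{\"id\":\"" + esc(id) + "\",\"content\":\"" + esc(extractedText) + "\"}";
    }

    public static void main(String[] args) {
        // Stand-in for text produced by the external tika-app run.
        String payload = buildDoc("doc-1", "Extracted body text...");

        // Hypothetical core "docs"; commit=true is for demo purposes only.
        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8983/solr/docs/update/json/docs?commit=true"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();
        // To actually send (needs a running Solr):
        // java.net.http.HttpClient.newHttpClient()
        //     .send(req, java.net.http.HttpResponse.BodyHandlers.ofString());
        System.out.println(payload);
    }
}
```

In a real pipeline you would use SolrJ (as in Erick's post) and a proper JSON library rather than string concatenation; the point is only that the Tika failure is caught in your process, not Solr's.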