RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-12 Thread Allison, Timothy B.
There's also, of course, tika-server.  No matter the method, it is always best to isolate Tika to its own jvm, vm or m. -Original Message- From: Charlie Hull [mailto:char...@flax.co.uk] Sent: Monday, April 9, 2018 4:15 PM To: solr-user@lucene.apache.org Subject: Re: How to use Tika
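
For readers who want to try the tika-server route, here is a minimal sketch of a client using only the JDK's `java.net.http` package (Java 11+). It assumes a tika-server instance is already running on its default port, 9998, and uses the server's `PUT /tika` endpoint with `Accept: text/plain`. The class name `TikaServerClient`, the helper names, and the 60-second timeout are illustrative choices, not anything taken from this thread.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Hypothetical helper: sends a document to a separately running tika-server.
final class TikaServerClient {

    // Builds a PUT request for tika-server's /tika endpoint, asking for plain text back.
    public static HttpRequest buildRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .timeout(Duration.ofSeconds(60)) // assumed limit; tune for your documents
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the document and returns the extracted text, or throws on a non-200 reply.
    public static String extractText(HttpClient client, String baseUrl, byte[] document)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(baseUrl, document), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            // A parse failure stays on the Tika side; the caller only sees an error status.
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TikaServerClient <file>");
            return;
        }
        byte[] document = Files.readAllBytes(Path.of(args[0]));
        // Assumes tika-server is listening locally on its default port.
        System.out.println(extractText(HttpClient.newHttpClient(), "http://localhost:9998", document));
    }
}
```

Because the parsing happens in the server's own JVM, a document that crashes or hangs Tika surfaces here as an error status or a request timeout rather than taking down the calling application.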

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-10 Thread David Hastings
I actually used Solr 5.x, its More Like This feature, and a subset of human-tagged data (about 10%) to apply subject coding with around 95% accuracy to over 2 million documents, so it is definitely doable. On Tue, Apr 10, 2018 at 10:40 AM, Alexandre Rafalovitch wrote:

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-10 Thread Alexandre Rafalovitch
I know it was a joke, but I've been thinking of something like that. Not a chatbot per se, but perhaps something that uses machine learning/topic clustering on past discussions and matches them to new questions. Still would need to be rechecked by a human for final response, but could be

RE: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Oh this is great! Saves me a whole bunch of manual work. Thanks! -Original Message- From: Charlie Hull [mailto:char...@flax.co.uk] Sent: Monday, April 09, 2018 2:15 PM To: solr-user@lucene.apache.org Subject: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service https://github.com/mattflax/dropwizard-tika-server written by a colleague of mine at Flax. Hope this is useful. Cheers Charlie On 9 April 2018 at 19:26, Hanjan, Harinder wrote: > Thank

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Thank you Charlie, Tim. I will integrate Tika in my Java app and use SolrJ to send data to Solr. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, April 09, 2018 11:24 AM To: solr-user@lucene.apache.org Subject: [EXT] RE: How to use Tika (Solr Cell)

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Allison, Timothy B.
+1 https://lucidworks.com/2012/02/14/indexing-with-solrj/ We should add a chatbot to the list that includes Charlie's advice and the link to Erick's blog post whenever Tika is used.  -Original Message- From: Charlie Hull [mailto:char...@flax.co.uk] Sent: Monday, April 9, 2018 12:44

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it bringing down your Solr installation. Cheers Charlie On 9 April 2018 at 16:59, Hanjan, Harinder wrote: > Hello! > > Solr (i.e. Tika) throws a "zip bomb"
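
One way to read "run Tika externally" is to launch the extraction in a child process with a hard time limit, so a problem document can only kill that process. The sketch below is an assumption-laden illustration, not the setup used by anyone in this thread: it shells out to the tika-app command-line jar with its `--text` and `--encoding=UTF-8` options, and the class name `ExternalTikaRunner`, the jar path, and the 30-second limit are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs tika-app in its own JVM so failures stay out of the caller.
final class ExternalTikaRunner {

    // Command line for plain-text extraction; the jar location is supplied by the caller.
    public static List<String> buildCommand(String tikaAppJar, String inputFile) {
        return List.of("java", "-jar", tikaAppJar, "--text", "--encoding=UTF-8", inputFile);
    }

    // Runs the command, killing it if it exceeds the time limit, and returns its stdout.
    public static String runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Stdout goes to a temp file so a large output cannot block the child on a full pipe.
        Path output = Files.createTempFile("tika-out", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .redirectOutput(output.toFile())
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // the runaway parse dies; the caller keeps running
                throw new IOException("extraction timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("extraction failed with exit code " + process.exitValue());
            }
            return Files.readString(output);
        } finally {
            Files.deleteIfExists(output);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: ExternalTikaRunner <tika-app.jar> <file>");
            return;
        }
        System.out.println(runWithTimeout(buildCommand(args[0], args[1]), 30));
    }
}
```

The caller can catch the `IOException`, log the offending file, and move on to the next document, which is the kind of containment the advice above is after.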