RE: PDF extraction using Tika

2020-08-25 Thread Srinivas Kashyap
Thanks Phil, I will modify it according to the need. Thanks, Srinivas -Original Message- From: Phil Scadden Sent: 26 August 2020 02:44 To: solr-user@lucene.apache.org Subject: RE: PDF extraction using Tika Code for solrj is going to be very dependent on your needs but the beating

RE: PDF extraction using Tika

2020-08-25 Thread Phil Scadden
Code for SolrJ is going to be very dependent on your needs, but the beating heart of my code is below (note that I do OCR as a separate step before feeding files into the indexer). SolrJ and Tika docs should help. File f = new File(filename); ContentHandler textHandler = new

How does Solr suggest sort results when weight is 0

2020-08-25 Thread Hanjan, Harinderdeep S.
Hello, I can't find anything in the docs to understand how Solr sorts suggest results when the weight is the same (0 in my case). Here is my suggester config: --- mySuggester AnalyzingInfixLookupFactory

Re: How to Prevent Recovery?

2020-08-25 Thread Erick Erickson
Commits should absolutely not be taking that much time, that’s where I’d focus first. Some sneaky places things go wonky: 1> you have a suggester configured that builds whenever there’s a commit. 2> you send commits from the client. 3> you’re optimizing on commit. 4> you have too much data for your

Re: How to Prevent Recovery?

2020-08-25 Thread Houston Putman
Are you able to use TLOG replicas? That should reduce the time it takes to recover significantly. It doesn't seem like you have a hard need for near-real-time, since slow ingestions are fine. - Houston On Tue, Aug 25, 2020 at 12:03 PM Anshuman Singh wrote: > Hi, > > We have a 10 node (150G
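A minimal sketch of what trying TLOG replicas could look like, assuming a fresh collection is created through the Collections API `CREATE` action with `nrtReplicas=0` and `tlogReplicas=2`. The base URL, the collection name `ingest_tlog`, and the class and method names are hypothetical and only there for illustration; the shard count mirrors the 50 shards mentioned in the original question.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class TlogCollectionUrl {

    // Builds a Collections API CREATE request that asks for TLOG replicas only.
    // solrBaseUrl and collection are placeholders chosen for this sketch.
    static String build(String solrBaseUrl, String collection, int numShards, int tlogReplicas) {
        return solrBaseUrl + "/admin/collections?action=CREATE"
                + "&name=" + URLEncoder.encode(collection, StandardCharsets.UTF_8)
                + "&numShards=" + numShards
                + "&nrtReplicas=0"                  // no NRT replicas in this collection
                + "&tlogReplicas=" + tlogReplicas;  // TLOG replicas per shard
    }

    public static void main(String[] args) {
        // Hypothetical values: local Solr, 50 shards, 2 TLOG replicas per shard.
        System.out.println(build("http://localhost:8983/solr", "ingest_tlog", 50, 2));
    }
}
```

The printed URL can be sent with any HTTP client. For an existing collection, the replica type cannot simply be flipped by this call; one option is to add new replicas with `ADDREPLICA` and `type=tlog` and then remove the NRT ones.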

How to Prevent Recovery?

2020-08-25 Thread Anshuman Singh
Hi, We have a 10 node (150G RAM, 1TB SAS HDD, 32 cores) Solr 8.5.1 cluster with 50 shards, rf 2 (NRT replicas), 7B docs. We have 5 ZK nodes, 2 of which run on the same nodes as Solr. Our use case requires continuous ingestion (updates mostly). If we ingest at 40k records per sec, after

Issues deploying LTR into SolrCloud

2020-08-25 Thread Dmitry Kan
Hi, There is a recent thread "Replication of Solr Model and feature store" on deploying an LTR feature store and model into a master/slave Solr topology. I'm facing an issue deploying into SolrCloud (Solr 7.5.0), where collections have shards with replicas. This is the process I've been

Re: Apache Solr 8.6.0 with SSL

2020-08-25 Thread Patrik Peng
Thanks for your input regarding SOLR-14711, that makes sense. I wasn't able to reproduce the bin/solr script issue on a Debian machine, so I guess there's something wrong with my setup. Patrik On 24.08.20 17:26, Jan Høydahl wrote: > I think you’re experiencing this: > >

Re: PDF extraction using Tika

2020-08-25 Thread Joe Doupnik
More properly, it would be best to fix Tika and thus not push extra complexity upon many, many users. Error handling is one thing; crashes, though, ought to be designed out. Thanks, Joe D. On 25/08/2020 10:54, Charlie Hull wrote: On 25/08/2020 06:04, Srinivas Kashyap wrote: Hi

Re: PDF extraction using Tika

2020-08-25 Thread Charlie Hull
On 25/08/2020 06:04, Srinivas Kashyap wrote: Hi Alexandre, Yes, these are the same PDF files running on Windows and Linux. There are around 30 PDF files, and I tried indexing a single file but faced the same error. Is it related to how PDFs are stored on Linux? Did you try running Tika (the same version

Creating a phrase match feature in LTR

2020-08-25 Thread krishan goyal
Hi, I am trying to create a phrase match feature (what "pf" does in the dismax/edismax parsers). I've tried various ways to set it up: { "name": "phraseMatch", "class": "org.apache.solr.ltr.feature.SolrFeature", "params": { "q": "{!complexphrase inOrder=true}query(fieldName:${input})" },