Re: PDF extraction using Tika

2020-08-26 Thread Walter Underwood
When I worked for a search engine vendor, we did exactly the same thing. We always ran the document crackers in a different process because they tended to hang, crash, run forever, or use all of memory. Adobe PDFlib was not an exception to that rule. wunder Walter Underwood Ultraseek Server
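
The process isolation described above can be sketched roughly as follows. This is a minimal illustration, not the vendor's implementation: `run_isolated` is a hypothetical helper, and the child command is a stand-in for a real document cracker. It covers hangs and crashes through a timeout and exit-code check; capping memory would need OS-level limits that are not shown here.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s):
    """Run an extraction command in a child process with a hard time limit.

    Returns (ok, output). A non-zero exit or a timeout yields ok=False
    instead of taking down the calling process.
    """
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False, "timeout"
    if done.returncode != 0:
        return False, done.stderr.strip()
    return True, done.stdout


if __name__ == "__main__":
    # Stand-in for a real extractor: a child Python process that prints text.
    ok, text = run_isolated([sys.executable, "-c", "print('extracted text')"], timeout_s=10)
    print(ok, text.strip())
```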

Re: Slow commit in Solr 6.1.0

2020-08-26 Thread Erick Erickson
There are a bunch of variables. If there are too many merge threads going on, for instance, then the commit will block until one of the merge threads finishes. It could well be that the one you identify as “slow” is coincidentally after the hard commit, which could accumulate for 10 minutes

RE: [EXT] Re: PDF extraction using Tika

2020-08-26 Thread Hanjan, Harinderdeep S.
I found it better to offload PDF parsing and text extraction to a standalone Tika Server instead. This way, if a PDF crashes the Tika Server, it will not take down the JVM where your code is running. You could easily have multiple instances of Tika Server running (perhaps on another machine)
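
A minimal client-side sketch of this setup, assuming a Tika Server is already running on its default port 9998 and using its `/tika` endpoint for plain-text extraction. The helper names `build_tika_request` and `extract_text` are hypothetical, and the timeout value is arbitrary.

```python
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumption: Tika Server on its default port


def build_tika_request(pdf_bytes, url=TIKA_URL):
    """Build a PUT request asking Tika Server for plain-text extraction."""
    return urllib.request.Request(
        url,
        data=pdf_bytes,
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


def extract_text(pdf_bytes, timeout_s=60, url=TIKA_URL):
    """Send the PDF to Tika Server; return the text, or None on a network failure."""
    try:
        with urllib.request.urlopen(build_tika_request(pdf_bytes, url), timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        # URLError, HTTPError, timeouts, and connection resets are OSError subclasses.
        return None
```

If the server dies on a bad PDF, the caller only sees `None` and can retry against another instance.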

Re: Slow commit in Solr 6.1.0

2020-08-26 Thread vishal patel
Thanks for your quick reply. Commit is not called from the client side. We do not use any cache. Here is my solrconfig.xml: https://drive.google.com/file/d/1LwA1d4OiMhQQv806tR0HbZoEjA8IyfdR/view We use set SOLR_OPTS=%SOLR_OPTS% -Dsolr.autoSoftCommit.maxTime=100 because we want quick view after

Solr Server Context Filtering in Auto Suggester not working

2020-08-26 Thread Viktor Bonchevski
Hello, I opened a bug issue, not knowing it was not the correct place to ask the question at hand, so I was directed to send an e-mail to the mailing list. Hopefully I'm correct this time. Here's a link to the issue opened:

Re: PDF extraction using Tika

2020-08-26 Thread Jan Høydahl
When I worked for a search engine vendor in my previous life, the PDF parsing pipeline looked something like this:
Try parsing the PDF file with tool X
If failure or timeout, try instead with tool Y
If failure or timeout, try instead with tool Z
In this case X would be the preferred parser, but
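
The X, then Y, then Z fallback can be sketched as a loop over parsers. This is only an illustration: `extract_with_fallback` and the parser callables are hypothetical, and each callable is assumed to raise an exception on failure or timeout.

```python
def extract_with_fallback(path, parsers):
    """Try each (name, parse) pair in order; return (name, text) from the first success.

    Each parse callable is expected to raise on failure or timeout.
    """
    errors = []
    for name, parse in parsers:
        try:
            return name, parse(path)
        except Exception as exc:  # any failure moves on to the next tool
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all parsers failed: " + "; ".join(errors))


if __name__ == "__main__":
    def tool_x(path):
        # Stand-in for the preferred parser, failing on this document.
        raise TimeoutError("took too long")

    def tool_y(path):
        # Stand-in for the second-choice parser.
        return "text from " + path

    print(extract_with_fallback("sample.pdf", [("X", tool_x), ("Y", tool_y)]))
```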

Re: Apache Solr 8.6.0 with SSL

2020-08-26 Thread Patrik Peng
Followup regarding the bin/solr issue for anyone running Solr on FreeBSD. The script uses "ps auxww | grep ..." in various places, like: SOLR_PROC=`ps auxww |grep -w $SOLR_PID|grep start\.jar |grep jetty\.port` For reasons unknown to me, FreeBSD's "ps auxww" truncates the COMMAND column output

Re: Slow commit in Solr 6.1.0

2020-08-26 Thread Erick Erickson
It depends on how the commit is called. You have openSearcher=true, which means the call won’t return until all your autowarming is done. This _looks_ like it might be a commit called from a client, which you should not do. It’s also suspicious that these are soft commits 1 second apart. The

Re: About solr.HyphenatedWordsFilter

2020-08-26 Thread Erick Erickson
Another option is to suggest from a copyField with a very simple analysis chain. Say:
PatternReplaceCharFilterFactory - to remove everything you don’t want to keep.
WhitespaceTokenizerFactory
LowerCaseFilterFactory - maybe
And I think you miss Shawn’s point about the exclamation point. If you
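
To see what such a chain would do to the sample sentence from this thread, here is a rough Python approximation. It does not run Solr code: the three steps only mimic a pattern-replace char filter, a whitespace tokenizer, and a lowercase filter, and the pattern (drop everything except word characters, whitespace, and hyphens) is an assumption chosen for illustration.

```python
import re


def analyze(text, strip_pattern=r"[^\w\s-]"):
    """Approximate the chain: pattern replace, whitespace split, lowercase."""
    cleaned = re.sub(strip_pattern, "", text)  # char filter: drop unwanted characters
    tokens = cleaned.split()                   # tokenizer: split on whitespace only
    return [t.lower() for t in tokens]         # filter: lowercase each token


if __name__ == "__main__":
    print(analyze("This is a new abc-edg and xyz-abc is coming soon!"))
```

With this pattern the hyphenated IDs survive as single tokens, while the trailing exclamation point is removed rather than emitted as its own token.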

Re: About solr.HyphenatedWordsFilter

2020-08-26 Thread Kayak28
Hello, Shawn. Thank you for your response. Yes, I am sure that I need to preserve "-" in the words. What I want to do is not actually search; it is for suggestions. "abc-efg" is a dummy sample of our product ID. There are several product IDs, such as abc-efg, abc-hij, abc-klm and so on. When

Re: Real time index data

2020-08-26 Thread Jörn Franke
Maybe to add to this: additionally, try to batch the requests from the queue. Don’t process them one by one, but take n items at a time. On the Solr side, also look at the configuration of soft commits vs. hard commits. Soft commits determine how real-time this is and can be.
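
Taking n items at a time from a queue might look like the sketch below. It uses Python's in-process `queue.Queue` as a stand-in for a real message broker, and `drain_batch` is a hypothetical helper; sending the batch to Solr is left out.

```python
import queue


def drain_batch(q, max_items, first_timeout_s=1.0):
    """Wait for the first item, then take up to max_items without further waiting."""
    batch = []
    try:
        batch.append(q.get(timeout=first_timeout_s))
        while len(batch) < max_items:
            batch.append(q.get_nowait())
    except queue.Empty:
        pass  # queue ran dry: return whatever was collected
    return batch


if __name__ == "__main__":
    q = queue.Queue()
    for row_id in range(5):
        q.put(row_id)
    print(drain_batch(q, max_items=3))
    print(drain_batch(q, max_items=3))
```

Each returned batch would then go to Solr as a single update request instead of one request per row.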

Re: Real time index data

2020-08-26 Thread Jörn Franke
You do not provide many details, but a queuing mechanism seems to be appropriate for this use case.
> On 26.08.2020 at 11:30, Tushar Arora wrote:
>
> Hi,
>
> One of our use cases requires real time indexing of data in solr from DB.
> Approximately, 30 rows are updated in a second in DB. And

Real time index data

2020-08-26 Thread Tushar Arora
Hi, One of our use cases requires real-time indexing of data in Solr from a DB. Approximately 30 rows are updated per second in the DB, and I want these to be updated in the index simultaneously. Is a queuing mechanism like RabbitMQ helpful in my case? Please suggest ways to achieve it.

Slow commit in Solr 6.1.0

2020-08-26 Thread vishal patel
I am using Solr 6.1.0. We have 2 shards and each has one replica. When I checked the shard1 log, I found that the commit process was slow for some collections. Slow commit: 2020-08-25 09:08:10.328 INFO (commitScheduler-124-thread-1) [c:forms s:shard1 r:core_node1 x:forms]

Re: PDF extraction using Tika

2020-08-26 Thread Charlie Hull
Hi Joe, Tika is pretty amazing at coping with the things people throw at it and I know the team behind it have added a very extensive testing framework. However, the reality is that malformed, huge or just plain crazy documents may cause crashes - PDFs are mad, you can even embed Javascript

Re: Issues deploying LTR into SolrCloud

2020-08-26 Thread Dmitry Kan
Hello, Just noticed my numbering is off, should be:
1. Deploy a feature store from a JSON file to each collection.
2. Reload all collections as advised in the documentation: https://lucene.apache.org/solr/guide/7_5/learning-to-rank.html#applying-changes
3. Deploy the related model from a JSON
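
For reference, the three steps listed so far map onto HTTP requests roughly as sketched below. The endpoint paths follow the Solr Learning To Rank and Collections API documentation; the helper `ltr_deploy_requests`, the base URL, and the collection name are assumptions, and the sketch only builds the requests without sending them.

```python
import urllib.parse
import urllib.request


def ltr_deploy_requests(base_url, collection, features_json, model_json):
    """Build, in order: feature-store upload, collection reload, model upload."""
    coll = urllib.parse.quote(collection)
    json_headers = {"Content-Type": "application/json"}
    reload_query = urllib.parse.urlencode({"action": "RELOAD", "name": collection})
    return [
        urllib.request.Request(
            f"{base_url}/{coll}/schema/feature-store",
            data=features_json, method="PUT", headers=json_headers),
        urllib.request.Request(
            f"{base_url}/admin/collections?{reload_query}", method="GET"),
        urllib.request.Request(
            f"{base_url}/{coll}/schema/model-store",
            data=model_json, method="PUT", headers=json_headers),
    ]


if __name__ == "__main__":
    # Hypothetical base URL and collection name, with empty placeholder payloads.
    for req in ltr_deploy_requests("http://localhost:8983/solr", "techproducts", b"[]", b"{}"):
        print(req.get_method(), req.full_url)
```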

Re: Solr with HDFS configuration example running in production/dev

2020-08-26 Thread Prashant Jyoti
Hi Joe, Yes I had made these changes for getting HDFS to work with Solr. Below are config changes which I carried out: Changes in solr.in.cmd set SOLR_OPTS=%SOLR_OPTS% -Dsolr.directoryFactory=HdfsDirectoryFactory set

Re: About solr.HyphenatedWordsFilter

2020-08-26 Thread Shawn Heisey
On 8/26/2020 12:05 AM, Kayak28 wrote: I would like to tokenize the following sentence. I want tokens that retain their hyphens. So, for example, original text: This is a new abc-edg and xyz-abc is coming soon! desired output tokens: this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/! Is there

About solr.HyphenatedWordsFilter

2020-08-26 Thread Kayak28
Hello, Solr community: I would like to tokenize the following sentence. I want tokens that retain their hyphens. So, for example, original text: This is a new abc-edg and xyz-abc is coming soon! desired output tokens: this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/! Is there any way that I do