When I worked for a search engine vendor, we did exactly the same thing.
We always ran the document crackers in a different process because they tended
to hang, crash, run forever, or consume all available memory. Adobe PDFlib was not an
exception to that rule.
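The process isolation described above can be sketched roughly as follows. This is a minimal illustration, not anyone's actual pipeline: the helper name and the idea of treating timeouts and non-zero exits alike are my assumptions, and the external cracker command is whatever tool you have installed.

```python
import subprocess

def crack_document(command, path, timeout_s=60):
    """Run an external document cracker on `path` in its own process.

    A hang, crash, or runaway memory use in the cracker then kills only
    the child process, never the calling service.
    """
    try:
        result = subprocess.run(
            command + [path],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # kill the child if it runs forever
        )
    except subprocess.TimeoutExpired:
        return None  # treat a hang the same as a failed parse
    if result.returncode != 0:
        return None  # the cracker crashed or rejected the file
    return result.stdout
```

For example, `crack_document(["pdftotext"], "report.pdf")` would shell out to a PDF-to-text tool if one is installed; the point is only that nothing the child does can take the parent down.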
wunder
Walter Underwood
Ultraseek Server
There are a bunch of variables. If there are too many merge threads going on,
for instance, then the commit will block until one of the merge threads
finishes. It could well be that the one you identify as “slow” is
coincidentally the one after the hard commit, which could
accumulate for 10 minutes
I found it better to offload PDF parsing and text extraction to a standalone
Tika Server instead. This way, if a PDF crashes the Tika Server, it will not
take down the JVM where your code is running.
You could easily have multiple instances of Tika Server running (perhaps on
another machine)
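For anyone who hasn't used it: Tika Server's documented REST interface accepts a PUT of the raw document to the `/tika` endpoint (port 9998 by default) and returns extracted plain text. A minimal client sketch, where the server URL is an assumption you'd adjust for your deployment:

```python
import urllib.request

def build_tika_request(data, server="http://localhost:9998"):
    """Build a PUT request asking Tika Server for plain-text extraction.

    Port 9998 and the /tika endpoint are Tika Server's documented
    defaults; pass a different `server` for your own deployment.
    """
    return urllib.request.Request(
        server + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )

def extract_text(path, server="http://localhost:9998", timeout_s=30):
    """Send a document to Tika Server and return the extracted text."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read(), server)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

Pointing `server` at a second instance (or a load balancer in front of several) gives you the multiple-instances setup mentioned above.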
Thanks for your quick reply.
Commit is not called from client side.
We do not use any cache. Here is my solrconfig.xml :
https://drive.google.com/file/d/1LwA1d4OiMhQQv806tR0HbZoEjA8IyfdR/view
We set SOLR_OPTS=%SOLR_OPTS% -Dsolr.autoSoftCommit.maxTime=100 because we
want a quick view after
Hello,
I opened a bug issue, not knowing it was not the correct place to ask the
question at hand, so I was directed to send an e-mail to the mailing list;
hopefully I'm correct this time.
Here's a link to the issue opened:
When I worked for a search engine vendor in my previous life, the PDF parsing
pipeline looked something like this
Try parsing the PDF file with tool X
If failure or timeout, try instead with tool Y
If failure or timeout, try instead with tool Z
In this case X would be the preferred parser, but
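The X/Y/Z fallback chain above might look something like this in-process sketch (names and error handling are mine, not the original pipeline's; real hangs are better handled by per-tool child processes, as noted elsewhere in this thread):

```python
def parse_with_fallbacks(data, parsers):
    """Try each (name, parse_fn) in preference order.

    Returns (tool_name, text) from the first parser that succeeds;
    a parser that raises is skipped and the next tool gets its chance.
    Hard timeouts are best enforced by running each tool in its own
    process, which this in-process sketch does not attempt.
    """
    errors = {}
    for name, parse_fn in parsers:
        try:
            return name, parse_fn(data)
        except Exception as exc:  # failure: fall through to the next tool
            errors[name] = exc
    raise RuntimeError(f"all parsers failed: {errors}")
```

The ordering encodes the preference: the most accurate tool first, the most forgiving one last.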
Followup regarding the bin/solr issue for anyone running Solr on FreeBSD.
The script uses "ps auxww | grep ..." in various places, like:
SOLR_PROC=`ps auxww |grep -w $SOLR_PID|grep start\.jar |grep jetty\.port`
For reasons unknown to me, FreeBSD's "ps auxww" truncates the COMMAND
column output
It depends on how the commit is called. You have openSearcher=true, which means
the call
won’t return until all your autowarming is done. This _looks_ like it might be
a commit
called from a client, which you should not do.
It’s also suspicious that these are soft commits 1 second apart. The
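For context, the commit cadence under discussion is configured in solrconfig.xml; a minimal sketch, with purely illustrative times:

```xml
<!-- Hard commits flush to disk; openSearcher=false keeps them cheap. -->
<autoCommit>
  <maxTime>60000</maxTime>          <!-- every 60 seconds -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commits control visibility; opening a searcher triggers autowarming. -->
<autoSoftCommit>
  <maxTime>10000</maxTime>          <!-- every 10 seconds -->
</autoSoftCommit>
```

With openSearcher=false on the hard commit, only the soft commit pays the autowarming cost.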
Another option is to suggest from a copyField with a very simple analysis
chain. Say:
PatternReplaceCharFilterFactory - to remove everything you don’t want to keep.
WhitespaceTokenizerFactory
LowercaseFilterFactory - maybe
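A corresponding fieldType sketch (the type name and the char-filter pattern below are placeholders of mine, not from the original message; the actual Solr factory class is spelled LowerCaseFilterFactory):

```xml
<fieldType name="suggest_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Strip everything you don't want to keep; this pattern is a placeholder. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9 -]" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```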
And I think you're missing Shawn’s point about the exclamation point. If you
Hello, Shawn
Thank you for your response.
Yes. I am sure that I need to preserve "-" in the words.
What I want to do is not actually search; it is for a suggester.
"abc-efg" is a dummy sample of our product IDs.
So, there are several product IDs, such as abc-efg, abc-hij, abc-klm, and so
on.
When
Maybe to add to this: also try to batch the requests from the queue -
don’t process them one by one, but take n items at a time.
On the Solr side, also look at the configuration of soft commits vs. hard
commits. Soft commits define how close to real time this is and can be.
You do not provide many details, but a queuing mechanism seems to be
appropriate for this use case.
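The batching suggested above can be sketched with a plain in-process queue (the helper name and batch size are illustrative; with RabbitMQ you would use its own prefetch/ack mechanics instead):

```python
import queue

def drain_batch(q, max_items=100):
    """Pull up to `max_items` documents off the queue in one go.

    Blocks for the first item, then grabs whatever else is already
    queued, so a busy queue yields full batches and a quiet one
    yields small ones instead of waiting around.
    """
    batch = [q.get()]               # wait for at least one item
    while len(batch) < max_items:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch
```

Each drained batch would then go to Solr in a single add/update request rather than one request per row.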
> Am 26.08.2020 um 11:30 schrieb Tushar Arora :
>
> Hi,
>
> One of our use cases requires real time indexing of data in solr from DB.
> Approximately, 30 rows are updated in a second in DB. And
Hi,
One of our use cases requires real time indexing of data in solr from DB.
Approximately, 30 rows are updated in a second in DB. And I also want these
to be updated in the index simultaneously.
Is the Queuing mechanism like Rabbitmq helpful in my case?
Please suggest the ways to achieve it.
I am using Solr 6.1.0. We have 2 shards and each has one replica.
When I checked the shard1 log, I found that the commit process was going
too slowly for some collections.
Slow commit:
2020-08-25 09:08:10.328 INFO (commitScheduler-124-thread-1) [c:forms s:shard1
r:core_node1 x:forms]
Hi Joe,
Tika is pretty amazing at coping with the things people throw at it and
I know the team behind it have added a very extensive testing framework.
However, the reality is that malformed, huge, or just plain crazy
documents may cause crashes - PDFs are mad; you can even embed
JavaScript
Hello,
Just noticed my numbering is off, should be:
1. Deploy a feature store from a JSON file to each collection.
2. Reload all collections as advised in the documentation:
https://lucene.apache.org/solr/guide/7_5/learning-to-rank.html#applying-changes
3. Deploy the related model from a JSON
Hi Joe,
Yes I had made these changes for getting HDFS to work with Solr. Below are
config changes which I carried out:
Changes in solr.in.cmd
set SOLR_OPTS=%SOLR_OPTS% -Dsolr.directoryFactory=HdfsDirectoryFactory
set
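For comparison, the Solr Reference Guide's HDFS setup typically pairs the directory factory with an HDFS lock type and a home path. The namenode host, port, and path below are placeholders, not taken from the original message:

```
set SOLR_OPTS=%SOLR_OPTS% -Dsolr.directoryFactory=HdfsDirectoryFactory
set SOLR_OPTS=%SOLR_OPTS% -Dsolr.lock.type=hdfs
set SOLR_OPTS=%SOLR_OPTS% -Dsolr.hdfs.home=hdfs://namenode:8020/solr
```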
On 8/26/2020 12:05 AM, Kayak28 wrote:
I would like to tokenize the following sentence. I want tokens
that retain hyphens. So, for example, original text: This is a new
abc-edg and xyz-abc is coming soon! desired output tokens:
this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/! Is there
Hello, Solr community:
I would like to tokenize the following sentence.
I want tokens that retain hyphens.
So, for example,
original text: This is a new abc-edg and xyz-abc is coming soon!
desired output tokens: this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/!
Is there any way that I do
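Outside Solr, the desired output can be approximated with a small regex tokenizer: keep runs of word characters and hyphens together, lowercase everything, and split any other punctuation (like "!") into its own token. This is only a sketch of the behavior being asked for, not the Solr analysis chain itself:

```python
import re

def tokenize(text):
    r"""Lowercase, keep hyphenated terms whole, split off punctuation.

    [\w-]+ keeps "abc-edg" as one token; any other non-space
    character (like "!") becomes its own token.
    """
    return re.findall(r"[\w-]+|[^\w\s]", text.lower())
```

In Solr terms this corresponds roughly to whitespace-style tokenization plus lowercasing, with punctuation separated out.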