Hi Rick,
We already do this with 30 eight-core machines running seven jobs each, working
off a shared queue. See https://github.com/ICIJ/extract, which has been in
production for almost two years. It was originally developed to OCR almost
ten million PDFs and TIFFs from the Panama Papers.
Hi Matthew,
OCR is something which can be parallelized outside of Solr/Tika. Do one
OCR task per core, and you can have all cores running at 100%. Write the
OCR output to a staging area in the filesystem.
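A minimal sketch of the idea in Java, assuming the tesseract CLI is installed
and on the PATH (the class name, paths, and pool sizing are placeholders, not
how extract does it):

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.concurrent.*;
    import java.util.stream.Stream;

    public class OcrPool {
        public static void main(String[] args) throws Exception {
            Path in = Paths.get(args[0]);       // directory of scanned TIFFs
            Path staging = Paths.get(args[1]);  // staging area for OCR text
            Files.createDirectories(staging);

            // One OCR task per core keeps all cores at 100%.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());

            try (Stream<Path> files = Files.list(in)) {
                files.forEach(f -> pool.submit(() -> ocr(f, staging)));
            }
            pool.shutdown();
            pool.awaitTermination(7, TimeUnit.DAYS);
        }

        static void ocr(Path file, Path staging) {
            // tesseract writes <outBase>.txt; PDFs need rasterising first.
            String outBase =
                    staging.resolve(file.getFileName().toString()).toString();
            try {
                new ProcessBuilder("tesseract", file.toString(), outBase)
                        .inheritIO().start().waitFor();
            } catch (IOException e) {
                System.err.println("OCR failed for " + file + ": " + e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

extract does essentially this, but pulls paths from a shared queue so several
machines can work the same backlog.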
cheers -- Rick
On 2017-03-03 03:00 AM, Caruana, Matthew wrote:
This is the current config […]
Well, historically, back in the really old days, optimize made a major
difference. As Lucene evolved that difference got smaller, and in
recent _anecdotal_ reports it's up to a 10% improvement in query
processing, with the usual caveats that there are a ton of variables
here, including and especially […]
We index rarely and in bulk, as we’re an organisation that deals in enabling
access to leaked documents for journalists.
The indexes are mostly static for 99% of the year. We only optimise after
reindexing due to schema changes or when we have a new leak.
Our workflow is to index on a staging […]
Matthew:
What load testing have you done on optimized vs. unoptimized indexes?
Is there enough of a performance gain to be worth the trouble? Toke's
indexes are pretty static, and in his situation it's worth the effort.
Before spending a lot of cycles on making optimization
work/understanding […]
Thank you, you’re right - only one of the four cores is hitting 100%. This is
the correct answer. The bottleneck is CPU, exacerbated by a lack of
parallelisation.
> On 3 Mar 2017, at 12:32, Toke Eskildsen wrote: […]
On Thu, 2017-03-02 at 15:39 +0000, Caruana, Matthew wrote:
> Thank you. The question remains however, if this is such a hefty
> operation then why is it walking to the destination instead of
> running, so to speak?
We only do optimize on an old Solr 4.10 setup, but for that we have plenty
of […]
This is the current config (the enclosing XML element names were stripped by
the archive; only the values survive): 100, 1, 10, 10
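For reference, values like those would normally sit in the indexConfig block
of solrconfig.xml. A sketch of a typical Solr 6.x merge configuration, with
illustrative element names since the original tags did not survive:

    <indexConfig>
      <!-- flush a new segment once the in-memory buffer reaches 100 MB -->
      <ramBufferSizeMB>100</ramBufferSizeMB>
      <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
        <!-- how many segments are merged in one go, and allowed per tier -->
        <int name="maxMergeAtOnce">10</int>
        <int name="segmentsPerTier">10</int>
      </mergePolicyFactory>
    </indexConfig>

Lowering segmentsPerTier makes background merging more aggressive, which keeps
the segment count down without a manual optimize.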
We index in bulk, so after indexing about 4 million documents over a week (OCR
takes a long time) we normally […]
On 3/2/2017 8:04 AM, Caruana, Matthew wrote:
> I’m currently performing an optimise operation on a ~190GB index with about 4
> million documents. The process has been running for hours.
>
> This is surprising, because the machine is an EC2 r4.xlarge with four cores
> and 30GB of RAM, 24GB of which is allocated to the JVM.
What do you have for merge configuration in solrconfig.xml? You should
be able to tune it to - approximately - whatever you want without
doing the grand optimize:
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments
Regards,
Yes, we already do it outside Solr. See https://github.com/ICIJ/extract, which
we developed for this purpose. My guess is that the documents are very large,
as you say.
Optimising was always an attempt to bring the number of segments down from 60+.
I'm not sure how else to do that.
I typically end up with about 60-70 segments after indexing. What configuration
do you use to bring it down to 16?
> On 2 Mar 2017, at 7:42 pm, Michael Joyner wrote:
>
> You can solve the disk space and time issues by specifying multiple segments
> to optimize down to […]
You can solve the disk space and time issues by specifying multiple
segments to optimize down to instead of a single segment.
When we reindex we have to optimize, or we end up with hundreds of
segments and horrible performance.
We optimize down to 16 segments or so, and it doesn't do […]
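With SolrJ that is a single call; a minimal sketch (the core URL is a
placeholder):

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class OptimizeDown {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycore").build()) {
                // waitFlush, waitSearcher, maxSegments: merge to 16, not 1
                client.optimize(true, true, 16);
            }
        }
    }

The plain-HTTP equivalent is a request to /update?optimize=true&maxSegments=16.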
It's _very_ unlikely that optimize will help with OOMs, so that's
very probably a red herring. Likely the document that's causing
the issue is very large, or perhaps you're using the extracting
processor and it's a Tika issue; if so, consider doing the Tika
processing outside Solr, see: […]
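For the standalone-Tika route, the simplest form is the Tika facade; a minimal
sketch (the path argument is a placeholder):

    import java.io.File;
    import org.apache.tika.Tika;

    public class ExtractText {
        public static void main(String[] args) throws Exception {
            // Auto-detects the file type and returns plain text, which can
            // then be sent to Solr as an ordinary update request.
            Tika tika = new Tika();
            System.out.println(tika.parseToString(new File(args[0])));
        }
    }

That way a poison document takes down the extraction job, not the Solr JVM.
Note that parseToString truncates output at roughly 100k characters unless you
raise the limit.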
Thank you, these are useful tips.
We were previously working with a 4GB heap and getting OOMs in Solr while
updating (probably from the analysers) that would cause the index writer to
close with what’s called a “tragic” error in the writer code. Only a hard
restart of the service could bring it back.
6.4.0 added a lot of metrics to low-level calls. That makes many operations
slow. Go back to 6.3.0 or wait for 6.4.2.
Meanwhile, stop running optimize. You almost certainly don’t need it.
24 GB is a huge heap. Do you really need that? We run a 15 million doc index
with an 8 GB heap (Java […]
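For what it's worth, the heap for a stock Solr install is set at startup; for
example:

    bin/solr start -m 8g   # 8 GB min/max heap instead of the default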
Hi,
It's simply expensive. You are rewriting your whole index.
Why are you running optimize? Are you seeing performance problems you are
trying to fix with optimize?
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
Thank you. The question remains, however: if this is such a hefty operation,
then why is it walking to the destination instead of running, so to speak?
Is the process throttled in some way?
> On 2 Mar 2017, at 16:20, David Hastings wrote:
>
> Agreed, and since it […]
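On the throttling question: partly, yes. Merges run on Lucene's
ConcurrentMergeScheduler, which caps the number of concurrent merge threads,
and the final big merge of an optimize runs on a single thread, which is why
only one core stays busy. The caps are tunable in indexConfig; a sketch with
illustrative values:

    <indexConfig>
      <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
        <!-- max merge threads running at once, and merges allowed to queue -->
        <int name="maxThreadCount">4</int>
        <int name="maxMergeCount">6</int>
      </mergeScheduler>
    </indexConfig>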
Hi Matthew,
I'm guessing it's the EBS. With EBS we've seen:
* cpu.system going up in some kernels
* low read/write speeds and maxed out IO at times
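A quick way to check on the box itself (standard sysstat tool) is:

    iostat -x 5   # watch %util and await on the EBS volume during the merge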
Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
Agreed, and the fact that it takes three times the space is part of the reason
it takes so long: that 190GB index ends up writing another 380GB until it
compresses down and deletes the two leftover files. It's a pretty hefty
operation.
On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch wrote:
The optimize operation is no longer recommended for Solr, as the
background merges have become a lot smarter.
It is an extremely expensive operation that can require up to three times
the index size in disk space while it runs.
That's not to say yours isn't a valid question; I'll leave it to
others to answer.
I’m currently performing an optimise operation on a ~190GB index with about 4
million documents. The process has been running for hours.
This is surprising, because the machine is an EC2 r4.xlarge with four cores and
30GB of RAM, 24GB of which is allocated to the JVM.
The load average has been […]