Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-05 Thread Caruana, Matthew
Hi Rick, We already do this with 30 eight-core machines running seven jobs each, working off a shared queue. See https://github.com/ICIJ/extract, which has been in production for almost two years. It was originally developed to OCR almost ten million PDFs and TIFFs from the Panama Papers.

Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-05 Thread Rick Leir
Hi Matthew OCR is something which can be parallelized outside of Solr/Tika. Do one OCR task per core, and you can have all cores running at 100%. Write the OCR output to a staging area in the filesystem. cheers -- Rick On 2017-03-03 03:00 AM, Caruana, Matthew wrote: This is the current
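Rick's "one OCR task per core" advice can be sketched with `xargs -P`, which keeps one worker process per core busy and writes results to a staging directory. A minimal, hedged sketch: the stand-in `cp` step and the temp-directory layout are illustrative assumptions; in a real pipeline you would call tesseract (or whichever OCR engine you use) where the comment indicates.

```shell
# Sketch of running OCR outside Solr, one job per core, output to staging.
# The stand-in step just copies each file; swap in your real OCR command.
WORK=$(mktemp -d)
mkdir -p "$WORK/scans" "$WORK/staging"
printf 'page one' > "$WORK/scans/one.tif"
printf 'page two' > "$WORK/scans/two.tif"

# -P $(nproc) caps concurrency at the machine's core count,
# so all cores can sit at 100% without oversubscription.
find "$WORK/scans" -name '*.tif' -print0 |
  xargs -0 -P "$(nproc)" -I{} sh -c '
    in=$1
    out=$2/$(basename "$1" .tif).txt
    # real pipeline: tesseract "$in" "${out%.txt}"
    cp "$in" "$out"
  ' _ {} "$WORK/staging"

ls "$WORK/staging"
```

The key point is that OCR is embarrassingly parallel, so the fan-out belongs in the shell (or a job queue like extract's), not inside Solr/Tika.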

Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Erick Erickson
Well, historically during the really old days, optimize made a major difference. As Lucene evolved that difference was smaller, and in recent _anecdotal_ reports it's up to 10% improvement in query processing with the usual caveats that there are a ton of variables here, including and especially

Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Caruana, Matthew
We index rarely and in bulk as we’re an organisation that deals in enabling access to leaked documents for journalists. The indexes are mostly static for 99% of the year. We only optimise after reindexing due to schema changes or when we have a new leak. Our workflow is to index on a staging

Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Erick Erickson
Matthew: What load testing have you done on optimized .vs. unoptimized indexes? Is there enough of a performance gain to be worth the trouble? Toke's indexes are pretty static, and in his situation it's worth the effort. Before spending a lot of cycles on making optimization work/understanding

Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Caruana, Matthew
Thank you, you’re right - only one of the four cores is hitting 100%. This is the correct answer. The bottleneck is CPU exacerbated by an absence of parallelisation. > On 3 Mar 2017, at 12:32, Toke Eskildsen wrote: > > On Thu, 2017-03-02 at 15:39 +, Caruana, Matthew wrote: >>

Re: What is the bottleneck for an optimise operation?

2017-03-03 Thread Toke Eskildsen
On Thu, 2017-03-02 at 15:39 +, Caruana, Matthew wrote: > Thank you. The question remains however, if this is such a hefty > operation then why is it walking to the destination instead of > running, so to speak? We only do optimize on an old Solr 4.10 setup, but for that we have plenty of

Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-03 Thread Caruana, Matthew
This is the current config: 100 1 10 10. We index in bulk, so after indexing about 4 million documents over a week (OCR takes a long time) we normally

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Shawn Heisey
On 3/2/2017 8:04 AM, Caruana, Matthew wrote: > I’m currently performing an optimise operation on a ~190GB index with about 4 > million documents. The process has been running for hours. > > This is surprising, because the machine is an EC2 r4.xlarge with four cores > and 30GB of RAM, 24GB of

Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-02 Thread Alexandre Rafalovitch
What do you have for merge configuration in solrconfig.xml? You should be able to tune it to - approximately - whatever you want without doing the grand optimize: https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig#IndexConfiginSolrConfig-MergingIndexSegments Regards,
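The merge settings Alexandre points at live in the `<indexConfig>` block of solrconfig.xml. A hedged sketch of what tuning them might look like — the factory and parameter names are real Solr 6.x configuration, but the numeric values here are illustrative placeholders, not recommendations from this thread:

```xml
<indexConfig>
  <!-- TieredMergePolicy is Solr's default; these two factors control how
       aggressively background merges keep the segment count down. -->
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicyFactory>
  <!-- ConcurrentMergeScheduler runs merges in background threads,
       which helps on multi-core machines -->
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>
</indexConfig>
```

Lowering `segmentsPerTier` makes the background merger hold the index to fewer segments continuously, which can remove the need for a grand optimize at all.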

Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-02 Thread Caruana, Matthew
Yes, we already do it outside Solr. See https://github.com/ICIJ/extract which we developed for this purpose. My guess is that the documents are very large, as you say. Optimising was always an attempt to bring down the number of segments from 60+. Not sure how else to do that. > On 2 Mar

Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-02 Thread Caruana, Matthew
I typically end up with about 60-70 segments after indexing. What configuration do you use to bring it down to 16? > On 2 Mar 2017, at 7:42 pm, Michael Joyner wrote: > > You can solve the disk space and time issues by specifying multiple segments > to optimize down to

Re: What is the bottleneck for an optimise operation? / solve the disk space and time issues by specifying multiple segments to optimize

2017-03-02 Thread Michael Joyner
You can solve the disk space and time issues by specifying multiple segments to optimize down to instead of a single segment. When we reindex we have to optimize or we end up with hundreds of segments and very horrible performance. We optimize down to like 16 segments or so and it doesn't do
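Michael's partial optimize is done through the update handler's `maxSegments` parameter, which merges down to N segments instead of one. A minimal sketch; the host, port, and collection name are assumptions to adjust for your deployment:

```shell
# Build the partial-optimize request: maxSegments=16 asks Solr to merge
# down to at most 16 segments rather than a single giant one.
SOLR=http://localhost:8983/solr/mycollection
URL="$SOLR/update?optimize=true&maxSegments=16"
echo "$URL"
# To actually trigger it (requires a live Solr instance):
# curl "$URL"
```

Because far fewer bytes get rewritten than in a full single-segment optimize, this keeps most of the query-time benefit while cutting the disk and time cost.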

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Erick Erickson
It's _very_ unlikely that optimize will help with OOMs, so that's very probably a red herring. Likely the document that's causing the issue is very large or, perhaps, you're using the extracting processor and it might be a Tika issue, consider doing the Tika processing outside Solr if so, see:

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Caruana, Matthew
Thank you, these are useful tips. We were previously working with a 4GB heap and getting OOMs in Solr while updating (probably from the analysers) that would cause the index writer to close with what’s called a “tragic” error in the writer code. Only a hard restart of the service could bring

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Walter Underwood
6.4.0 added a lot of metrics to low-level calls. That makes many operations slow. Go back to 6.3.0 or wait for 6.4.2. Meanwhile, stop running optimize. You almost certainly don’t need it. 24 GB is a huge heap. Do you really need that? We run a 15 million doc index with an 8 GB heap (Java

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Otis Gospodnetić
Hi, It's simply expensive. You are rewriting your whole index. Why are you running optimize? Are you seeing performance problems you are trying to fix with optimize? Otis -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training -

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Caruana, Matthew
Thank you. The question remains however, if this is such a hefty operation then why is it walking to the destination instead of running, so to speak? Is the process throttled in some way? > On 2 Mar 2017, at 16:20, David Hastings wrote: > > Agreed, and since it

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Otis Gospodnetić
Hi Matthew, I'm guessing it's the EBS. With EBS we've seen: * cpu.system going up in some kernels * low read/write speeds and maxed out IO at times Otis On

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread David Hastings
Agreed, and the fact that it takes three times the space is part of the reason it takes so long: that 190GB index ends up writing another 380GB before it compresses down and deletes the two leftover files. It's a pretty hefty operation On Thu, Mar 2, 2017 at 10:13 AM, Alexandre Rafalovitch

Re: What is the bottleneck for an optimise operation?

2017-03-02 Thread Alexandre Rafalovitch
The optimize operation is no longer recommended for Solr, as the background merges have become a lot smarter. It is an extremely expensive operation that can require up to three times the index's disk space during processing. That said, yours is still a valid question, which I am leaving to others to respond to.

What is the bottleneck for an optimise operation?

2017-03-02 Thread Caruana, Matthew
I’m currently performing an optimise operation on a ~190GB index with about 4 million documents. The process has been running for hours. This is surprising, because the machine is an EC2 r4.xlarge with four cores and 30GB of RAM, 24GB of which is allocated to the JVM. The load average has been