Re: Tuning Solr caches with high commit rates (NRT)

Peter Karich Tue, 14 Sep 2010 00:37:51 -0700

Hi Peter,

this scenario would be really great for us - I didn't know that this was
possible and that it works, so: thanks!
At the moment we are doing something similar by replicating to the
read-only instance, but the replication is somewhat lengthy and
resource-intensive at this data volume ;-)


Regards,
Peter.

> 1. You can run multiple Solr instances in separate JVMs, with both
> having their solr.xml configured to use the same index folder.
> You need to be careful that one and only one of these instances will
> ever update the index at a time. The best way to ensure this is to use
> one for writing only,
> and the other is read-only and never writes to the index. This
> read-only instance is the one to use for tuning for high search
> performance. Even though the RO instance doesn't write to the index,
> it still needs periodic (albeit empty) commits to kick off
> autowarming/cache refresh.
>
> Depending on your needs, you might not need to have 2 separate
> instances. We need it because the 'write' instance is also doing a lot
> of metadata pre-write operations in the same jvm as Solr, and so has
> its own memory requirements.
>
> 2. We use sharding all the time, and it works just fine with this
> scenario, as the RO instance is simply another shard in the pack.
>
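A minimal sketch of the periodic empty commit described in point 1,
assuming the read-only instance exposes the standard XML update handler
at /update. The base URL, the 30-second interval, and the helper names
build_commit_request and commit_forever are illustrative assumptions,
not taken from the setup described above.

```python
import time
import urllib.request


def build_commit_request(base_url):
    """Build a POST request that sends an empty <commit/> to Solr's XML update handler."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/update",
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def commit_forever(base_url, interval_secs=30):
    """Send an empty commit every interval_secs so the read-only instance reopens its searcher."""
    while True:
        try:
            with urllib.request.urlopen(build_commit_request(base_url), timeout=10) as resp:
                resp.read()
        except OSError as err:
            # Covers URLError and socket timeouts; keep looping so one failure is not fatal.
            print("commit failed:", err)
        time.sleep(interval_secs)


# Hypothetical usage (URL is an assumption):
# commit_forever("http://localhost:8983/solr", interval_secs=30)
```

A cron job that posts the same <commit/> body would do equally well;
the point is only that something outside the read-only instance has to
trigger the refresh.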
>
> On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich <peat...@yahoo.de> wrote:
>   
>> Peter,
>>
>> thanks a lot for your in-depth explanations!
>> Your findings will definitely be helpful for my next performance
>> improvement tests :-)
>>
>> Two questions:
>>
>> 1. How would I do that:
>>
>>     
>>> or a local read-only instance that reads the same core as the indexing
>>> instance (for the latter, you'll need something that periodically refreshes 
>>> - i.e. runs commit()).
>>>       
>>
>> 2. Did you try sharding with your current setup (e.g. one big,
>> nearly-static index and a tiny write+read index)?
>>
>> Regards,
>> Peter.
>>
>>     
>>> Hi,
>>>
>>> Below are some notes regarding Solr cache tuning that should prove
>>> useful for anyone who uses Solr with frequent commits (e.g. <5min).
>>>
>>> Environment:
>>> Solr 1.4.1 or branch_3x trunk.
>>> Note the 4.x trunk has lots of neat new features, so the notes here
>>> are likely less relevant to the 4.x environment.
>>>
>>> Overview:
>>> Our Solr environment makes extensive use of faceting, we perform
>>> commits every 30secs, and the indexes tend to be on the large-ish side
>>> (>20million docs).
>>> Note: For our data, when we commit, we are always adding new data,
>>> never changing existing data.
>>> This type of environment can be tricky to tune, as Solr is more geared
>>> toward fast reads than frequent writes.
>>>
>>> Symptoms:
>>> If you have used faceting in searches where you are also performing
>>> frequent commits, you've likely encountered the dreaded OutOfMemory or
>>> 'GC overhead limit exceeded' errors.
>>> In high commit rate environments, this is almost always due to
>>> multiple 'onDeck' searchers and autowarming - i.e. new searchers don't
>>> finish autowarming their caches before the next commit()
>>> comes along and invalidates them.
>>> Once this starts happening on a regular basis, it is likely your
>>> Solr's JVM will run out of memory eventually, as the number of
>>> searchers (and their cache arrays) will keep growing until the JVM
>>> dies of thirst.
>>> To check if your Solr environment is suffering from this, turn on INFO
>>> level logging, and look for: 'PERFORMANCE WARNING: Overlapping
>>> onDeckSearchers=x'.
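
A small sketch for checking a log for the warning quoted above. It
assumes only that the message text matches the quoted string with a
number in place of x. The function name and the sample log lines are
made up for illustration.

```python
import re

# Matches the warning text quoted in the post, capturing the searcher count.
WARNING_RE = re.compile(r"PERFORMANCE WARNING: Overlapping onDeckSearchers=(\d+)")


def overlapping_searcher_counts(lines):
    """Return the onDeckSearchers value from every matching log line, in order."""
    counts = []
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


# Hypothetical usage against a log file (path is an assumption):
# with open("solr.log") as log:
#     counts = overlapping_searcher_counts(log)
#     print(len(counts), "warnings, max overlap", max(counts, default=0))
```

If the counts keep climbing between commits, searchers are probably
piling up faster than they can warm.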
>>>
>>> In tests, we've only ever seen this problem when using faceting, and
>>> facet.method=fc.
>>>
>>> Some solutions to this are:
>>>     Reduce the commit rate to allow searchers to fully warm before the
>>> next commit
>>>     Reduce or eliminate the autowarming in caches
>>>     Both of the above
>>>
>>> The trouble is, if you're doing NRT commits, you likely have a good
>>> reason for it, and reducing/eliminating autowarming will very
>>> significantly impact search performance in high commit rate
>>> environments.
>>>
>>> Solution:
>>> Here are some setup steps we've used that allow lots of faceting (we
>>> typically search with at least 20-35 different facet fields, and date
>>> faceting/sorting) on large indexes, and still keep decent search
>>> performance:
>>>
>>> 1. Firstly, you should consider using the enum method for facet
>>> searches (facet.method=enum) unless you've got A LOT of memory on your
>>> machine. In our tests, this method uses a lot less memory and
>>> autowarms more quickly than fc. (Note, I've not tried the new
>>> segment-based 'fcs' option, as I can't find support for it in
>>> branch_3x - looks nice for 4.x though)
>>> Admittedly, for our data, enum is not quite as fast for searching as
>>> fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile
>>> tradeoff.
>>> If you do have access to LOTS of memory, AND you can guarantee that
>>> the index won't grow beyond the memory capacity (i.e. you have some
>>> sort of deletion policy in place), fc can be a lot faster than enum
>>> when searching with lots of facets across many terms.
>>>
>>> 2. Secondly, we've found that LRUCache is faster at autowarming than
>>> FastLRUCache - in our tests, about 20% faster. Maybe this is just our
>>> environment - your mileage may vary.
>>>
>>> So, our filterCache section in solrconfig.xml looks like this:
>>>     <filterCache
>>>       class="solr.LRUCache"
>>>       size="3600"
>>>       initialSize="1400"
>>>       autowarmCount="3600"/>
>>>
>>> For a 28GB index, running in a quad-core x64 VMWare instance with 30
>>> warmed facet fields, Solr is running at ~4GB. The filterCache size
>>> shown in the stats is usually in the region of ~2400.
>>>
>>> 3. It's also a good idea to have some sort of
>>> firstSearcher/newSearcher event listener queries to allow new data to
>>> populate the caches.
>>> Of course, what you put in these is dependent on the facets you need/use.
>>> We've found a good combination is a firstSearcher with as many facets
>>> in the search as your environment can handle, then a subset of the
>>> most common facets for the newSearcher.
>>>
>>> 4. We also set:
>>>    <useColdSearcher>true</useColdSearcher>
>>> just in case.
>>>
>>> 5. Another key area for search performance with high commits is to use
>>> 2 Solr instances - one for the high commit rate indexing, and one for
>>> searching.
>>> The read-only searching instance can be a remote replica, or a local
>>> read-only instance that reads the same core as the indexing instance
>>> (for the latter, you'll need something that periodically refreshes -
>>> i.e. runs commit()).
>>> This way, you can tune the indexing instance for writing performance
>>> and the searching instance as above for max read performance.
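
To make point 1 above concrete, here is a rough sketch of building a
facet request that sets facet.method=enum. The base URL and the field
names category and author are placeholders, and facet_query_url is a
hypothetical helper, not part of Solr.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, method="enum"):
    """Build a /select URL that facets on each given field using the chosen facet.method."""
    params = [
        ("q", query),
        ("facet", "true"),
        ("facet.method", method),
    ]
    # Solr accepts facet.field repeated once per field.
    params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical usage:
# print(facet_query_url("http://localhost:8983/solr", "*:*", ["category", "author"]))
```

The same parameters can be used in the firstSearcher/newSearcher
warming queries, so the caches are warmed with the method that is
actually used at search time.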
>>>
>>> Using the setup above, we get fantastic searching speed for small
>>> facet sets (well under 1sec), and really good searching for large
>>> facet sets (a couple of secs depending on index size, number of
>>> facets, unique terms etc. etc.),
>>> even when searching against largeish indexes (>20million docs).
>>> We have yet to see any OOM or GC errors using the techniques above,
>>> even in low memory conditions.
>>>
>>> I hope there are people who find this useful. I know I've spent a lot
>>> of time looking for stuff like this, so hopefully, this will save
>>> someone some time.
>>>
>>>
>>> Peter
>>>
