Solr 4.x has new NRT stuff included (uses latest Lucene 3.x, includes per-segment faceting etc.). The Solr 3.x branch doesn't currently..
On Fri, Sep 17, 2010 at 8:06 PM, Andy <angelf...@yahoo.com> wrote:
> Does Solr use Lucene NRT?

On Fri, Sep 17, 2010 at 1:05 PM, Erick Erickson <erickerick...@gmail.com> wrote:
Subject: Re: Tuning Solr caches with high commit rates (NRT)

> Near Real Time...
>
> Erick

On Fri, Sep 17, 2010 at 12:55 PM, Dennis Gearon <gear...@sbcglobal.net> wrote:
> BTW, what is NRT?
>
> Dennis Gearon

On Fri, Sep 17, 2010 at 2:18 AM, Peter Sturge <peter.stu...@gmail.com> wrote:

Hi,

It's great to see such a fantastic response to this thread - NRT is alive and well!

I'm hoping to collate this information and add it to the wiki when I get a few free cycles (thanks Erik for the heads up).

In the meantime, here are a few additional tidbits that might prove useful:

1. The first thing to note is that the techniques/setup described in this thread don't fix the underlying potential for OutOfMemory errors - there can always be an index large enough to ask more memory of its JVM than is available for cache. These techniques do, however, mitigate the risk, and provide an efficient balance between memory use and search performance.

There are some interesting discussions going on for both Lucene and Solr regarding the '2 pounds of baloney into a 1 pound bag' issue of unbounded caches, with a number of interesting strategies. One strategy that I like, but haven't found in the discussion lists, is auto-limiting cache size/warming based on available resources (similar to the way file system caches use free memory). This would allow caches to adjust to their memory environment as indexes grow.

2. A note regarding lockType in solrconfig.xml for dual Solr instances: it's best not to use 'none' as a value for lockType - this sets the lockType to null and, as the source comments note, that is a recipe for disaster, so use 'simple' instead.

3. Chris mentioned setting maxWarmingSearchers to 1 as a way of minimizing the number of onDeckSearchers. This is a prudent move - thanks Chris for bringing this up!

All the best,
Peter

On Tue, Sep 14, 2010 at 2:00 PM, Peter Karich <peat...@yahoo.de> wrote:

Peter Sturge,

this was a nice hint, thanks again! If you are here in Germany anytime, I can invite you to a beer or an Apfelschorle! :-)
I only needed to change the lockType to none in the solrconfig.xml, disable the replication and set the data dir to the master data dir!

Regards,
Peter Karich.

Peter Karich had written earlier:

Hi Peter,

this scenario would be really great for us - I didn't know that this is possible and works, so: thanks!
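The lockType and maxWarmingSearchers settings from points 2 and 3 above translate into solrconfig.xml along these lines. This is only a sketch - element placement varies between Solr versions, so check the example solrconfig.xml shipped with your release:

```xml
<!-- point 2: for dual instances sharing an index, use 'simple',
     never 'none' (which disables locking entirely) -->
<lockType>simple</lockType>

<!-- point 3: allow only one warming searcher at a time, so overlapping
     onDeck searchers can't pile up under a high commit rate -->
<maxWarmingSearchers>1</maxWarmingSearchers>
```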
At the moment we are doing something similar by replicating to the read-only instance, but the replication is somewhat lengthy and resource-intensive at this data volume ;-)

Regards,
Peter.

Peter Sturge had answered:

1. You can run multiple Solr instances in separate JVMs, with both having their solr.xml configured to use the same index folder.
You need to be careful that one and only one of these instances will ever update the index at a time. The best way to ensure this is to use one instance for writing only, and the other as read-only, never writing to the index. This read-only instance is the one to tune for high search performance. Even though the RO instance doesn't write to the index, it still needs periodic (albeit empty) commits to kick off autowarming/cache refresh.

Depending on your needs, you might not need 2 separate instances. We need them because the 'write' instance is also doing a lot of metadata pre-write operations in the same JVM as Solr, and so has its own memory requirements.

2. We use sharding all the time, and it works just fine with this scenario, as the RO instance is simply another shard in the pack.

On Sun, Sep 12, 2010 at 8:46 PM, Peter Karich <peat...@yahoo.de> wrote:

Peter,

thanks a lot for your in-depth explanations!
Your findings will definitely be helpful for my next performance improvement tests :-)

Two questions:

1. How would I do that:
> or a local read-only instance that reads the same core as the indexing
> instance (for the latter, you'll need something that periodically
> refreshes - i.e. runs commit()).

2. Did you try sharding with your current setup (e.g. one big, nearly-static index and a tiny write+read index)?

Regards,
Peter.

Peter Sturge had written:

Hi,

Below are some notes regarding Solr cache tuning that should prove useful for anyone who uses Solr with frequent commits (e.g. <5min).

Environment:
Solr 1.4.1 or branch_3x trunk.
Note the 4.x trunk has lots of neat new features, so the notes here are likely less relevant to the 4.x environment.

Overview:
Our Solr environment makes extensive use of faceting, we perform commits every 30secs, and the indexes tend to be on the large-ish side (>20 million docs).
Note: for our data, when we commit, we are always adding new data, never changing existing data.
This type of environment can be tricky to tune, as Solr is more geared toward fast reads than frequent writes.

Symptoms:
If anyone has used faceting in searches where you are also performing frequent commits, you've likely encountered the dreaded OutOfMemory or GC Overhead Exceeded errors.
In high commit rate environments, this is almost always due to multiple 'onDeck' searchers and autowarming - i.e. new searchers don't finish autowarming their caches before the next commit() comes along and invalidates them.
Once this starts happening on a regular basis, your Solr JVM is likely to run out of memory eventually, as the number of searchers (and their cache arrays) keeps growing until the JVM dies of thirst.
To check if your Solr environment is suffering from this, turn on INFO level logging and look for: 'PERFORMANCE WARNING: Overlapping onDeckSearchers=x'.

In tests, we've only ever seen this problem when using faceting with facet.method=fc.

Some solutions to this are:
- Reduce the commit rate to allow searchers to fully warm before the next commit
- Reduce or eliminate the autowarming in caches
- Both of the above

The trouble is, if you're doing NRT commits, you likely have a good reason for it, and reducing/eliminating autowarming will very significantly impact search performance in high commit rate environments.

Solution:
Here are some setup steps we've used that allow lots of faceting (we typically search with at least 20-35 different facet fields, plus date faceting/sorting) on large indexes, while still keeping decent search performance:

1. Firstly, you should consider using the enum method for facet searches (facet.method=enum) unless you've got A LOT of memory on your machine.
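For reference, a request that selects the enum method might look like the following; the host/core URL and the field name 'category' are placeholders, not values from this thread:

```
http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&facet.method=enum
```

The method can also be set per field (e.g. f.category.facet.method=enum) if only some facet fields should use it.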
In our tests, this method uses a lot less memory and autowarms more quickly than fc. (Note: I've not tried the new segment-based 'fcs' option, as I can't find support for it in branch_3x - it looks nice for 4.x though.)
Admittedly, for our data, enum is not quite as fast for searching as fc, but short of purchasing a Taiwanese RAM factory, it's a worthwhile tradeoff.
If you do have access to LOTS of memory, AND you can guarantee that the index won't grow beyond the memory capacity (i.e. you have some sort of deletion policy in place), fc can be a lot faster than enum when searching with lots of facets across many terms.

2. Secondly, we've found that LRUCache is faster at autowarming than FastLRUCache - in our tests, about 20% faster. Maybe this is just our environment - your mileage may vary.

So, our filterCache section in solrconfig.xml looks like this:

    <filterCache
      class="solr.LRUCache"
      size="3600"
      initialSize="1400"
      autowarmCount="3600"/>

For a 28GB index, running in a quad-core x64 VMWare instance with 30 warmed facet fields, Solr is running at ~4GB. The filterCache size stat usually shows in the region of ~2400.

3. It's also a good idea to have some sort of firstSearcher/newSearcher event listener queries to allow new data to populate the caches.
Of course, what you put in these is dependent on the facets you need/use.
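A sketch of such listeners in the <query> section of solrconfig.xml; the queries and field names below are placeholders - substitute your own facet fields:

```xml
<!-- fired when the first searcher is created at startup: warm heavily -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
      <str name="facet.field">author</str>
    </lst>
  </arr>
</listener>

<!-- fired on every commit: warm only the most common facets -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>
```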
We've found a good combination is a firstSearcher with as many facets in the search as your environment can handle, then a subset of the most common facets for the newSearcher.

4. We also set:
    <useColdSearcher>true</useColdSearcher>
just in case.

5. Another key area for search performance with high commits is to use 2 Solr instances - one for the high commit rate indexing, and one for searching.
The read-only searching instance can be a remote replica, or a local read-only instance that reads the same core as the indexing instance (for the latter, you'll need something that periodically refreshes - i.e. runs commit()).
This way, you can tune the indexing instance for writing performance and the searching instance, as above, for max read performance.

Using the setup above, we get fantastic searching speed for small facet sets (well under 1sec), and really good searching for large facet sets (a couple of secs depending on index size, number of facets, unique terms etc.), even when searching against large-ish indexes (>20 million docs).
We have yet to see any OOM or GC errors using the techniques above, even in low memory conditions.

I hope there are people who find this useful. I know I've spent a lot of time looking for stuff like this, so hopefully this will save someone some time.

Peter
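As a footnote to point 5: the periodic empty commit on the local read-only instance can be driven by something as simple as a scheduled request. The host, port, and interval below are placeholders, not values from this thread:

```
# hypothetical cron entry: kick off an empty commit (and hence
# autowarming) on the read-only instance once a minute
* * * * * curl -s "http://localhost:8983/solr/update?commit=true" > /dev/null
```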