I try to adjust the new generation size so that it can handle all the allocations needed for HTTP requests. Those short-lived objects should never come from tenured space.
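For illustration, an explicitly sized new generation with the CMS collector looks something like this (the flag names are standard HotSpot options; the sizes just mirror the 2 GB new generation in an 8 GB heap described below and are not a recommendation for any particular machine):

    -Xms8g -Xmx8g
    -XX:NewSize=2g -XX:MaxNewSize=2g
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

With the new generation sized that way, the per-request objects die young and are collected cheaply instead of being promoted into tenured space.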
Even without facets, I run a pretty big new generation: 2 GB in an 8 GB heap. The tenured space will always grow in Solr, because objects ejected from cache have been around a while. Caches create garbage in tenured space.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 17, 2016, at 10:01 AM, Jeff Wartes <jwar...@whitepages.com> wrote:
>
> For what it's worth, I looked into reducing the allocation footprint of
> CollapsingQParserPlugin a bit, but without success. See
> https://issues.apache.org/jira/browse/SOLR-9125
>
> As it happened, I was collapsing on a field with such high cardinality that
> the chances of a query even doing much collapsing of interest were pretty low.
> That allowed me to use a vastly stripped-down version of
> CollapsingQParserPlugin with a *much* lower memory footprint, in exchange for
> collapsed document heads essentially being picked at random. (That is, when
> collapsing two documents, the one that gets returned is random.)
>
> If that's of interest, I could probably throw the code someplace public.
>
> On 6/16/16, 3:39 PM, "Cas Rusnov" <c...@manzama.com> wrote:
>
>> Hey, thanks for your reply.
>>
>> Running the suggested CMS config from Shawn, we're getting some nodes with
>> 30+ second pauses, I gather due to the large heap. Interestingly, the
>> scenario Jeff talked about is remarkably similar to ours (we use field
>> collapsing), including the performance aspects, and we are getting
>> concurrent mode failures both from new-space allocation failures and from
>> promotion failures. I suspect there's a lot of garbage building up. We're
>> going to run tests with field collapsing disabled and see if that makes a
>> difference.
>>
>> Cas
>>
>> On Thu, Jun 16, 2016 at 1:08 PM, Jeff Wartes <jwar...@whitepages.com> wrote:
>>
>>> Check your GC log for CMS "concurrent mode failure" messages.
>>>
>>> If a concurrent CMS collection fails, it does a stop-the-world pause while
>>> it cleans up using a *single thread*. This means the stop-the-world CMS
>>> collection in the failure case is typically several times slower than a
>>> concurrent CMS collection. The single-thread business also means it will be
>>> several times slower than the Parallel collector, which is probably what
>>> you're seeing. I understand that it needs to stop the world in this case,
>>> but I really wish the CMS failure would fall back to a Parallel collector
>>> run instead.
>>>
>>> The Parallel collector is always going to be the fastest at getting rid of
>>> garbage, but only because it stops all the application threads while it
>>> runs, so it has less complexity to deal with. That said, it's probably not
>>> going to be orders of magnitude faster than a (successfully) concurrent
>>> CMS collection.
>>>
>>> Regardless, the bigger the heap, the bigger the pause.
>>>
>>> If your application generates a lot of garbage, or can generate a lot of
>>> garbage very suddenly, CMS concurrent mode failures are more likely. You
>>> can turn down the -XX:CMSInitiatingOccupancyFraction value to give the CMS
>>> collection more of a head start, at the cost of more frequent collections.
>>> If that doesn't work, you can try a bigger heap, but you may eventually
>>> find yourself figuring out what about your query load generates so much
>>> garbage (or causes garbage spikes) and addressing that. Even G1 won't
>>> protect you from highly unpredictable garbage generation rates.
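For concreteness, giving CMS that head start looks something like the following; the value 50 is purely illustrative, and the right number depends on how fast the workload allocates:

    -XX:CMSInitiatingOccupancyFraction=50
    -XX:+UseCMSInitiatingOccupancyOnly

Without -XX:+UseCMSInitiatingOccupancyOnly the JVM treats the fraction only as an initial hint and then switches to its own adaptive heuristics, so the two flags are normally set together.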
>>> In my case, for example, I found that a very small subset of my queries
>>> were using the CollapsingQParserPlugin, which requires quite a lot of
>>> memory allocations, especially on a large index. Although generally this
>>> was fine, if I got several of these rare queries in a very short window,
>>> it would always spike enough garbage to cause CMS concurrent mode
>>> failures. The single-threaded concurrent-mode failure would then take long
>>> enough that the ZK heartbeat would fail, and things would just go downhill
>>> from there.
>>>
>>> On 6/15/16, 3:57 PM, "Cas Rusnov" <c...@manzama.com> wrote:
>>>
>>>> Hey Shawn! Thanks for replying.
>>>>
>>>> Yes, I meant HugePages not HugeTable, brain fart. I will give the
>>>> transparent-off option a go.
>>>>
>>>> I have attempted to use your CMS configs as-is and also the default
>>>> settings, and the cluster dies under our load (basically a node will get
>>>> a 35-60s GC STW, then the others in the shard will take the load, and
>>>> they will in turn get long STWs until the shard dies), which is why,
>>>> basically in a fit of desperation, I tried out ParallelGC and found it to
>>>> be half-way acceptable. I will run a test using your configs (and the
>>>> defaults) again just to be sure (since I'm certain the machine config has
>>>> changed since we used your unaltered settings).
>>>>
>>>> Thanks!
>>>> Cas
>>>>
>>>> On Wed, Jun 15, 2016 at 3:41 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>>>>
>>>>> On 6/15/2016 3:05 PM, Cas Rusnov wrote:
>>>>>> After trying many of the off-the-shelf configurations (including CMS
>>>>>> configurations but excluding G1GC, which we're still taking the
>>>>>> warnings about seriously), numerous tweaks, rumors, various instance
>>>>>> sizes, and all the rest, most of which regardless of heap size and
>>>>>> newspace size resulted in frequent 30+ second STW GCs, we settled on
>>>>>> the following configuration, which leads to occasional high GCs but
>>>>>> mostly stays between 10-20 second STWs every few minutes (which is
>>>>>> almost acceptable):
>>>>>>
>>>>>> -XX:+AggressiveOpts -XX:+UnlockDiagnosticVMOptions
>>>>>> -XX:+UseAdaptiveSizePolicy -XX:+UseLargePages -XX:+UseParallelGC
>>>>>> -XX:+UseParallelOldGC -XX:MaxGCPauseMillis=15000 -XX:MaxNewSize=12000m
>>>>>> -XX:ParGCCardsPerStrideChunk=4096 -XX:ParallelGCThreads=16 -Xms31000m
>>>>>> -Xmx31000m
>>>>>
>>>>> You mentioned something called "HugeTable" ... I assume you're talking
>>>>> about huge pages. If that's what you're talking about, have you also
>>>>> turned off transparent huge pages? If you haven't, you might want to
>>>>> completely disable huge pages in your OS. There's evidence that the
>>>>> transparent option can affect performance.
>>>>>
>>>>> I assume you've probably looked at my GC info at the following URL:
>>>>>
>>>>> http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
>>>>>
>>>>> The parallel collector is most definitely not a good choice. It does
>>>>> not optimize for latency. It's my understanding that it actually
>>>>> prefers full GCs, because it is optimized for throughput. Solr thrives
>>>>> on good latency; throughput doesn't matter very much.
>>>>>
>>>>> If you want to continue avoiding G1, you should definitely be using
>>>>> CMS. My recommendation right now would be to try the G1 settings on my
>>>>> wiki page under the heading "Current experiments" or the CMS settings
>>>>> just below that.
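On the transparent huge pages point above: on most Linux distributions you can check and disable THP for the running kernel with something like the commands below (the exact sysfs path varies by distro and kernel, and a permanent change usually needs a boot parameter or an init script):

    cat /sys/kernel/mm/transparent_hugepage/enabled
    echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

This is separate from explicit huge pages (the -XX:+UseLargePages option in the flags quoted above), which are reserved through the vm.nr_hugepages kernel setting.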
>>>>> The out-of-the-box GC tuning included with Solr 6 is probably a better
>>>>> option than the parallel collector you've got configured now.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>
>>>> --
>>>> Cas Rusnov, Engineer
>>>> Manzama <http://www.manzama.com>
>>
>> --
>> Cas Rusnov, Engineer
>> Manzama <http://www.manzama.com>