Hello again. Back to this topic: upgrading to 7.4 did not mysteriously fix the leak in our main text search collection, as I had so vigorously hoped. Again, it is SortedIntDocSet instances that leak consistently on the 15 minute index/commit interval.
Some facts:

* the problem started after upgrading from 7.2.1 to 7.3.0;
* it occurs only in our main text search collection; all other collections are unaffected;
* despite what I said earlier, it is so far unreproducible outside production, even when mimicking production as closely as we can;
* SortedIntDocSet instances and ConcurrentLRUCache$CacheEntry instances are both leaked on commit;
* filterCache is enabled using FastLRUCache;
* filter queries are simple field:value strings, plus three filter queries for time ranges using [NOW/DAY TO NOW+1DAY/DAY] syntax for 'today', 'last week' and 'last month', but those are rarely used;
* reloading the core manually frees OldGen;
* our custom URPs don't cause the problem; disabling them doesn't solve it;
* the collection uses custom extensions for QueryComponent and QueryElevationComponent, ExtendedDismaxQParser and MoreLikeThisQParser, a whole bunch of TokenFilters, and several DocTransformers. Since the problem is only reproducible on production, I really cannot switch these back to older Solr/Lucene versions;
* useFilterForSortedQuery was not defined in our config, so it was the default (true?). SOLR-11769 could be the culprit, so I disabled it just now, but only for the node running 7.4.0; the rest of the collection runs 7.2.1.

The 7.4.0 node with useFilterForSortedQuery=false now seems to have been running fine for the last three commits. Then again, while typing this, I may just have been lucky after so many hours/days of tediousness. To confirm, I will run 7.4.0 on a second node in the cluster, but with different values for useFilterForSortedQuery...

I am unlucky after all :( so I'll revert to 7.2.1 again (but why did it 'seem' to run fine for three commits?). We need this fixed, and it is clear that whatever I do, I am not one damn step closer to solving it. So what next? I need the list's help to find the leak.
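For reference, here is roughly what the relevant pieces of our solrconfig.xml and the date-range filters look like. The cache sizes and the field name are placeholders, not our actual production values:

```xml
<query>
  <!-- filterCache backed by FastLRUCache, which uses ConcurrentLRUCache
       internally; this matches the leaked ConcurrentLRUCache$CacheEntry
       instances seen in the heap -->
  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

  <!-- explicitly disabled on the 7.4.0 node to rule out SOLR-11769 -->
  <useFilterForSortedQuery>false</useFilterForSortedQuery>
</query>

<!-- The rarely used date-range filter queries are of this shape
     ('published' is a placeholder field name):
     fq=published:[NOW/DAY TO NOW+1DAY/DAY]             today
     fq=published:[NOW/DAY-7DAYS TO NOW+1DAY/DAY]       last week
     fq=published:[NOW/DAY-1MONTH TO NOW+1DAY/DAY]      last month -->
```

Note that the endpoints are all rounded with /DAY, so each of these should produce at most one new filterCache entry per day, not one per request.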
So please, if anyone has ideas, let me know.

Thanks,
Markus

-----Original message-----
> From: Shalin Shekhar Mangar <shalinman...@gmail.com>
> Sent: Friday 27th April 2018 12:11
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3 appears to leak
>
> Hi Markus,
>
> Can you give an idea of what your filter queries look like? Any custom
> plugins or things we should be aware of? Simply indexing artificial docs,
> querying and committing doesn't seem to reproduce the issue for me.
>
> On Thu, Apr 26, 2018 at 10:13 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
>
> > Hello,
> >
> > We just finished upgrading our three separate clusters from 7.2.1 to 7.3,
> > which went fine, except for our main text search collection: it appears
> > to leak memory on commit!
> >
> > After the initial upgrade we saw the cluster slowly starting to run out
> > of memory within about an hour and a half. We increased the heap in case
> > 7.3 just requires more of it, but the heap consumption graph is still
> > growing on each commit. Heap space cannot be reclaimed by forcing the
> > garbage collector to run; everything just piles up in the OldGen. Running
> > with this slightly larger heap, the first nodes will run out of memory
> > about two and a half hours after a cluster restart.
> >
> > The heap-eating cluster is a 2 shard / 3 replica system on separate
> > nodes. Each replica is about 50 GB in size and about 8.5 million
> > documents. On 7.2.1 it ran fine with just a 2 GB heap. With 7.3 and a
> > 2.5 GB heap, it just takes a little longer for it to run out of memory.
> >
> > I inspected reports shown by the sampler of VisualVM and spotted one
> > peculiarity: the number of instances of SortedIntDocSet kept growing on
> > each commit, by about the same amount as the number of cached filter
> > queries. But this doesn't happen on the logs cluster; SortedIntDocSet
> > instances are neatly collected there. The number of instances also
> > accounts for the number of commits since start-up times the cache sizes.
> >
> > Our other two clusters don't have this problem. One of them receives
> > very few commits per day, but the other receives data all the time; it
> > logs user interactions, so a large amount of data is coming in
> > constantly. I cannot reproduce it locally by indexing data and committing
> > all the time; the peak usage in OldGen stays about the same. But I can
> > reproduce it locally when I introduce queries and filter queries while
> > indexing pieces of data and committing it.
> >
> > So, what is the problem? I dug into the CHANGES.txt of both Lucene and
> > Solr, but nothing really caught my attention. Does anyone here have an
> > idea where to look?
> >
> > Many thanks,
> > Markus
>
> --
> Regards,
> Shalin Shekhar Mangar.