Hello again,

Back to this topic: upgrading to 7.4 didn't mysteriously fix the leak in our 
main text search collection, as I had so fervently hoped. Again, it is 
SortedIntDocSet instances that leak consistently on the 15-minute 
index/commit interval.

Some facts:
* the problem started after upgrading from 7.2.1 to 7.3.0;
* it occurs only in our main text search collection; all other collections 
are unaffected;
* despite what I said earlier, it is so far unreproducible outside 
production, even when mimicking production as closely as we can;
* both SortedIntDocSet and ConcurrentLRUCache$CacheEntry instances are 
leaked on commit;
* filterCache is enabled using FastLRUCache;
* filter queries are simple field:value matches on string fields, plus three 
rarely used time-range filter queries using [NOW/DAY TO NOW+1DAY/DAY]-style 
syntax for 'today', 'last week' and 'last month';
* manually reloading the core frees the OldGen;
* our custom URPs don't cause the problem; disabling them doesn't solve it;
* the collection uses custom extensions for QueryComponent and 
QueryElevationComponent, ExtendedDismaxQParser and MoreLikeThisQParser, a 
whole bunch of TokenFilters, and several DocTransformers; since the leak is 
only reproducible in production, I really cannot swap these out for the 
stock Solr/Lucene versions;
* useFilterForSortedQuery is/was not defined in solrconfig.xml, so it was at 
its default (true?); SOLR-11769 could be the culprit. I disabled it 
explicitly just now, but only on the node running 7.4.0; the rest of the 
collection runs 7.2.1;
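For reference, a sketch of how the relevant bits of our solrconfig.xml look (cache sizes here are illustrative, not our exact production values):

```xml
<query>
  <!-- filterCache backed by FastLRUCache; each cached filter entry holds a
       SortedIntDocSet, which is exactly what piles up in OldGen. -->
  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="0"/>

  <!-- Was absent (so running at the default); now set explicitly,
       but only on the 7.4.0 node. -->
  <useFilterForSortedQuery>false</useFilterForSortedQuery>
</query>
```

The filter queries themselves look like fq=category:books, plus the rarely used date ones such as fq=date:[NOW/DAY TO NOW+1DAY/DAY] for 'today' (the field names in these examples are made up).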

The 7.4.0 node with useFilterForSortedQuery=false now seems to be running 
fine, for the last three commits at least. Then again, while typing this, I 
may just have been lucky after so many hours/days of tedium. To confirm, I 
will run 7.4.0 on a second node in the cluster, but with different values 
for useFilterForSortedQuery...

I was unlucky after all :( so I'll revert to 7.2.1 again (but why did it 
'seem' to run fine for three commits?). We need this fixed, and it is clear 
that whatever I do, I am not one damn step closer to solving it. So what 
next? I need the list's help to find the leak.
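If anyone wants to check a node of theirs for the same symptom, this is roughly how the growth can be spotted without VisualVM: take two `jmap -histo:live` snapshots a commit apart and diff the instance counts. The histograms below are faked for illustration (real ones come from jmap against the Solr pid; the counts and the non-Solr class name are just examples):

```shell
# On a live node the two snapshots would be taken around a commit:
#   jmap -histo:live "$SOLR_PID" > before.txt
#   ...wait for the next commit...
#   jmap -histo:live "$SOLR_PID" > after.txt
# Fake snapshots in jmap's histogram format (num: #instances #bytes class):
cat > before.txt <<'EOF'
   1:          1000       64000  org.apache.solr.search.SortedIntDocSet
   2:           500       16000  java.lang.String
EOF
cat > after.txt <<'EOF'
   1:          2000      128000  org.apache.solr.search.SortedIntDocSet
   2:           510       16320  java.lang.String
EOF

# Reduce each snapshot to "class instance-count", join on class name,
# and print every class whose instance count grew between snapshots.
awk '{print $4, $2}' before.txt | sort > counts_before.txt
awk '{print $4, $2}' after.txt  | sort > counts_after.txt
join counts_before.txt counts_after.txt | awk '$3 > $2 {print $1, "+" ($3 - $2)}'
```

A class whose count keeps climbing by roughly the filterCache size on every commit, as SortedIntDocSet does for us, is the one to stare at.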

So please, thanks,
Markus
 
-----Original message-----
> From:Shalin Shekhar Mangar <shalinman...@gmail.com>
> Sent: Friday 27th April 2018 12:11
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3 appears to leak
> 
> Hi Markus,
> 
> Can you give an idea of what your filter queries look like? Any custom
> plugins or things we should be aware of? Simply indexing artificial docs,
> querying and committing doesn't seem to reproduce the issue for me.
> 
> On Thu, Apr 26, 2018 at 10:13 PM, Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Hello,
> >
> > We just finished upgrading our three separate clusters from 7.2.1 to 7.3,
> > which went fine, except for our main text search collection, which appears
> > to leak memory on commit!
> >
> > After initial upgrade we saw the cluster slowly starting to run out of
> > memory within about an hour and a half. We increased heap in case 7.3 just
> > requires more of it, but the heap consumption graph is still growing on
> > each commit. Heap space cannot be reclaimed by forcing the garbage
> > collector to run, everything just piles up in the OldGen. Running with this
> > slightly larger heap, the first nodes will run out of memory in about two
> > and a half hours after cluster restart.
> >
> > The heap eating cluster is a 2shard/3replica system on separate nodes.
> > Each replica is about 50 GB in size and about 8.5 million documents. On
> > 7.2.1 it ran fine with just a 2 GB heap. With 7.3 and 2.5 GB heap, it will
> > take just a little longer for it to run out of memory.
> >
> > I inspected reports from the VisualVM sampler and spotted one peculiarity:
> > the number of SortedIntDocSet instances kept growing on each commit, by
> > about the same amount as the number of cached filter queries. This doesn't
> > happen on the logs cluster; SortedIntDocSet instances are neatly collected
> > there. The number of instances also matches the number of commits since
> > start-up times the cache size.
> >
> > Our other two clusters don't have this problem. One of them receives very
> > few commits per day, but the other logs user interactions, so a large
> > amount of data is coming in all the time. I cannot reproduce the leak
> > locally just by indexing data and committing constantly; the peak usage in
> > OldGen stays about the same. But I can reproduce it locally when I
> > introduce queries and filter queries while indexing pieces of data and
> > committing them.
> >
> > So, what is the problem? I dug through the CHANGES.txt of both Lucene and
> > Solr, but nothing really caught my attention. Does anyone here have an
> > idea where to look?
> >
> > Many thanks,
> > Markus
> >
> 
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
