[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching
[ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946440#comment-16946440 ]

ASF subversion and git services commented on SOLR-13790:

Commit c0a446b179e8091f84e795ab04c6c3fcc9396ebe in lucene-solr's branch refs/heads/jira/SOLR-13821 from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c0a446b ]

SOLR-13790: LRUStatsCache size explosion and ineffective caching.

> LRUStatsCache size explosion and ineffective caching
>
>                 Key: SOLR-13790
>                 URL: https://issues.apache.org/jira/browse/SOLR-13790
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public)
>    Affects Versions: 7.7.2, 8.2, 8.3
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Critical
>             Fix For: 7.7.3, 8.3
>
>         Attachments: SOLR-13790.patch, SOLR-13790.patch, SOLR-13790.patch
>
>
> On a sizeable cluster with multi-shard multi-replica collections, when
> {{LRUStatsCache}} was in use we encountered excessive memory usage, which
> consequently led to severe performance problems.
> On closer examination of the heap dumps it became apparent that when
> {{LRUStatsCache.addToPerShardTermStats}} is called it creates instances of
> {{FastLRUCache}} keyed by the passed {{shard}} argument - however, the value of
> this argument is not a simple shard name but a randomly ordered
> list of ALL replica URLs for this shard.
> As a result, due to the combinatorial number of possible keys, over time the
> map in {{LRUStatsCache.perShardTermStats}} grew to contain ~2 million entries...
> The fix seems to be simply to extract the shard name and cache using this
> name instead of the full string value of the {{shard}} parameter. Existing
> unit tests also need much improvement.
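
The key explosion described in the issue can be reproduced in isolation. The sketch below is only an illustration, not the code from the attached patch: the `|`-separated replica URL format, the `<collection>_<shard>_replica_<id>` core naming, and the helper names `shardKey` and `shuffled` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShardKeyDemo {
    // Assumed core naming convention: <collection>_<shardName>_replica_<id>
    private static final Pattern SHARD_NAME = Pattern.compile("_(shard\\d+)_replica");

    /** Hypothetical helper: derive a stable cache key from the raw shard string. */
    static String shardKey(String shard) {
        Matcher m = SHARD_NAME.matcher(shard);
        // Fall back to the raw value when no shard name can be recognized.
        return m.find() ? m.group(1) : shard;
    }

    /** Joins the replica URLs in random order, mimicking a shuffled replica list. */
    static String shuffled(List<String> replicas, Random rnd) {
        List<String> copy = new ArrayList<>(replicas);
        Collections.shuffle(copy, rnd);
        return String.join("|", copy);
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList(
            "http://host1:8983/solr/coll_shard1_replica_n1/",
            "http://host2:8983/solr/coll_shard1_replica_n2/",
            "http://host3:8983/solr/coll_shard1_replica_n3/",
            "http://host4:8983/solr/coll_shard1_replica_n4/");

        Random rnd = new Random(42);
        Set<String> rawKeys = new HashSet<>();
        Set<String> stableKeys = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            String shard = shuffled(replicas, rnd);
            rawKeys.add(shard);              // one key per observed ordering
            stableKeys.add(shardKey(shard)); // one key per shard
        }
        // With 4 replicas there can be up to 4! = 24 distinct raw keys for ONE shard.
        System.out.println("raw keys:    " + rawKeys.size());
        System.out.println("stable keys: " + stableKeys.size());
    }
}
```

With n replicas per shard the raw string can take up to n! forms, and every form gets its own cache instance, so the map grows with the number of orderings rather than the number of shards. Keying on the shard name keeps exactly one entry per shard.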
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching
[ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946207#comment-16946207 ]

ASF subversion and git services commented on SOLR-13790:

Commit 611966ec7b4d200568395ec4d1f1d353453ce9e2 in lucene-solr's branch refs/heads/branch_8x from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=611966e ]

SOLR-13790: LRUStatsCache size explosion and ineffective caching.
[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching
[ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946081#comment-16946081 ]

ASF subversion and git services commented on SOLR-13790:

Commit c0a446b179e8091f84e795ab04c6c3fcc9396ebe in lucene-solr's branch refs/heads/master from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c0a446b ]

SOLR-13790: LRUStatsCache size explosion and ineffective caching.
[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching
[ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945948#comment-16945948 ]

Andrzej Bialecki commented on SOLR-13790:

Final patch:
* added stats caching metrics
* improved SolrCloud tests
[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching
[ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943149#comment-16943149 ]

David Wayne Smiley commented on SOLR-13790:

Interesting; your proposal makes sense. Thanks [~ab].
[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching
[ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942908#comment-16942908 ]

Andrzej Bialecki commented on SOLR-13790:

Oh, and until the staleness issue is fixed I would recommend using only {{ExactStatsCache}} - other implementations can only make matters worse, both in terms of memory use and scoring inaccuracies.
[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching
[ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942904#comment-16942904 ]

Andrzej Bialecki commented on SOLR-13790:

Upon further examination it looks like {{ExactSharedStatsCache}} and {{LRUStatsCache}} have a problem with staleness - they don't track updates in the shards, so they have no way of knowing when to refresh the stats. As a result the global stats may be even more wrong than if we used just local stats - imagine a scenario where heavy indexing activity adds a lot of terms and postings. In this scenario the stats from the local shard would reflect this growth, albeit partially, but the stale global stats would not.

Another issue is with the purported optimization in {{LRUStatsCache}} and {{ExactSharedStatsCache}} - the claimed advantage of these caches is that they help to avoid unnecessary fetching of stats from shards. Only they don't ... as explained in my previous comment, both of these implementations always send ShardRequest-s to fetch the stats, thus adding one more round-trip to every query. Since the stats are fetched on every request at least there was no problem with staleness ;) but the "caching" aspect was completely false - per-shard stats were being fetched on every request, and on every request new global stats would be built and sent out.

I plan to address these issues separately; the current patch is already large.

Updated patch with the following additional changes:
* the biggest change is that StatsCache instances are now tied to SolrIndexSearcher and its life-cycle and not to SolrCore - this helps to at least mitigate the problem of staleness and also the problem of unbounded memory consumption of {{ExactSharedStatsCache}}. The downside is that after every commit the cache needs to be re-populated.
* more optimization and safety in StatsUtil serialization code
* fixed a bug in {{DebugComponent}} where only local stats would be used for explanations - this threw me off for a while, as I relied on explanations to explain the details of scoring :)
* added more substance to SolrCloud unit tests

All tests are passing. If there are no objections I'd like to commit this shortly.
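
The effect of tying a stats cache to the searcher's life-cycle, as described in the comment above, can be sketched with plain Java collections. This is a simplified model and not Solr code: `Searcher` and `Core` here are hypothetical stand-ins for SolrIndexSearcher and SolrCore, and the `fetch` function stands in for a round-trip to the shards.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SearcherScopedCacheDemo {
    /** Hypothetical stand-in for a searcher: one instance per commit point. */
    static final class Searcher {
        final long generation;
        // The stats cache lives and dies with this searcher instance.
        final Map<String, Long> termDocFreq = new ConcurrentHashMap<>();

        Searcher(long generation) {
            this.generation = generation;
        }

        /** Returns the cached doc frequency, fetching it only on the first request. */
        long docFreq(String term, Function<String, Long> fetchFromShards) {
            return termDocFreq.computeIfAbsent(term, fetchFromShards);
        }
    }

    /** Hypothetical stand-in for a core: swaps in a new searcher on commit. */
    static final class Core {
        private volatile Searcher current = new Searcher(0);

        Searcher searcher() {
            return current;
        }

        void commit() {
            // The old searcher, and the cache it owns, becomes garbage.
            current = new Searcher(current.generation + 1);
        }
    }

    public static void main(String[] args) {
        Core core = new Core();
        long[] indexState = {10}; // pretend doc frequency of "solr" across shards
        Function<String, Long> fetch = term -> indexState[0];

        System.out.println(core.searcher().docFreq("solr", fetch)); // fetched: 10
        indexState[0] = 25;                                         // indexing continues
        System.out.println(core.searcher().docFreq("solr", fetch)); // cached: 10
        core.commit();                                              // new searcher, empty cache
        System.out.println(core.searcher().docFreq("solr", fetch)); // re-fetched: 25
    }
}
```

In this model the cache can be stale only until the next commit, and its memory is released together with the searcher that owns it. The cost is the one named in the comment: every commit starts with an empty cache that has to be re-populated.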