[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

2019-10-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946440#comment-16946440
 ] 

ASF subversion and git services commented on SOLR-13790:


Commit c0a446b179e8091f84e795ab04c6c3fcc9396ebe in lucene-solr's branch 
refs/heads/jira/SOLR-13821 from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c0a446b ]

SOLR-13790: LRUStatsCache size explosion and ineffective caching.


> LRUStatsCache size explosion and ineffective caching
> 
>
> Key: SOLR-13790
> URL: https://issues.apache.org/jira/browse/SOLR-13790
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Affects Versions: 7.7.2, 8.2, 8.3
> Reporter: Andrzej Bialecki
> Assignee: Andrzej Bialecki
> Priority: Critical
> Fix For: 7.7.3, 8.3
>
> Attachments: SOLR-13790.patch, SOLR-13790.patch, SOLR-13790.patch
>
>
> On a sizeable cluster with multi-shard, multi-replica collections, when 
> {{LRUStatsCache}} was in use we encountered excessive memory usage, which 
> consequently led to severe performance problems.
> On closer examination of the heap dumps it became apparent that when 
> {{LRUStatsCache.addToPerShardTermStats}} is called it creates instances of 
> {{FastLRUCache}} using the passed {{shard}} argument as the key. However, the 
> value of this argument is not a simple shard name; it is a randomly ordered 
> list of ALL replica URLs for this shard.
> As a result, due to the combinatorial number of possible keys, over time the 
> map in {{LRUStatsCache.perShardTermStats}} grew to contain ~2 million entries.
> The fix seems to be simply to extract the shard name and cache using this 
> name instead of the full string value of the {{shard}} parameter. Existing 
> unit tests also need much improvement.
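
The key explosion described in the issue can be illustrated with a minimal sketch. It is not the committed patch: the class {{ShardKeySketch}}, the helper names, the host names, and the assumed core naming scheme ({{<collection>_<shard>_replica_<id>}}, with replica URLs joined by {{|}}) are all hypothetical and chosen only for illustration. It enumerates every ordering of three replica URLs for one shard and compares how many distinct cache keys result from the raw parameter versus the extracted shard name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShardKeySketch {
  // Assumption: core names look like <collection>_<shardName>_replica_<id>.
  private static final Pattern CORE_NAME =
      Pattern.compile("/[^/|]+_(shard[^_/|]+)_replica_[^/|]+");

  /** Returns the shard name found in the parameter, or the parameter itself if none matches. */
  static String shardName(String shardParam) {
    Matcher m = CORE_NAME.matcher(shardParam);
    return m.find() ? m.group(1) : shardParam;
  }

  /** Builds every ordering of the replica URLs, each joined with '|'. */
  static List<String> orderings(List<String> urls) {
    List<String> out = new ArrayList<>();
    permute(new ArrayList<>(urls), 0, out);
    return out;
  }

  private static void permute(List<String> urls, int k, List<String> out) {
    if (k == urls.size()) {
      out.add(String.join("|", urls));
      return;
    }
    for (int i = k; i < urls.size(); i++) {
      Collections.swap(urls, k, i);   // choose element i for position k
      permute(urls, k + 1, out);
      Collections.swap(urls, k, i);   // restore the original order
    }
  }

  public static void main(String[] args) {
    // Hypothetical replica URLs for a single shard.
    List<String> replicas = Arrays.asList(
        "http://host1:8983/solr/coll_shard1_replica_n1/",
        "http://host2:8983/solr/coll_shard1_replica_n2/",
        "http://host3:8983/solr/coll_shard1_replica_n3/");
    Set<String> rawKeys = new HashSet<>();
    Set<String> nameKeys = new HashSet<>();
    for (String param : orderings(replicas)) {
      rawKeys.add(param);              // keyed by the full parameter value
      nameKeys.add(shardName(param));  // keyed by the extracted shard name
    }
    System.out.println("raw keys: " + rawKeys.size()
        + ", shard-name keys: " + nameKeys.size());
  }
}
```

With three replicas the raw parameter already yields six distinct keys for a single shard, while the shard name yields one. The number of orderings grows factorially with the replica count, which is consistent with the map growth reported above.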



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

2019-10-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946207#comment-16946207
 ] 

ASF subversion and git services commented on SOLR-13790:


Commit 611966ec7b4d200568395ec4d1f1d353453ce9e2 in lucene-solr's branch 
refs/heads/branch_8x from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=611966e ]

SOLR-13790: LRUStatsCache size explosion and ineffective caching.





[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

2019-10-07 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946081#comment-16946081
 ] 

ASF subversion and git services commented on SOLR-13790:


Commit c0a446b179e8091f84e795ab04c6c3fcc9396ebe in lucene-solr's branch 
refs/heads/master from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c0a446b ]

SOLR-13790: LRUStatsCache size explosion and ineffective caching.





[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

2019-10-07 Thread Andrzej Bialecki (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16945948#comment-16945948
 ] 

Andrzej Bialecki commented on SOLR-13790:
-

Final patch:
 * added stats caching metrics
 * improved SolrCloud tests




[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

2019-10-02 Thread David Wayne Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16943149#comment-16943149
 ] 

David Wayne Smiley commented on SOLR-13790:
---

Interesting; your proposal makes sense.  Thanks [~ab].




[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

2019-10-02 Thread Andrzej Bialecki (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942908#comment-16942908
 ] 

Andrzej Bialecki commented on SOLR-13790:
-

Oh, and until the staleness issue is fixed I would recommend using only 
{{ExactStatsCache}} - other implementations can only make matters worse, both 
in terms of memory use and scoring inaccuracies.




[jira] [Commented] (SOLR-13790) LRUStatsCache size explosion and ineffective caching

2019-10-02 Thread Andrzej Bialecki (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942904#comment-16942904
 ] 

Andrzej Bialecki commented on SOLR-13790:
-

Upon further examination it looks like {{ExactSharedStatsCache}} and 
{{LRUStatsCache}} have a problem with staleness: they don't track updates in 
the shards, so they have no way of knowing when to refresh the stats. As a 
result the global stats may be even more wrong than if we used just local stats. 
Imagine a scenario where heavy indexing activity adds a lot of 
terms and postings. In this scenario the stats from the local shard would 
reflect this growth, albeit partially, but the stale global stats 
would not.
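
To make the effect of stale stats concrete, here is a small sketch that computes a BM25-style IDF, using the formula found in Lucene's {{BM25Similarity}}, from two sets of statistics. The class name {{StaleIdfSketch}} and all document counts and frequencies are hypothetical numbers picked for illustration, not measurements from any cluster.

```java
class StaleIdfSketch {
  /** BM25-style inverse document frequency: ln(1 + (N - n + 0.5) / (n + 0.5)). */
  static double idf(long docFreq, long docCount) {
    return Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
  }

  public static void main(String[] args) {
    // Hypothetical numbers, for illustration only.
    long staleDocCount = 1_000_000L;  // cached global stats, never refreshed
    long staleDocFreq = 10L;
    long freshDocCount = 1_200_000L;  // what the shards hold after heavy indexing
    long freshDocFreq = 50_000L;

    System.out.println("idf from stale global stats: " + idf(staleDocFreq, staleDocCount));
    System.out.println("idf from fresh stats:        " + idf(freshDocFreq, freshDocCount));
  }
}
```

In this made-up scenario the term has become common, yet the stale stats still weight it as if it were rare, so scores computed from the cached global stats would overstate its importance.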

Another issue is with the purported optimization in {{LRUStatsCache}} and 
{{ExactSharedStatsCache}}: the claimed advantage of these caches is that they 
help to avoid unnecessary fetching of stats from shards. Only they don't ... as 
explained in my previous comment, both of these implementations always send 
{{ShardRequest}}s to fetch the stats, thus adding one more round-trip to every 
query. Since the stats are fetched on every request, at least there was no 
problem with staleness ;) but the "caching" aspect was completely false: 
per-shard stats were being fetched on every request, and on every request new 
global stats would be built and sent out.

I plan to address these issues separately, as the current patch is already large.

Updated patch with the following additional changes:
 * the biggest change is that {{StatsCache}} instances are now tied to 
{{SolrIndexSearcher}} and its life-cycle and not to {{SolrCore}}. This helps to at 
least mitigate the problem of staleness and also the problem of unbounded memory 
consumption in {{ExactSharedStatsCache}}. The downside is that after every 
commit the cache needs to be re-populated.
 * more optimization and safety in the {{StatsUtil}} serialization code
 * fixed a bug in {{DebugComponent}} where only local stats would be used for 
explanations. This threw me off for a while, as I relied on explanations to 
explain the details of scoring :)
 * added more substance to SolrCloud unit tests
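
The first change in the list above, tying the cache to the searcher rather than the core, can be sketched roughly as follows. This is only an illustration of the life-cycle idea, not code from the patch: {{Core}}, {{Searcher}} and {{StatsCache}} here are hypothetical stand-ins and do not reproduce the real Solr classes or their APIs.

```java
import java.util.HashMap;
import java.util.Map;

class SearcherScopedCacheSketch {
  /** Hypothetical stand-in for a stats cache: term -> document frequency. */
  static final class StatsCache {
    private final Map<String, Long> docFreqByTerm = new HashMap<>();

    void put(String term, long docFreq) { docFreqByTerm.put(term, docFreq); }
    Long get(String term) { return docFreqByTerm.get(term); }
    int size() { return docFreqByTerm.size(); }
  }

  /** Hypothetical stand-in for a searcher; it owns its cache for its whole life. */
  static final class Searcher {
    final long generation;
    final StatsCache statsCache = new StatsCache();

    Searcher(long generation) { this.generation = generation; }
  }

  /** Hypothetical stand-in for a core; each commit opens a new searcher. */
  static final class Core {
    private Searcher current = new Searcher(0);

    Searcher searcher() { return current; }

    void commit() {
      // The old searcher, and the cache it owns, become unreachable here.
      current = new Searcher(current.generation + 1);
    }
  }

  public static void main(String[] args) {
    Core core = new Core();
    core.searcher().statsCache.put("solr", 42L);
    System.out.println("entries before commit: " + core.searcher().statsCache.size());
    core.commit();
    System.out.println("entries after commit:  " + core.searcher().statsCache.size());
  }
}
```

Because the cache is dropped together with its searcher, stale entries cannot outlive a commit and memory use is bounded by the searcher's lifetime. The cost, as noted above, is that the cache starts empty after every commit.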

All tests are passing. If there are no objections I'd like to commit this 
shortly.
