[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-09-10 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193650#comment-17193650
 ] 

Michael Gibney commented on SOLR-13807:
---

As a side effect of [PR #1357|https://github.com/apache/lucene-solr/pull/1357] 
for this issue, SOLR-9724 is also fixed.

> Caching for term facet counts
> -
>
> Key: SOLR-13807
> URL: https://issues.apache.org/jira/browse/SOLR-13807
> Project: Solr
>  Issue Type: New Feature
>  Components: Facet Module
>Affects Versions: master (9.0), 8.2
>Reporter: Michael Gibney
>Priority: Minor
> Attachments: SOLR-13807-benchmarks.tgz, 
> SOLR-13807__SOLR-13132_test_stub.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr does not have a facet count cache; so for _every_ request, term facets 
> are recalculated for _every_ (facet) field, by iterating over _every_ field 
> value for _every_ doc in the result domain, and incrementing the associated 
> count.
> As a result, subsequent requests end up redoing a lot of the same work, 
> including all associated object allocation, GC, etc. This situation could 
> benefit from integrated caching.
> Because of the domain-based, serial/iterative nature of term facet 
> calculation, latency is proportional to the size of the result domain. 
> Consequently, one common/clear manifestation of this issue is high latency 
> for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be 
> observed on a top-level landing page that exposes facets. This type of 
> "static" case is often mitigated by external (to Solr) caching, either with a 
> caching layer between Solr and a front-end application, or within a front-end 
> application, or even with a caching layer between the end user and a 
> front-end application.
> But in addition to the overhead of handling this caching elsewhere in the 
> stack (or, for a new user, even being aware of this as a potential issue to 
> mitigate), any external caching mitigation is really only appropriate for 
> relatively static cases like the "landing page" example described above. A 
> Solr-internal facet count cache (analogous to the {{filterCache}}) would 
> provide the following additional benefits:
>  # ease of use/out-of-the-box configuration to address a common performance 
> concern
>  # compact (specifically caching count arrays, without the extra baggage that 
> accompanies a naive external caching approach)
>  # NRT-friendly (could be implemented to be segment-aware)
>  # modular, capable of reusing the same cached values in conjunction with 
> variant requests over the same result domain (this would support common use 
> cases like paging, but also potentially more interesting direct uses of 
> facets). 
>  # could be used for distributed refinement (i.e., if facet counts over a 
> given domain are cached, a refinement request could simply look up the 
> ordinal value for each enumerated term and directly grab the count out of the 
> count array that was cached during the first phase of facet calculation)
>  # composable (e.g., in aggregate functions that calculate values based on 
> facet counts across different domains, like SKG/relatedness – see SOLR-13132)






[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-08-17 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179234#comment-17179234
 ] 

Michael Gibney commented on SOLR-13807:
---

aba1d18797c99b46edf211ff26b989a4d23f625b adds the ability to configure the docset 
domain size threshold ({{countCacheDf}}, which governs whether the termFacetCache 
is consulted) separately for the base domain, fgSet, and bgSet. This is practically 
useful for the common case where one might want to dedicate the termFacetCache 
primarily to caching (large, static) bgSets, but accumulate facet counts for the 
base domain and fgSet "normally" (i.e., without consulting the termFacetCache). One 
nice side effect is that this allows the benchmarks above to demonstrate, in a more 
isolated way, the different potential benefits of the termFacetCache wrt 
sort-by-{{relatedness}}.
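
To make the threshold semantics concrete, here is a minimal, self-contained sketch 
(hypothetical names, not the PR's actual API) of the per-domain consultation 
decision, assuming the threshold is compared against the domain's {{DocSet.size()}}, 
with -1 disabling consultation entirely:
{code:java}
// Hypothetical sketch of per-domain termFacetCache consultation thresholds.
// Assumed semantics (from the discussion): countCacheDf is compared against the
// domain DocSet size; -1 disables consultation; 0 (the default) always consults.
public class CountCacheDfSketch {
  enum DomainType { BASE, FG_SET, BG_SET }

  private final int baseDf, fgDf, bgDf;

  CountCacheDfSketch(int baseDf, int fgDf, int bgDf) {
    this.baseDf = baseDf;
    this.fgDf = fgDf;
    this.bgDf = bgDf;
  }

  boolean consultCache(DomainType type, int domainDocSetSize) {
    final int threshold;
    switch (type) {
      case BASE:   threshold = baseDf; break;
      case FG_SET: threshold = fgDf;   break;
      default:     threshold = bgDf;   break;
    }
    return threshold >= 0 && domainDocSetSize >= threshold;
  }

  public static void main(String[] args) {
    // Dedicate the cache to the (large, static) bgSet only:
    CountCacheDfSketch policy = new CountCacheDfSketch(-1, -1, 0);
    System.out.println(policy.consultCache(DomainType.BASE, 1_000));       // false
    System.out.println(policy.consultCache(DomainType.BG_SET, 3_000_000)); // true
  }
}
{code}
The benchmarks below exercise these kinds of settings (cache disabled everywhere, 
bgSet-only, and all three domains).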

{code}
[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # no cache / cache maxSize=0
cdnlty: 10  100 1k  10k 100k    1m
.1% 86  85  90  90  93  130
1%  89  87  92  91  96  138
10% 128 122 125 125 145 188
20% 145 137 142 143 162 242
30% 164 153 160 166 180 266
40% 186 176 180 180 201 295
50% 206 188 193 197 216 326
99.99%  198 196 200 200 220 332

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # all caching disabled at query level (countCacheDf=-1)
cdnlty: 10  100 1k  10k 100k    1m
.1% 88  86  91  89  94  121
1%  89  87  91  92  96  131
10% 124 122 127 126 146 186
20% 142 139 144 144 166 241
30% 160 156 162 160 213 264
40% 181 180 183 183 206 301
50% 196 195 197 196 216 324
99.99%  199 195 202 204 223 331

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # bg-countCacheDf=0 (default; cache size->6 -- 1 per field)
cdnlty: 10  100 1k  10k 100k    1m
.1% 0   0   0   0   1   4
1%  1   1   1   2   5   8
10% 22  22  28  25  37  61
20% 42  42  43  44  59  99
30% 61  60  62  63  82  129
40% 80  80  84  84  101 165
50% 99  98  101 102 122 199
99.99%  122 118 123 124 142 231

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # base, fg, bg countCacheDf=0 (default; cache size->84)
cdnlty: 10  100 1k  10k 100k    1m
.1% 0   0   0   1   1   9
1%  0   0   0   1   4   12
10% 3   3   3   4   13  38
20% 6   6   6   7   14  47
30% 9   8   8   9   17  56
40% 11  10  11  10  21  64
50% 14  13  13  14  21  75
99.99%  25  24  24  25  34  108
{code}

The first two above demonstrate parity between disabling caching at the 
collection config level and at the query level. The third illustrates the 
result of consulting the termFacetCache only for the bgSet. The last consults 
the termFacetCache across the board (base, fg, bg). One takeaway here is that 
even a _very_ modestly configured termFacetCache gives a ~10x performance boost 
for sort-by-skg against low-recall base domains, and a "worst"-case boost (for 
high-cardinality domains) of ~33% -- and because of the static nature of the 
bgSet, this boost can be expected to be consistent (unlike the hit-or-miss gains 
that are generally characteristic of caching).

The first two count-only benchmarks (below) demonstrate analogous parity between 
the different ways of disabling caching, and the third shows a significant 
performance boost for count-only (sort-by-count) faceting with default 
termFacetCache consultation.

{code}
[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s count # no cache/maxSize=0
cdnlty: 10  100 1k  10k 100k    1m
.1% 0   0   0   0   0   4
1%  1   1   1   1   1   5
10% 9   8   9   8   11  15
20% 16  15  17  18  20  32
30% 23  21  23  22  26  42
40% 29  28  31  29  34  54
50% 37  34  37  35  41  65
99.99%  70  63  70  67  74  112

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s count # caching disabled at query time (countCacheDf=-1)
cdnlty: 10  100 1k  10k 100k    1m
.1% 0   0   0   0   0   4
1%  1   1   1   1   2   6
10% 7   8   8   9   11  14
20% 16  15  

[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-08-13 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177108#comment-17177108
 ] 

Michael Gibney commented on SOLR-13807:
---

Quick follow-up regarding skg on master vs. 
77daac4ae2a4d1c40652eafbbdb42b582fe2d02d with no cache configured: the apparent 
performance improvement is simultaneously a little misleading (depending on 
perspective) _and also_ likely to manifest in practice for many common use 
cases. This is a byproduct of the fact that in order to support caching, we 
need to retain "query" references (for the purpose of building query-based 
cache keys), not just domain DocSets. Once this was done, these query keys (and 
in some cases extra deduping based on actual docsets) are used to ensure that 
"sweep" count accumulation runs only once for each unique domain. So for 
the common/default case where fgQ=q, and fgSet (as determined by 
fgFilters=fgQ∩domainFilters) is the same as the base domain, sweep collection 
only happens over the base domain (deduped with fgSet) and the bgSet -- 2 
domains instead of 3. I ran the benchmarks forcing fgSet to be (slightly) 
different from the base domain, and performance is comparable to master (as one 
would expect/hope).
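
As a self-contained illustration of the dedup behavior just described (hypothetical 
types, not the PR's actual classes): once each domain carries a query-derived key, 
identical keys collapse to a single sweep.
{code:java}
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of deduping sweep domains by a query-based key: when fgQ == q
// (so fgSet resolves to the same key as the base domain), only two unique sweeps
// remain (base/fg and bg) instead of three.
public class SweepDedupSketch {
  record Domain(String name, String queryKey) {}

  public static void main(String[] args) {
    List<Domain> requested = List.of(
        new Domain("base",  "q={!lucene}foo"),
        new Domain("fgSet", "q={!lucene}foo"),  // fgQ=q, same key as base
        new Domain("bgSet", "q=*:*"));          // static background set

    // Accumulate counts once per unique key; later domains reuse the first sweep.
    Map<String, Domain> uniqueByKey = new LinkedHashMap<>();
    for (Domain d : requested) {
      uniqueByKey.putIfAbsent(d.queryKey(), d);
    }
    System.out.println("domains swept: " + uniqueByKey.size()); // 2, not 3
  }
}
{code}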




[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-08-11 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175836#comment-17175836
 ] 

Michael Gibney commented on SOLR-13807:
---

After SOLR-13132 was merged to master, it was a bit of a challenge to reconcile it 
with the complementary "term facet cache" work (this issue). I've taken an initial 
stab at this and pushed to [PR 
#1357|https://github.com/apache/lucene-solr/pull/1357], and I think it's at the 
point where it's once again ready for consideration.

Below are some naive performance benchmarks, using [^SOLR-13807-benchmarks.tgz] 
(based on similar benchmarks for SOLR-13132).

{{filterCache}} is irrelevant for what's illustrated here (all count or sweep 
collection, single-shard and thus no refinement). I included hooks in the 
scripts to easily change the filterCache size and termFacetCache size for 
evaluation. For the purpose of {{relatedness}} evaluation, fgSet == base search 
result domain. All results discussed here are for single-valued string fields, 
but multi-valued string fields are also included in the benchmark attachment 
(results for multi-valued didn't differ substantially from those for 
single-valued).

There's a row for each docset domain recall percentage (the percentage of the \*:\* 
domain returned by the main query/fg), and a column for each field cardinality; 
cell values indicate latency (QTime) in ms against a single core with 3 million 
docs, no deletes; each value is the average of 10 repeated invocations of the 
relevant request (standard deviation isn't captured here, but was quite 
low, fwiw).
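
The attached scripts drive these requests from the shell; purely for illustration, a 
minimal SolrJ sketch of the same measurement loop (average QTime over 10 invocations) 
might look like the following. The URL, core name, and facet request body are 
placeholders here, not taken from the attachment:
{code:java}
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Illustrative only: average QTime of 10 repeated invocations of one facet request.
public class QTimeAverageSketch {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/benchcore").build()) {
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0);
      // Placeholder JSON facet body; the real benchmarks vary recall and cardinality.
      q.set("json.facet", "{cat:{type:terms,field:field_s,limit:10}}");

      long totalQTime = 0;
      int iterations = 10;
      for (int i = 0; i < iterations; i++) {
        totalQTime += client.query(q).getQTime();
      }
      System.out.println("avg QTime (ms): " + (totalQTime / (double) iterations));
    }
  }
}
{code}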

Below are results for current master (including SOLR-13132); no caches (the 
filterCache, if present, would be unused):
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, master
cdnlty: 10  100 1k  10k 100k    1m
.1% 0   0   0   0   0   4
1%  1   0   1   1   2   5
10% 7   7   8   8   10  16
20% 17  14  16  15  19  31
30% 22  19  23  20  24  42
40% 27  26  28  28  32  50
50% 33  32  35  32  38  59
99.99%  65  60  67  62  72  107

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, master
cdnlty: 10  100 1k  10k 100k    1m
.1% 179 174 183 190 192 225
1%  182 177 186 183 194 236
10% 193 191 196 197 226 256
20% 206 200 207 207 234 300
30% 216 210 217 216 239 316
40% 228 225 231 231 253 331
50% 239 234 241 240 266 347
99.99%  285 280 287 287 311 403
{code}

Below are results for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with 
_no_ termFacetCache configured (an apples-to-apples comparison, since there are 
changes in some of the hot facet code paths):
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, no_cache
cdnlty: 10  100 1k  10k 100k    1m
.1% 0   0   0   0   0   3
1%  1   1   1   1   1   6
10% 8   8   9   8   11  14
20% 16  15  16  15  20  32
30% 21  21  23  22  26  42
40% 28  27  31  28  34  53
50% 35  33  37  34  40  63
99.99%  68  64  71  66  74  108

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, no_cache
cdnlty: 10  100 1k  10k 100k    1m
.1% 96  80  89  97  96  129
1%  88  83  90  88  101 133
10% 99  97  103 102 122 162
20% 117 107 113 113 135 194
30% 120 117 123 122 144 211
40% 130 129 134 134 156 232
50% 143 140 147 144 169 249
99.99%  179 175 181 179 201 305
{code}

Below are results for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with 
{{solr.termFacetCacheSize=20}} configured:
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, cache size 20
cdnlty: 10  100 1k  10k 100k    1m
.1% 0   0   0   0   0   2
1%  0   0   0   0   1   10
10% 3   4   4   4   5   16
20% 8   7   8   7   9   20
30% 11  10  12  11  13  25
40% 13  13  15  15  15  28
50% 15  16  16  18  20  32
99.99%  29  30  30  29  32  45

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, cache size 20
cdnlty: 10  100 1k  10k 100k    1m
.1% 0   0   

[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-03-16 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060395#comment-17060395
 ] 

Michael Gibney commented on SOLR-13807:
---

Thanks Chris! This all makes sense. I've pushed a single commit(/patch) to [PR 
#751|https://github.com/apache/lucene-solr/pull/751] incorporating only the changes 
for SOLR-13132, and opened a new PR with a commit dedicated to the facet cache 
(associated with _this_ issue) 
([#1357|https://github.com/apache/lucene-solr/pull/1357]).

My apologies, I'm doing my best to manage this logically! I will indeed stick 
with GitHub (more for the "git" and less for the "hub" ... or something like 
that), esp. because it made it easier to make the facet cache patch/code 
available while clarifying the dependency of the code, as written, on some of the 
code reorganization introduced by the patch for SOLR-13132.




[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-03-13 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059139#comment-17059139
 ] 

Chris M. Hostetter commented on SOLR-13807:
---

Hey Michael, I've been offline all week, but hopefully I'll be able to start 
digging in and reviewing some of this more at some point next week.

As far as some of your process questions: I don't know if/how force-pushing 
affects things, but from a reviewing / "get more eyeballs on things" perspective I 
really think it would be cleaner to have 2 discrete PRs that we can iterate from, and 
link those 2 PRs from each of the 2 distinct Jiras, so we can keep the 
comments/discussion on each distinct, particularly since one or the other might 
attract more/less attention from folks who are more/less passionate about the 
new functionality and/or concerned about the internal change.

FWIW: I personally don't care about PRs (or any GitHub "value add" 
functionality for that matter) at all – to me they are just patch files I can 
find at special URLs by adding ".patch" to the end. I tell you this not to 
discourage you from using PRs (my anachronistic view on development shouldn't 
stop you from using the tools you're comfortable with, and other people in the 
community are – or at least claim to be – more likely to help review PRs than 
patches) but just to clarify that you certainly don't need to worry about 
trying to re-use the existing PR ... feel free to close it and open new ones 
for each of the distinct issues – the GitHub-to-Jira bridge should pick them up.

 




[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-03-12 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057947#comment-17057947
 ] 

Michael Gibney commented on SOLR-13807:
---

Still working on [PR #751|https://github.com/apache/lucene-solr/pull/751], I 
separated the earlier monolithic commit (and recent incremental adjustments) 
into two logical commits: the first introducing single-sweep collection of facet 
term counts across different DocSet domains, the second introducing a term facet 
count cache. (N.b. tests are passing for each commit, but validation is failing 
due to intentional nocommits.)

[~hossman], I have yet to expand on the test stub you introduced, but if you 
don't think it's premature to take a second look at this now that the two 
features have been separated into logical commits, I'd appreciate any feedback 
you have to offer. I was reluctant to force-push, and wasn't sure whether to 
open new PRs or work with the existing one; but I left the old (monolithic + 
test-stub-patch + iterative-adjustments) branch available 
[here|https://github.com/magibney/lucene-solr/tree/SOLR-13132-mingled-sweep-and-cache], 
and figured the new 2-commit push would clarify things and be a good jumping-off 
point for however we want to proceed (whether new PRs, etc...).

I know I had said I would make single-sweep collection dependent on the facet 
cache, but (as you can see) I went the opposite way. Functionality-wise, 
facet-cache-first would have made sense, but code/structure-wise, sweep-first 
was much cleaner and clearer.




[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-03-10 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056100#comment-17056100
 ] 

Michael Gibney commented on SOLR-13807:
---

Thanks for responding on these points, [~hossman]! Apologies for my delay in 
responding, but it's taken me a while to dig into actually addressing some of 
the issues uncovered by testing (just pushed to [PR 
#751|https://github.com/apache/lucene-solr/pull/751]). Before embarking on a 
potential major refactor of PR code that I believe is essentially sound, I 
first wanted to address the test failures in the existing PR and then see where 
we are with things.

The changes required were not large in terms of number of lines of code. Aside 
from some trivial bug fixes, the substantive issues addressed fell into three 
categories, broadly speaking:
 # UIF caching was an afterthought in the initial patch. I knew this at the 
time I opened the PR (and should have called it out more explicitly), but 
although I had roughed in some of the cache-entry-building logic as a POC, 
nothing was ever actually getting inserted into the cache \(!) and not all code 
branches were covered. It was fairly straightforward to bring UIF into line 
(and I re-enabled the UIF cases from your initial test).
 # Cache entry compatibility across different methods of facet processing. I 
had to clarify that term counts are only eligible for caching when 
{{prefix==null}} (or {{prefix.isEmpty()}}); a small sketch of this eligibility 
rule follows this list. (It would be possible to use no-prefix cached term 
counts to process prefixed facet requests, but I think it makes sense to leave 
that for later, if at all.) Aside from that, missing 
buckets are collected _inline_ and cached for {{FacetFieldProcessorByArrayDV}}, 
but are _not_ collected (nor cached) for {{FacetFieldProcessorByArrayUIF}} (or 
legacy {{DocValuesFacets}}) processing. In practice, it's unlikely that the 
same field would be processed both as UIF (no cached "missing" count) _and_ as 
DV (cached "missing" count), but the case did come up in testing, and I 
addressed it by detecting and re-processing with {{*ByArrayDV}}, and replacing 
the cache entry with the new one that includes "missing" count. The resulting 
"missing"-inclusive cache-entry is backward-compatible with (may be used by) 
{{*ByArrayUIF}} and legacy {{DocValuesFacets}} processing implementations. 
Incidentally, I wonder whether this "inline" collection of "missing" counts is 
something like what you had in mind with the comment "{{TODO: it would be more 
efficient to build up a missing DocSet if we need it here anyway.}}"?
 # Cache key compatibility across blockJoin domain changes. The extant "nested 
facet" implementation only passes the {{base}} DocSet domain down from parent 
to child. One of the things this PR had to do was to also track corresponding 
changes to the {{baseFilters}} – the queries used to generate the {{base}} 
DocSet domain – because these queries are required for use in facet cache keys. 
The initial PR punted on the question of blockJoin domain changes, and simply 
set {{baseFilters = null}}, with a comment in code: "{{unusual case; TODO: can 
we make a cache key for this base domain?}}". Well I meant "unusual _for me, at 
the moment_" :); I just had to put the effort into building proper 
({{baseFilter}} query) cache keys for these domain changes. In the process, I 
also realized that tracking {{baseFilters}} down the nested facet tree should 
probably address "{{TODO: somehow remove responsebuilder dependency}}" – I put 
a {{nocommit}} comment to that effect (and temporarily throw an 
{{AssertionError}} to highlight what I think can now be dead code following). I 
also found myself wondering how exclusion of ancestor tagged filters would 
affect descendent join/graph/blockjoin domain changes ... but that's a separate 
issue.
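
As referenced in point 2 above, here is a minimal sketch of the caching-eligibility 
and entry-compatibility rules as described there (hypothetical helper, not the PR's 
actual code):
{code:java}
// Hypothetical sketch of the rules described in point 2: counts are cached only for
// un-prefixed facet requests, and a cached entry that includes the "missing" bucket
// can satisfy processors that don't need it, while the reverse triggers re-processing.
final class CacheEligibilitySketch {

  /** Term counts are eligible for caching only when no prefix is in effect. */
  static boolean cacheable(String prefix) {
    return prefix == null || prefix.isEmpty();
  }

  /** An entry with the "missing" count is backward-compatible with requests that don't need it. */
  static boolean entrySatisfies(boolean entryHasMissing, boolean requestNeedsMissing) {
    return entryHasMissing || !requestNeedsMissing;
  }

  public static void main(String[] args) {
    System.out.println(cacheable(null));             // true
    System.out.println(cacheable("abc"));            // false
    System.out.println(entrySatisfies(false, true)); // false -> re-process with *ByArrayDV
  }
}
{code}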


[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-03-05 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052350#comment-17052350
 ] 

Chris M. Hostetter commented on SOLR-13807:
---

bq. my understanding of CacheHelper.getKey() is that the returned keys ... that 
the types of modifications you mention (deletes, in-place DV updates, etc.) 
should result in the creation of a new cache key. Is that not true?

I don't know ... it's not something I've looked into in depth; if so, then false 
alarm (but we should double-check, and ideally prove it w/a defensive white box 
test of the regenerator after doing some deletes/in-place updates).

bq. countCacheDf is defined wrt the main domain DocSet.size(), and only affects 
whether the termFacetCache is consulted for a given domain-request combination 
...

Oh, oh OH! ... OK, that explains so much about what I was seeing in the cache 
stats after various requests. For some reason I thought it controlled whether 
individual term=counts were being cached -- which reminds me: we need ref-guide 
updates in the PR : )

bq. ...As far as the temporarily tabled concerns about concurrent mutation...

Those concerns were largely related to my mistaken impression that different 
requests w/different {{countCacheDf}} params were causing the original segment-level 
cache values to be mutated in place (w/o doing a new "insert" back into 
the cache), because that's what I convinced myself was happening to explain the 
cache stats I was seeing and my vague (misguided) assumptions, from skimming the 
code, about how/why {{CacheState.PARTIALLY_CACHED}} existed.

Your point about doing a defensive copy of the segment level counts & atomic 
re-insert of the top level entry after updating the counts for the new segments 
makes perfect sense.


[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-03-05 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052328#comment-17052328
 ] 

Michael Gibney commented on SOLR-13807:
---

Regarding TermFacetCacheRegenerator, my understanding of CacheHelper.getKey() 
is that the returned keys should work the same way at the segment level as 
they do at the top level; notably, that the types of modifications you mention 
(deletes, in-place DV updates, etc.) should result in the creation of a new 
cache key. Is that not true?

{{countCacheDf}} is defined wrt the main domain DocSet.size(), and only affects 
whether the {{termFacetCache}} is consulted for a given domain-request 
combination. It should _not_ affect the cached values themselves, if that's 
your concern. As far as the temporarily tabled concerns about concurrent 
mutation, this was something I considered, and (I think) addressed 
[here|https://github.com/apache/lucene-solr/pull/751/files#diff-1b16fc96c8dde547ddde619e54a45c26R1158-R1161]:
{code:java}
if (segmentCache == null) {
  // no cache presence; initialize.
  cacheState = CacheState.NOT_CACHED;
  newSegmentCache = new HashMap<>(fcontext.searcher.getIndexReader().leaves().size() + 1);
} else if (segmentCache.containsKey(topLevelKey)) {
  topLevelEntry = segmentCache.get(topLevelKey);
  CachedCountSlotAcc acc = new CachedCountSlotAcc(fcontext, topLevelEntry.topLevelCounts);
  return new SweepCountAccStruct(qKey, docs, CacheState.CACHED, null, isBase, acc,
      new ReadOnlyCountSlotAccWrapper(fcontext, acc), acc);
} else {
  // defensive copy, since cache entries are shared across threads
  cacheState = CacheState.PARTIALLY_CACHED;
  newSegmentCache = new HashMap<>(fcontext.searcher.getIndexReader().leaves().size() + 1);
  newSegmentCache.putAll(segmentCache);
}
{code}
In that last {{else}} block, each domain-request combination that finds a 
partial cache entry (with some segments populated) creates and populates an 
entirely new, request-private top-level cache entry (initially sharing the 
immutable segment-level entries from the extant top-level entry). On completion 
of processing, this new top-level entry is placed atomically into the 
termFacetCache. I believe this should be robust; and if indeed robust, at worst 
you'd end up with concurrent requests each doing the work of creating 
equivalent top-level cache entries, the last of which would remain in the cache 
... which should be no worse than the status quo, where each request always 
does all the work of recalculating facet counts.
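
To make that copy-on-write pattern concrete outside the PR's types, here is a 
self-contained sketch (hypothetical names and heavily simplified types, not the PR's 
code) of the behavior described above: reuse shared per-segment counts, fill gaps 
privately, then atomically re-insert the completed top-level entry.
{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: a request never mutates a shared entry. It copies the
// per-segment counts it can reuse, fills in missing segments privately, then
// re-inserts the completed top-level entry atomically. Concurrent requests may
// duplicate work, but cannot corrupt each other's entries.
public class CopyOnWriteFacetCacheSketch {
  static final ConcurrentMap<String, Map<String, int[]>> cache = new ConcurrentHashMap<>();

  static Map<String, int[]> countsFor(String topLevelKey, List<String> segments) {
    Map<String, int[]> shared = cache.get(topLevelKey);
    Map<String, int[]> requestPrivate = new HashMap<>();
    if (shared != null) {
      requestPrivate.putAll(shared);          // reuse immutable per-segment counts
    }
    for (String segment : segments) {
      requestPrivate.computeIfAbsent(segment, CopyOnWriteFacetCacheSketch::accumulate);
    }
    cache.put(topLevelKey, requestPrivate);   // atomic top-level re-insert
    return requestPrivate;
  }

  static int[] accumulate(String segment) {
    // Stand-in for the real per-segment count sweep.
    return new int[] {segment.length()};
  }

  public static void main(String[] args) {
    Map<String, int[]> counts = countsFor("field:cat", List.of("seg1", "seg2"));
    System.out.println(Arrays.toString(counts.get("seg2")));
  }
}
{code}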


[jira] [Commented] (SOLR-13807) Caching for term facet counts

2020-03-04 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051700#comment-17051700
 ] 

Chris M. Hostetter commented on SOLR-13807:
---

linking SOLR-13807 to fix the config parsing so tests can optionally 
enable/disable the cache at run time




[jira] [Commented] (SOLR-13807) Caching for term facet counts

2019-10-01 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942117#comment-16942117
 ] 

Michael Gibney commented on SOLR-13807:
---

I initially proposed this idea, along with an implementation over 
{{SimpleFacets}}, as (very) tangentially related to 
[SOLR-8096|https://issues.apache.org/jira/browse/SOLR-8096?focusedCommentId=15960982#comment-15960982].
 As a natural consequence of working to address some performance issues with 
full-domain SKG/relatedness (SOLR-13132), I updated the initial facet cache 
implementation to be compatible with JSON facets (while maintaining 
cross-compatibility with {{SimpleFacets}}).

[PR #751|https://github.com/apache/lucene-solr/pull/751] (associated with 
SOLR-13132) incorporates a facet cache that I believe realizes all of the 
potential mentioned in the proposal/description above, including being 
NRT-friendly/segment-aware ... with the exception of point 5 (the PR does not 
leverage the facet cache for distributed refinement; the facet cache itself was 
a prerequisite for the SKG/relatedness work, but distributed refinement would 
have definitely been out of scope).

In retrospect I would have preferred to submit a separate PR for only the facet 
cache; I did not go that route only because the facet cache implementation 
grew organically out of (and was a prerequisite to) the work on SKG/relatedness. 
Would people be comfortable (at least initially) evaluating the facet cache 
implementation in the context of SOLR-13132? Whether or not I end up having to 
extract the facet cache work into a separate PR, I thought it would be worth 
opening this separate Jira issue for the facet cache, since its use could 
potentially be much more general (beyond SKG/relatedness). 
