[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193650#comment-17193650 ]

Michael Gibney commented on SOLR-13807:
---------------------------------------

As a side-effect of [PR #1357|https://github.com/apache/lucene-solr/pull/1357] on this issue, SOLR-9724 is also incidentally fixed.

> Caching for term facet counts
> -
>
>                 Key: SOLR-13807
>                 URL: https://issues.apache.org/jira/browse/SOLR-13807
>             Project: Solr
>          Issue Type: New Feature
>          Components: Facet Module
>    Affects Versions: master (9.0), 8.2
>            Reporter: Michael Gibney
>            Priority: Minor
>         Attachments: SOLR-13807-benchmarks.tgz, SOLR-13807__SOLR-13132_test_stub.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr does not have a facet count cache; so for _every_ request, term facets are recalculated for _every_ (facet) field, by iterating over _every_ field value for _every_ doc in the result domain, and incrementing the associated count.
> As a result, subsequent requests end up redoing a lot of the same work, including all associated object allocation, GC, etc. This situation could benefit from integrated caching.
> Because of the domain-based, serial/iterative nature of term facet calculation, latency is proportional to the size of the result domain. Consequently, one common/clear manifestation of this issue is high latency for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be observed on a top-level landing page that exposes facets. This type of "static" case is often mitigated by external (to Solr) caching, either with a caching layer between Solr and a front-end application, or within a front-end application, or even with a caching layer between the end user and a front-end application.
> But in addition to the overhead of handling this caching elsewhere in the stack (or, for a new user, even being aware of this as a potential issue to mitigate), any external caching mitigation is really only appropriate for relatively static cases like the "landing page" example described above. A Solr-internal facet count cache (analogous to the {{filterCache}}) would provide the following additional benefits:
> # ease of use/out-of-the-box configuration to address a common performance concern
> # compact (specifically caching count arrays, without the extra baggage that accompanies a naive external caching approach)
> # NRT-friendly (could be implemented to be segment-aware)
> # modular, capable of reusing the same cached values in conjunction with variant requests over the same result domain (this would support common use cases like paging, but also potentially more interesting direct uses of facets)
> # could be used for distributed refinement (i.e., if facet counts over a given domain are cached, a refinement request could simply look up the ordinal value for each enumerated term and directly grab the count out of the count array that was cached during the first phase of facet calculation)
> # composable (e.g., in aggregate functions that calculate values based on facet counts across different domains, like SKG/relatedness – see SOLR-13132)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
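[Editorial note: the description above proposes caching per-field count arrays keyed by the queries that define a result domain, analogous to the {{filterCache}}. The following is a minimal, hypothetical sketch of that idea; the class and method names ({{TermFacetCache}}, {{CacheKey}}, {{getOrCompute}}) are illustrative only and are not Solr's actual API.]

```java
import java.util.*;

// Hypothetical sketch of a term facet count cache: an LRU map from
// (field, canonical domain queries) to a per-ordinal count array.
// Names are illustrative, not the actual Solr implementation.
public class TermFacetCache {
    // Key combines the facet field with a canonical (sorted) representation
    // of the queries that define the result domain.
    static final class CacheKey {
        final String field;
        final List<String> domainQueries;
        CacheKey(String field, Collection<String> domainQueries) {
            this.field = field;
            List<String> sorted = new ArrayList<>(domainQueries);
            Collections.sort(sorted); // canonical order => stable equality
            this.domainQueries = sorted;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof CacheKey)) return false;
            CacheKey k = (CacheKey) o;
            return field.equals(k.field) && domainQueries.equals(k.domainQueries);
        }
        @Override public int hashCode() {
            return 31 * field.hashCode() + domainQueries.hashCode();
        }
    }

    private final int maxSize;
    private final LinkedHashMap<CacheKey, int[]> cache;

    TermFacetCache(int maxSize) {
        this.maxSize = maxSize;
        // access-order LinkedHashMap gives simple LRU eviction
        this.cache = new LinkedHashMap<CacheKey, int[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<CacheKey, int[]> e) {
                return size() > TermFacetCache.this.maxSize;
            }
        };
    }

    // Return cached counts for this (field, domain), computing them once on miss.
    synchronized int[] getOrCompute(CacheKey key, java.util.function.Supplier<int[]> compute) {
        return cache.computeIfAbsent(key, k -> compute.get());
    }
}
```

The "compact" benefit in the list above corresponds to the {{int[]}} payload here: only the raw count array is cached, with none of the response-shaped baggage an external HTTP-level cache would carry.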
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179234#comment-17179234 ]

Michael Gibney commented on SOLR-13807:
---------------------------------------

aba1d18797c99b46edf211ff26b989a4d23f625b adds the ability to configure the docset domain size threshold ({{countCacheDf}}, for consultation of the termFacetCache) separately for the base domain, fgSet, and bgSet. This is practically useful for the common case where one might want to dedicate the termFacetCache primarily to caching (large, static) bgSets, but accumulate facet counts for base and fgSet "normally" (i.e., without consulting the termFacetCache). One nice side-effect is that this allows the benchmarks above to demonstrate in a more isolated way the different potential benefits from the termFacetCache wrt sort-by-{{relatedness}}.

{code}
[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # no cache/cache maxSize=0
cdnlty:     10    100     1k    10k   100k     1m
.1%         86     85     90     90     93    130
1%          89     87     92     91     96    138
10%        128    122    125    125    145    188
20%        145    137    142    143    162    242
30%        164    153    160    166    180    266
40%        186    176    180    180    201    295
50%        206    188    193    197    216    326
99.99%     198    196    200    200    220    332

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # all caching disabled at query level (countCacheDf=-1)
cdnlty:     10    100     1k    10k   100k     1m
.1%         88     86     91     89     94    121
1%          89     87     91     92     96    131
10%        124    122    127    126    146    186
20%        142    139    144    144    166    241
30%        160    156    162    160    213    264
40%        181    180    183    183    206    301
50%        196    195    197    196    216    324
99.99%     199    195    202    204    223    331

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # bg-countCacheDf=0 (default; cache size->6 -- 1 per field)
cdnlty:     10    100     1k    10k   100k     1m
.1%          0      0      0      0      1      4
1%           1      1      1      2      5      8
10%         22     22     28     25     37     61
20%         42     42     43     44     59     99
30%         61     60     62     63     82    129
40%         80     80     84     84    101    165
50%         99     98    101    102    122    199
99.99%     122    118    123    124    142    231

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # base, fg, bg countCacheDf=0 (default; cache size->84)
cdnlty:     10    100     1k    10k   100k     1m
.1%          0      0      0      1      1      9
1%           0      0      0      1      4     12
10%          3      3      3      4     13     38
20%          6      6      6      7     14     47
30%          9      8      8      9     17     56
40%         11     10     11     10     21     64
50%         14     13     13     14     21     75
99.99%      25     24     24     25     34    108
{code}

The first two above demonstrate parity between disabling caching at the collection config level and at the query level. The third illustrates the result of consulting the termFacetCache only for the bgSet. The last consults the termFacetCache across the board (base, fg, bg). One takeaway here is that even a _very_ modestly-configured termFacetCache gives a ~10x performance boost for sort-by-skg against low-recall base domains, and a "worst"-case boost (for high-cardinality domains) of ~33% -- and because of the static nature of the bgSet, this boost can be expected to be consistent (unlike the hit-or-miss gains that are generally characteristic of caching).

The first two count-only benchmarks (below) demonstrate analogous parity for different ways of disabling caching, and the third shows a significant performance boost for count-only (sort-by-count) with default termFacetCache consultation.

{code}
[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s count # no cache/maxSize=0
cdnlty:     10    100     1k    10k   100k     1m
.1%          0      0      0      0      0      4
1%           1      1      1      1      1      5
10%          9      8      9      8     11     15
20%         16     15     17     18     20     32
30%         23     21     23     22     26     42
40%         29     28     31     29     34     54
50%         37     34     37     35     41     65
99.99%      70     63     70     67     74    112

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s count # caching disabled at query time (countCacheDf=-1)
cdnlty:     10    100     1k    10k   100k     1m
.1%          0      0      0      0      0      4
1%           1      1      1      1      2      6
10%          7      8      8      9     11     14
20%         16     15
{code}
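[Editorial note: as a reading aid for the per-domain {{countCacheDf}} semantics described in the comment above ({{-1}} disables consultation at query level; {{0}}, the default, always consults; and each of base, fgSet, and bgSet gets its own threshold), here is a tiny hedged sketch. The method names and the exact comparison direction are this editor's interpretation, not Solr's actual code.]

```java
// Hedged sketch of per-domain countCacheDf threshold logic (names and
// comparison semantics are an interpretation of the comment above, not
// Solr's implementation): the termFacetCache is consulted for a domain
// only when the threshold is non-negative and the domain's DocSet size
// meets it; countCacheDf=-1 disables consultation entirely.
public class CountCacheThreshold {
    static boolean consultCache(int countCacheDf, int docSetSize) {
        return countCacheDf >= 0 && docSetSize >= countCacheDf;
    }

    // The "dedicate the cache to the bgSet" configuration: base and fgSet
    // get a disabling threshold, bgSet gets the default.
    static boolean[] bgOnly(int baseSize, int fgSize, int bgSize) {
        return new boolean[] {
            consultCache(-1, baseSize), // base accumulated "normally"
            consultCache(-1, fgSize),   // fgSet accumulated "normally"
            consultCache(0, bgSize)     // bgSet served from the cache
        };
    }
}
```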
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177108#comment-17177108 ]

Michael Gibney commented on SOLR-13807:
---------------------------------------

Quick follow-up regarding skg on master vs. 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d with no cache configured: the apparent performance improvement is simultaneously a little misleading (depending on perspective) _and also_ likely to manifest in practice for many common use cases. This is a byproduct of the fact that in order to support caching, we need to retain "query" references (for the purpose of building query-based cache keys), not just domain DocSets. Once this was done, these query keys (and in some cases extra deduping based on actual docsets) are used to ensure that "sweep" count accumulation only accumulates once for each unique domain. So for the common/default case where fgQ=q, and fgSet (as determined by fgFilters=fgQ∩domainFilters) is the same as the base domain, sweep collection only happens over the base domain (deduped with fgSet) and the bgSet -- 2 domains instead of 3. I ran the benchmarks forcing fgSet to be (slightly) different from the base domain, and performance is comparable to master (as one would expect/hope).
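[Editorial note: the dedup behavior described in the comment above -- sweep accumulation runs once per unique domain, so when fgSet equals the base domain only 2 of the 3 domains are swept -- can be sketched as follows. The class and the string-set representation of a domain are illustrative only, not Solr's actual code.]

```java
import java.util.*;

// Illustrative sketch (not Solr's implementation): unique sweep domains
// are identified by their query-based keys, represented here as the
// canonical set of query strings that produced each domain. When fgSet's
// key equals the base domain's key, only the first occurrence is swept.
public class SweepDomainDedup {
    static List<Set<String>> uniqueDomains(List<Set<String>> domains) {
        List<Set<String>> unique = new ArrayList<>();
        Set<Set<String>> seen = new HashSet<>();
        for (Set<String> d : domains) {
            if (seen.add(d)) {
                unique.add(d); // counts accumulated once per unique domain
            }
        }
        return unique;
    }
}
```

In the common/default case the base domain and fgSet share a key, so a three-domain request (base, fgSet, bgSet) sweeps only two distinct domains.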
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175836#comment-17175836 ]

Michael Gibney commented on SOLR-13807:
---------------------------------------

After SOLR-13132 was merged to master, it was a bit of a challenge to reconcile it with the complementary "term facet cache" (this issue). I've taken an initial stab at this and pushed to [PR #1357|https://github.com/apache/lucene-solr/pull/1357], and I think it's at the point where it's once again ready for consideration.

Below are some naive performance benchmarks, using [^SOLR-13807-benchmarks.tgz] (based on similar benchmarks for SOLR-13132). {{filterCache}} is irrelevant for what's illustrated here (all count or sweep collection, single-shard thus no refinement). I included hooks in the included scripts to easily change the filterCache size and termFacetCache size for evaluation. For the purpose of {{relatedness}} evaluation, fgSet == base search result domain. All results discussed here are for single-valued string fields, but multi-valued string fields are also included in the benchmark attachment (results for multi-valued didn't differ substantially from those for single-valued).

There's a row for each docset domain recall percentage (percentage of the {{\*:\*}} domain returned by the main query/fg), and a column for each field cardinality; cell values indicate latency (QTime) in ms against a single core with 3 million docs, no deletes; each value is the average of 10 repeated invocations of the relevant request (standard deviation isn't captured here, but was quite low, fwiw).
Below are for current (including SOLR-13132) master; no caches (filterCache, if present, would be unused):

{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, master
cdnlty:     10    100     1k    10k   100k     1m
.1%          0      0      0      0      0      4
1%           1      0      1      1      2      5
10%          7      7      8      8     10     16
20%         17     14     16     15     19     31
30%         22     19     23     20     24     42
40%         27     26     28     28     32     50
50%         33     32     35     32     38     59
99.99%      65     60     67     62     72    107

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, master
cdnlty:     10    100     1k    10k   100k     1m
.1%        179    174    183    190    192    225
1%         182    177    186    183    194    236
10%        193    191    196    197    226    256
20%        206    200    207    207    234    300
30%        216    210    217    216    239    316
40%        228    225    231    231    253    331
50%        239    234    241    240    266    347
99.99%     285    280    287    287    311    403
{code}

Below are for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with _no_ termFacetCache configured (apples-to-apples, since there are changes in some of the hot facet code paths):

{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, no_cache
cdnlty:     10    100     1k    10k   100k     1m
.1%          0      0      0      0      0      3
1%           1      1      1      1      1      6
10%          8      8      9      8     11     14
20%         16     15     16     15     20     32
30%         21     21     23     22     26     42
40%         28     27     31     28     34     53
50%         35     33     37     34     40     63
99.99%      68     64     71     66     74    108

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, no_cache
cdnlty:     10    100     1k    10k   100k     1m
.1%         96     80     89     97     96    129
1%          88     83     90     88    101    133
10%         99     97    103    102    122    162
20%        117    107    113    113    135    194
30%        120    117    123    122    144    211
40%        130    129    134    134    156    232
50%        143    140    147    144    169    249
99.99%     179    175    181    179    201    305
{code}

Below are for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with {{solr.termFacetCacheSize=20}} configured:

{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, cache size 20
cdnlty:     10    100     1k    10k   100k     1m
.1%          0      0      0      0      0      2
1%           0      0      0      0      1     10
10%          3      4      4      4      5     16
20%          8      7      8      7      9     20
30%         11     10     12     11     13     25
40%         13     13     15     15     15     28
50%         15     16     16     18     20     32
99.99%      29     30     30     29     32     45

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, cache size 20
cdnlty:     10    100     1k    10k   100k     1m
.1%          0      0
{code}
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060395#comment-17060395 ]

Michael Gibney commented on SOLR-13807:
---------------------------------------

Thanks Chris! This all makes sense. I've pushed a single commit(/patch) to [PR #751|https://github.com/apache/lucene-solr/pull/751] incorporating only the changes for SOLR-13132, and opened a new PR with a commit dedicated to the facet cache (associated with _this_ issue) ([#1357|https://github.com/apache/lucene-solr/pull/1357]). My apologies, I'm doing my best to manage this logically! I will indeed stick with GitHub (more for the "git" and less for the "hub" ... or something like that), especially because it made it easier to make the facet cache patch/code available while clarifying the dependency of the code as written on some of the code reorganization introduced by the patch for SOLR-13132.
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059139#comment-17059139 ]

Chris M. Hostetter commented on SOLR-13807:
-------------------------------------------

Hey Michael, I've been offline all week, but hopefully i'll be able to start digging in and reviewing some of this more at some point next week.

As far as some of your process questions: I don't know if/how force-pushing affects things, but from a reviewing / "get more eyeballs" perspective i really think it would be cleaner to have 2 discrete PRs that we can iterate from, and link those 2 PRs from each of the 2 distinct jiras, so we can keep the comments/discussion on each distinct, particularly since one or the other might attract more/less attention from folks who are more/less passionate about the new functionality and/or concerned about the internal change.

FWIW: I personally don't care about PRs (or any github "value add" functionality for that matter) at all – to me they are just patch files i can find at special URLs by adding ".patch" to the end. i tell you this not to discourage you from using PRs (my anachronistic view on development shouldn't stop you from using the tools you're comfortable with, and other people in the community are – or at least claim to be – more likely to help review PRs than patches) but just to clarify that you certainly don't need to worry about trying to re-use the existing PR ... feel free to close it and open new ones for each of the distinct issues – the github-to-Jira bridge should pick them up.
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057947#comment-17057947 ]

Michael Gibney commented on SOLR-13807:
---------------------------------------

Still working at [PR #751|https://github.com/apache/lucene-solr/pull/751], I separated the earlier monolithic commit (and recent incremental adjustments) into two logical commits: the first introducing single-sweep collection of facet term counts across different DocSet domains, the second introducing a term facet count cache. (N.b. tests are passing for each commit, but validation is failing based on intentional nocommits.)

[~hossman], I have yet to expand on the test stub you introduced, but if you don't think it's premature to take a second look now that the two features have been separated into logical commits, I'd appreciate any feedback you have to offer. I was reluctant to force-push, and wasn't sure whether to open new PRs or work with the existing one; but I left the old branch (monolithic + test-stub-patch + iterative-adjustments) available [here|https://github.com/magibney/lucene-solr/tree/SOLR-13132-mingled-sweep-and-cache], and figured the new 2-commit push would clarify things and be a good jumping-off point for however we want to proceed (whether new PRs, etc.).

I know I had said I would make single-sweep collection dependent on the facet cache, but (as you can see) I went the opposite way. Functionality-wise, facet-cache-first would have made sense, but code/structure-wise, sweep-first was much cleaner and clearer.
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056100#comment-17056100 ]

Michael Gibney commented on SOLR-13807:
---------------------------------------

Thanks for responding on these points, [~hossman]! Apologies for my delay in responding, but it's taken me a while to dig into actually addressing some of the issues uncovered by testing (just pushed to [PR #751|https://github.com/apache/lucene-solr/pull/751]). Before embarking on a potential major refactor of PR code that I believe is essentially sound, I first wanted to address the test failures in the existing PR and then see where we are with things. The changes required were not large in terms of the number of lines of code. Aside from some trivial bug fixes, the substantive issues addressed fell into three categories, broadly speaking:
# UIF caching was an afterthought in the initial patch. I knew this at the time I opened the PR (and should have called it out more explicitly); although I had roughed in some of the cache-entry-building logic as a POC, nothing was ever actually getting inserted into the cache \(!) and not all code branches were covered. It was fairly straightforward to bring UIF into line (and I re-enabled the UIF cases from your initial test).
# Cache entry compatibility across different methods of facet processing. I had to clarify that term counts are only eligible for caching when {{prefix==null}} (or {{prefix.isEmpty()}}). (It would be possible to use no-prefix cached term counts to process prefixed facet requests, but I think it makes sense to leave that for later, if at all.) Aside from that, missing buckets are collected _inline_ and cached for {{FacetFieldProcessorByArrayDV}}, but are _not_ collected (nor cached) for {{FacetFieldProcessorByArrayUIF}} (or legacy {{DocValuesFacets}}) processing. In practice, it's unlikely that the same field would be processed both as UIF (no cached "missing" count) _and_ as DV (cached "missing" count), but the case did come up in testing, and I addressed it by detecting and re-processing with {{*ByArrayDV}}, replacing the cache entry with a new one that includes the "missing" count. The resulting "missing"-inclusive cache entry is backward-compatible with (may be used by) the {{*ByArrayUIF}} and legacy {{DocValuesFacets}} processing implementations. Incidentally, I wonder whether this "inline" collection of "missing" counts is something like what you had in mind with the comment "{{TODO: it would be more efficient to build up a missing DocSet if we need it here anyway.}}"?
# Cache key compatibility across blockJoin domain changes. The extant "nested facet" implementation only passes the {{base}} DocSet domain down from parent to child. One of the things this PR had to do was to also track corresponding changes to the {{baseFilters}} – the queries used to generate the {{base}} DocSet domain – because these queries are required for use in facet cache keys. The initial PR punted on the question of blockJoin domain changes, and simply set {{baseFilters = null}}, with a comment in code: "{{unusual case; TODO: can we make a cache key for this base domain?}}". Well, I meant "unusual _for me, at the moment_" :); I just had to put the effort into building proper ({{baseFilter}} query) cache keys for these domain changes. In the process, I also realized that tracking {{baseFilters}} down the nested facet tree should probably address "{{TODO: somehow remove responsebuilder dependency}}" – I put a {{nocommit}} comment to that effect (and temporarily throw an {{AssertionError}} to highlight what I think can now be dead code). I also found myself wondering how exclusion of ancestor tagged filters would affect descendant join/graph/blockJoin domain changes ... but that's a separate issue.
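[Editorial note: the comment above names two preconditions for caching term counts -- no facet prefix, and the availability of the {{baseFilters}} queries from which a cache key can be built. A hedged sketch of those checks follows; the class and method names are illustrative only, not the PR's actual code.]

```java
import java.util.*;

// Hedged sketch of the two constraints described above (names illustrative,
// not the actual PR code): counts are cacheable only for un-prefixed
// requests, and only when the queries that produced the base DocSet domain
// (baseFilters) are available to build a key from -- in the punted
// blockJoin case baseFilters was null, so no key could be built.
public class FacetCacheEligibility {
    static boolean cacheable(String prefix, List<String> baseFilters) {
        boolean noPrefix = prefix == null || prefix.isEmpty();
        return noPrefix && baseFilters != null;
    }

    // Build a canonical key from the field plus sorted baseFilter queries,
    // or empty if this request is not eligible for caching.
    static Optional<List<String>> cacheKey(String field, String prefix, List<String> baseFilters) {
        if (!cacheable(prefix, baseFilters)) return Optional.empty();
        List<String> key = new ArrayList<>();
        key.add(field);
        List<String> sorted = new ArrayList<>(baseFilters);
        Collections.sort(sorted); // canonical order for stable equality
        key.addAll(sorted);
        return Optional.of(key);
    }
}
```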
> Caching for term facet counts > - > > Key: SOLR-13807 > URL: https://issues.apache.org/jira/browse/SOLR-13807 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Affects Versions: master (9.0), 8.2 >Reporter: Michael Gibney >Priority: Minor > Attachments: SOLR-13807__SOLR-13132_test_stub.patch > > > Solr does not have a facet count cache; so for _every_ request, term facets > are recalculated for _every_ (facet) field, by iterating over _every_ field > value for _every_ doc in the result domain, and incrementing the associated > count. > As a result, subsequent requests end up redoing a lot of the same work, > including all associated object allocation, GC, etc. This situation could > benefit from integrated caching. > Because of the domain-based, serial/iterative nature of term facet > calculation, latency is proportional to the size of the result domain. > Consequently, one common/clear manifestation of this issue is
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052350#comment-17052350 ] Chris M. Hostetter commented on SOLR-13807: --- bq. my understanding of CacheHelper.getKey() is that the returned keys ... that the types of modifications you mention (deletes, in-place DV updates, etc.) should result in the creation of a new cache key. Is that not true? I don't know ... it's not something i've looked into in depth; if so, then false alarm (but we should double check, and ideally prove it w/a defensive white box test of the regenerator after doing some deletes/in-place updates) bq. countCacheDf is defined wrt the main domain DocSet.size(), and only affects whether the termFacetCache is consulted for a given domain-request combination ... Oh, oh OH ! ... ok that explains so much about what i was seeing in cache stats after various requests. For some reason I thought it controlled whether individual term=counts were being cached -- which reminds me: we need ref-guide updates in the PR : ) bq. ...As far as the temporarily tabled concerns about concurrent mutation... Those concerns were largely related to my mistaken impression that different requests w/different {{countCacheDf}} params were causing the original segment-level cache values to be mutated in place (w/o doing a new "insert" back into the cache), because that's what i convinced myself was happening to explain the cache stats i was seeing and my vague (misguided) assumptions about how/why {{CacheState.PARTIALLY_CACHED}} existed from skimming the code. Your point about doing a defensive copy of the segment-level counts & atomic re-insert of the top-level entry after updating the counts for the new segments makes perfect sense. 
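The {{countCacheDf}} behavior clarified in the exchange above — the parameter only gates whether the cache is *consulted* for a domain of a given size, and never changes the cached values — might be sketched as follows. The exact threshold comparison and the disable sentinel here are assumptions for illustration, not the actual implementation:

```java
// Hypothetical sketch of the countCacheDf gate: the parameter only decides
// whether termFacetCache is consulted for a domain of a given size.
// Threshold semantics assumed here: <= 0 disables caching; otherwise only
// domains at least this large are cached.
final class CountCacheGate {
    static boolean consultCache(int domainSize, int countCacheDf) {
        return countCacheDf > 0 && domainSize >= countCacheDf;
    }
}
```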
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17052328#comment-17052328 ] Michael Gibney commented on SOLR-13807: --- Regarding TermFacetCacheRegenerator, my understanding of CacheHelper.getKey() is that the returned keys should work the same way at the segment level that they do at the top level; notably, that the types of modifications you mention (deletes, in-place DV updates, etc.) should result in the creation of a new cache key. Is that not true? {{countCacheDf}} is defined wrt the main domain DocSet.size(), and only affects whether the {{termFacetCache}} is consulted for a given domain-request combination. It should _not_ affect the cached values themselves, if that's your concern. As far as the temporarily tabled concerns about concurrent mutation, this was something I considered, and (I think) addressed [here|https://github.com/apache/lucene-solr/pull/751/files#diff-1b16fc96c8dde547ddde619e54a45c26R1158-R1161]:
{code:java}
if (segmentCache == null) {
  // no cache presence; initialize.
  cacheState = CacheState.NOT_CACHED;
  newSegmentCache = new HashMap<>(fcontext.searcher.getIndexReader().leaves().size() + 1);
} else if (segmentCache.containsKey(topLevelKey)) {
  topLevelEntry = segmentCache.get(topLevelKey);
  CachedCountSlotAcc acc = new CachedCountSlotAcc(fcontext, topLevelEntry.topLevelCounts);
  return new SweepCountAccStruct(qKey, docs, CacheState.CACHED, null, isBase, acc,
      new ReadOnlyCountSlotAccWrapper(fcontext, acc), acc);
} else {
  // defensive copy, since cache entries are shared across threads
  cacheState = CacheState.PARTIALLY_CACHED;
  newSegmentCache = new HashMap<>(fcontext.searcher.getIndexReader().leaves().size() + 1);
  newSegmentCache.putAll(segmentCache);
}
{code}
In that last {{else}} block, each domain-request combination that finds a partial cache entry (with some segments populated) creates and populates an entirely new, request-private top-level cache entry (initially sharing the immutable segment-level entries from the extant top-level entry). On completion of processing, this new top-level entry is placed atomically into the termFacetCache. I believe this should be robust; and if indeed robust, at worst you'd end up with concurrent requests each doing the work of creating equivalent top-level cache entries, the last of which would remain in the cache ... which should be no worse than the status quo, where each request always does all the work of recalculating facet counts. 
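The defensive-copy / atomic re-insert pattern described above can be shown in miniature. Types are simplified and names hypothetical; plain {{int[]}} values stand in for the real immutable segment-level count entries:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: a request that finds a partial top-level entry works on a
// request-private copy (sharing the immutable per-segment values), then
// atomically publishes the completed entry back into the cache.
final class CopyOnWriteFacetCache {
    private final Map<String, Map<String, int[]>> cache = new ConcurrentHashMap<>();

    /** Begin a request: return a request-private copy of the top-level entry. */
    Map<String, int[]> beginRequest(String topLevelKey, int expectedSegments) {
        Map<String, int[]> priv = new HashMap<>(expectedSegments + 1);
        Map<String, int[]> shared = cache.get(topLevelKey);
        if (shared != null) {
            // Defensive copy: share the immutable segment-level values, but
            // never mutate the published map in place.
            priv.putAll(shared);
        }
        return priv;
    }

    /** Complete a request: atomically publish the new top-level entry. */
    void completeRequest(String topLevelKey, Map<String, int[]> priv) {
        // Concurrent requests may race here, but the loser just overwrites
        // with an equivalent entry -- no worse than recomputing counts
        // from scratch on every request.
        cache.put(topLevelKey, priv);
    }
}
```

Because published maps are never mutated after insertion, readers never observe a partially updated entry; at worst, concurrent requests duplicate work building equivalent entries.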
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051700#comment-17051700 ] Chris M. Hostetter commented on SOLR-13807: --- linking SOLR-13807 to fix the config parsing so tests can optionally enable/disable the cache at run time
[jira] [Commented] (SOLR-13807) Caching for term facet counts
[ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16942117#comment-16942117 ] Michael Gibney commented on SOLR-13807: --- I initially proposed this idea, along with an implementation over {{SimpleFacets}}, as (very) tangentially related to [SOLR-8096|https://issues.apache.org/jira/browse/SOLR-8096?focusedCommentId=15960982#comment-15960982]. As a natural consequence of working to address some performance issues with full-domain SKG/relatedness (SOLR-13132), I updated the initial facet cache implementation to be compatible with JSON facets (while maintaining cross-compatibility with {{SimpleFacets}}). [PR #751|https://github.com/apache/lucene-solr/pull/751] (associated with SOLR-13132) incorporates a facet cache that I believe realizes all of the potential mentioned in the proposal/description above, including being NRT-friendly/segment-aware ... with the exception of point 5 (the PR does not leverage the facet cache for distributed refinement; the facet cache itself was a prerequisite for the SKG/relatedness work, but distributed refinement would have definitely been out of scope). In retrospect I would have preferred to submit a separate PR for only the facet cache; I did not go that route, but only because the facet cache implementation grew organically out of (and was prerequisite to) the work on SKG/relatedness. Would people be comfortable (at least initially) evaluating the facet cache implementation in the context of SOLR-13132? Whether or not I end up having to extract the facet cache work into a separate PR, I thought it would be worth opening this separate Jira issue for the facet cache, since its use could potentially be much more general (beyond SKG/relatedness).