[jira] [Commented] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible
[ https://issues.apache.org/jira/browse/LUCENE-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527666#comment-17527666 ]

Bruno Roustant commented on LUCENE-8836:
----------------------------------------

Thanks [~jpountz] for this simplified improvement! I agree to mark this issue as resolved.

> Optimize DocValues TermsDict to continue scanning from the last position when possible
> --------------------------------------------------------------------------------------
>
> Key: LUCENE-8836
> URL: https://issues.apache.org/jira/browse/LUCENE-8836
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
> Labels: docValues, optimization
> Fix For: 9.2
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Lucene80DocValuesProducer.TermsDict is used to look up either a term or a term ordinal.
> Currently it does not have the optimization that FSTEnum has: the ability to continue a
> sequential scan from where the last lookup was in the IndexInput.
> For sparse lookups (when searching only a few terms or ordinals) this is not an issue.
> But for multiple lookups in a row, this optimization can save re-scanning all the terms
> from the block start (since they are delta encoded).
> This patch proposes the optimization.
> To estimate the gain, we ran 3 Lucene tests while counting the seeks and the term reads
> in the IndexInput, with and without the optimization:
> - TestLucene70DocValuesFormat: the optimization saves 24% of seeks and 15% of term reads.
> - TestDocValuesQueries: the optimization adds 0.7% seeks and 0.003% term reads.
> - TestDocValuesRewriteMethod.testRegexps: the optimization saves 71% of seeks and 82% of term reads.
> In some cases, when scanning many terms in lexicographical order, the optimization saves
> a lot. In other cases, when only looking up a few sparse terms, it brings no improvement,
> but does not penalize either. It seems worthwhile to always have it.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
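The scan-resume idea above can be sketched with a simplified in-memory model. This is a hypothetical class, not Lucene's actual TermsDict: the real Lucene80DocValuesProducer.TermsDict decodes delta-coded terms from an IndexInput, whereas a List stands in for the decoded term stream here.

```java
import java.util.*;

// Hypothetical, simplified model of a block-based terms dictionary.
// Terms are stored in sorted order and grouped in blocks of (1 << blockShift)
// entries; in Lucene each block is delta encoded, so reaching an ordinal inside
// a block normally means scanning from the block start.
class ScanningTermsDict {
    private final List<String> terms;   // stands in for the decoded term stream
    private final int blockShift;       // block size = 1 << blockShift
    private long currentOrd = -1;       // position left by the last lookup

    ScanningTermsDict(List<String> sortedTerms, int blockShift) {
        this.terms = sortedTerms;
        this.blockShift = blockShift;
    }

    String seekExact(long targetOrd) {
        long targetBlock = targetOrd >>> blockShift;
        // The optimization: if the target is in the same block, at or after the
        // current position, keep scanning forward instead of seeking back to
        // the block start and re-reading all the preceding terms.
        if (currentOrd < 0 || targetOrd < currentOrd
                || targetBlock != (currentOrd >>> blockShift)) {
            currentOrd = targetBlock << blockShift;   // seek to block start
        }
        while (currentOrd < targetOrd) {
            currentOrd++;   // scan: decode the next delta-coded term
        }
        return terms.get((int) currentOrd);
    }
}
```

With blockShift = 2 (blocks of 4 terms), successive lookups of ordinals 2 then 3 cost one block seek plus one forward step, instead of two re-scans from the block start.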
[jira] [Resolved] (LUCENE-10225) Improve IntroSelector with 3-way partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant resolved LUCENE-10225.
-------------------------------------
Fix Version/s: 9.1
Resolution: Fixed

Thanks Dawid and Adrien!

> Improve IntroSelector with 3-way partitioning
> ---------------------------------------------
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
> Fix For: 9.1
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), so
> there is no real gain in falling back to a slower medians-of-medians technique
> as an introspective protection (like the existing implementation does). Instead
> we can simply shuffle the sub-range if we exceed the recursive max depth (this
> does not change the speed, as this intro-protective mechanism almost never
> happens - maybe only in adversarial cases).

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10225) Improve IntroSelector with 3-way partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445045#comment-17445045 ]

Bruno Roustant commented on LUCENE-10225:
-----------------------------------------

I'm a bit confused. I put this change in the 9.1 section in CHANGES, but should it actually be in the 9.0 section, [~jpountz]?

> Improve IntroSelector with 3-way partitioning
> ---------------------------------------------
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
>
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), so
> there is no real gain in falling back to a slower medians-of-medians technique
> as an introspective protection (like the existing implementation does). Instead
> we can simply shuffle the sub-range if we exceed the recursive max depth (this
> does not change the speed, as this intro-protective mechanism almost never
> happens - maybe only in adversarial cases).

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10225) Improve IntroSelector with 3-way partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443195#comment-17443195 ]

Bruno Roustant edited comment on LUCENE-10225 at 11/13/21, 10:25 PM:
---------------------------------------------------------------------

With the addition of the adaptive top-k algorithm triggered when k is close to from or last, we get significant perf gain (x3-x4) when (k - from <= 20) or (last - k <= 20):

{noformat}
RANDOM
IntroSelector  ... 485 317 371 449 299 454 333 345 190 311
IntroSelector2 ...  85  86  86  97  90  87  87  86  87  91
RANDOM_LOW_CARDINALITY
IntroSelector  ... 475 483 239 445 234 236 455 460 365 508
IntroSelector2 ... 113 114 115 114 112 113 114 114 113 114
RANDOM_MEDIUM_CARDINALITY
IntroSelector  ... 529 287 448 374 347 392 356 412 182 408
IntroSelector2 ...  88  85  86  87  87  87  86  86  88  85
ASCENDING
IntroSelector  ... 157 150 148 146 146 147 146 147 146 146
IntroSelector2 ...  80  80  82  82  82  83  81  82  81  82
DESCENDING
IntroSelector  ... 250 250 246 246 254 251 255 247 247 249
IntroSelector2 ...  84  84  84  85  83  82  80  81  81  82
STRICTLY_DESCENDING
IntroSelector  ... 239 237 239 238 238 250 240 239 240 240
IntroSelector2 ...  82  83  82  81  82  80  82  85  84  83
ASCENDING_SEQUENCES
IntroSelector  ... 209 217 145 125 185 142 131 172 240 146
IntroSelector2 ...  85  86  86  84  84  82  83  83  83  81
MOSTLY_ASCENDING
IntroSelector  ... 151 155 150 150 154 147 150 154 154 154
IntroSelector2 ...  82  82  81  81  81  81  82  82 104  85
{noformat}

was (Author: broustant):
With the addition of the adaptive top-k algorithm triggered when k is close to from or last, we get significant perf gain when (k - from <= 20) or (last - k <= 20):

{noformat}
RANDOM
IntroSelector  ... 485 317 371 449 299 454 333 345 190 311
IntroSelector2 ...  85  86  86  97  90  87  87  86  87  91
RANDOM_LOW_CARDINALITY
IntroSelector  ... 475 483 239 445 234 236 455 460 365 508
IntroSelector2 ... 113 114 115 114 112 113 114 114 113 114
RANDOM_MEDIUM_CARDINALITY
IntroSelector  ... 529 287 448 374 347 392 356 412 182 408
IntroSelector2 ...  88  85  86  87  87  87  86  86  88  85
ASCENDING
IntroSelector  ... 157 150 148 146 146 147 146 147 146 146
IntroSelector2 ...  80  80  82  82  82  83  81  82  81  82
DESCENDING
IntroSelector  ... 250 250 246 246 254 251 255 247 247 249
IntroSelector2 ...  84  84  84  85  83  82  80  81  81  82
STRICTLY_DESCENDING
IntroSelector  ... 239 237 239 238 238 250 240 239 240 240
IntroSelector2 ...  82  83  82  81  82  80  82  85  84  83
ASCENDING_SEQUENCES
IntroSelector  ... 209 217 145 125 185 142 131 172 240 146
IntroSelector2 ...  85  86  86  84  84  82  83  83  83  81
MOSTLY_ASCENDING
IntroSelector  ... 151 155 150 150 154 147 150 154 154 154
IntroSelector2 ...  82  82  81  81  81  81  82  82 104  85
{noformat}

> Improve IntroSelector with 3-way partitioning
> ---------------------------------------------
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), so
> there is no real gain in falling back to a slower medians-of-medians technique
> as an introspective protection (like the existing implementation does). Instead
> we can simply shuffle the sub-range if we exceed the recursive max depth (this
> does not change the speed, as this intro-protective mechanism almost never
> happens - maybe only in adversarial cases).

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
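The adaptive edge case described above can be sketched as follows. This is a hedged illustration under assumed names, not Lucene's actual code: when k sits within a small margin of from, a partial selection pass over that margin is cheaper than a full quickselect over the range.

```java
// Hedged sketch: when k - from is small (e.g. <= 20), place a[from..k] by a
// partial selection sort. Afterwards a[k] holds the k-th smallest element of
// a[from..to), and every element before index k is smaller or equal.
class EdgeSelect {
    static void selectNearFrom(int[] a, int from, int to, int k) {
        for (int i = from; i <= k; i++) {
            int min = i;
            for (int j = i + 1; j < to; j++) {
                if (a[j] < a[min]) min = j;   // smallest remaining value
            }
            int tmp = a[i]; a[i] = a[min]; a[min] = tmp;  // move it into place
        }
    }
}
```

A symmetric pass scanning down from `to` would cover the (last - k <= 20) case.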
[jira] [Commented] (LUCENE-10225) Improve IntroSelector with 3-way partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443195#comment-17443195 ]

Bruno Roustant commented on LUCENE-10225:
-----------------------------------------

With the addition of the adaptive top-k algorithm triggered when k is close to from or last, we get significant perf gain when (k - from <= 20) or (last - k <= 20):

{noformat}
RANDOM
IntroSelector  ... 485 317 371 449 299 454 333 345 190 311
IntroSelector2 ...  85  86  86  97  90  87  87  86  87  91
RANDOM_LOW_CARDINALITY
IntroSelector  ... 475 483 239 445 234 236 455 460 365 508
IntroSelector2 ... 113 114 115 114 112 113 114 114 113 114
RANDOM_MEDIUM_CARDINALITY
IntroSelector  ... 529 287 448 374 347 392 356 412 182 408
IntroSelector2 ...  88  85  86  87  87  87  86  86  88  85
ASCENDING
IntroSelector  ... 157 150 148 146 146 147 146 147 146 146
IntroSelector2 ...  80  80  82  82  82  83  81  82  81  82
DESCENDING
IntroSelector  ... 250 250 246 246 254 251 255 247 247 249
IntroSelector2 ...  84  84  84  85  83  82  80  81  81  82
STRICTLY_DESCENDING
IntroSelector  ... 239 237 239 238 238 250 240 239 240 240
IntroSelector2 ...  82  83  82  81  82  80  82  85  84  83
ASCENDING_SEQUENCES
IntroSelector  ... 209 217 145 125 185 142 131 172 240 146
IntroSelector2 ...  85  86  86  84  84  82  83  83  83  81
MOSTLY_ASCENDING
IntroSelector  ... 151 155 150 150 154 147 150 154 154 154
IntroSelector2 ...  82  82  81  81  81  81  82  82 104  85
{noformat}

> Improve IntroSelector with 3-way partitioning
> ---------------------------------------------
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), so
> there is no real gain in falling back to a slower medians-of-medians technique
> as an introspective protection (like the existing implementation does). Instead
> we can simply shuffle the sub-range if we exceed the recursive max depth (this
> does not change the speed, as this intro-protective mechanism almost never
> happens - maybe only in adversarial cases).

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] (LUCENE-10225) Improve IntroSelector with 3-way partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10225 ]

Bruno Roustant deleted comment on LUCENE-10225:
------------------------------------------------

was (Author: broustant): PR: https://github.com/apache/lucene/pull/430

> Improve IntroSelector with 3-way partitioning
> ---------------------------------------------
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), so
> there is no real gain in falling back to a slower medians-of-medians technique
> as an introspective protection (like the existing implementation does). Instead
> we can simply shuffle the sub-range if we exceed the recursive max depth (this
> does not change the speed, as this intro-protective mechanism almost never
> happens - maybe only in adversarial cases).

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10225) Improve IntroSelector with 3-way partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440556#comment-17440556 ]

Bruno Roustant commented on LUCENE-10225:
-----------------------------------------

PR: https://github.com/apache/lucene/pull/430

> Improve IntroSelector with 3-way partitioning
> ---------------------------------------------
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
>
> The same way we improved IntroSorter, we can improve IntroSelector with
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), so
> there is no real gain in falling back to a slower medians-of-medians technique
> as an introspective protection (like the existing implementation does). Instead
> we can simply shuffle the sub-range if we exceed the recursive max depth (this
> does not change the speed, as this intro-protective mechanism almost never
> happens - maybe only in adversarial cases).

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10225) Improve IntroSelector with 3-way partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440475#comment-17440475 ]

Bruno Roustant commented on LUCENE-10225:
-----------------------------------------

{code:java}
RANDOM
IntroSelector  ... 397 422 407 448 395 394 417 390 394 406
IntroSelector2 ... 361 357 372 368 366 363 362 361 361 371
RANDOM_LOW_CARDINALITY
IntroSelector  ... 661 724 743 707 830 696 734 711 745 779
IntroSelector2 ... 360 393 355 369 373 359 367 369 344 360
RANDOM_MEDIUM_CARDINALITY
IntroSelector  ... 423 394 465 387 398 418 396 393 415 399
IntroSelector2 ... 378 364 371 373 361 360 368 362 369 365
ASCENDING
IntroSelector  ... 127 127 128 126 130 127 126 131 127 130
IntroSelector2 ... 137 134 135 133 134 134 135 134 135 137
DESCENDING
IntroSelector  ... 209 221 205 212 203 205 210 208 206 211
IntroSelector2 ... 185 184 183 183 184 186 187 184 184 183
STRICTLY_DESCENDING
IntroSelector  ... 213 210 208 214 207 213 213 206 209 206
IntroSelector2 ... 184 188 183 186 184 183 182 188 184 184
ASCENDING_SEQUENCES
IntroSelector  ... 308 320 493 460 287 374 372 423 391 380
IntroSelector2 ... 201 216 234 218 218 211 211 203 207 236
MOSTLY_ASCENDING
IntroSelector  ... 256 298 264 351 255 448 196 353 256 397
IntroSelector2 ... 134 138 139 137 140 138 135 137 138 140
{code}

> Improve IntroSelector with 3-way partitioning
> ---------------------------------------------
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
>
> The same way we improved IntroSorter, we can improve IntroSelector with
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), so
> there is no real gain in falling back to a slower medians-of-medians technique
> as an introspective protection (like the existing implementation does). Instead
> we can simply shuffle the sub-range if we exceed the recursive max depth (this
> does not change the speed, as this intro-protective mechanism almost never
> happens - maybe only in adversarial cases).

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10225) Improve IntroSelector with 3-way partitioning
Bruno Roustant created LUCENE-10225:
---------------------------------------

Summary: Improve IntroSelector with 3-way partitioning
Key: LUCENE-10225
URL: https://issues.apache.org/jira/browse/LUCENE-10225
Project: Lucene - Core
Issue Type: Improvement
Reporter: Bruno Roustant

The same way we improved IntroSorter, we can improve IntroSelector with Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.

With this new approach, we always use medians-of-medians (Tukey's Ninther), so there is no real gain in falling back to a slower medians-of-medians technique as an introspective protection (like the existing implementation does). Instead we can simply shuffle the sub-range if we exceed the recursive max depth (this does not change the speed, as this intro-protective mechanism almost never happens - maybe only in adversarial cases).

--
This message was sent by Atlassian Jira (v8.20.1#820001)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
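A minimal sketch of the selection scheme described above. This is illustrative only: it uses a middle-element pivot and a plain Dutch-flag 3-way partition where the actual patch uses Tukey's ninther and Bentley-McIlroy partitioning; what it does show is the equal-run early exit, the tail-recursion loop, and the shuffle fallback once the depth budget is exceeded.

```java
import java.util.Random;

// Hedged sketch of quickselect with 3-way partitioning; not Lucene's code.
// Precondition: from <= k < to.
class Select3Way {
    static int select(int[] a, int from, int to, int k, int maxDepth) {
        Random rnd = new Random(42);
        while (true) {
            if (to - from <= 1) return a[k];
            if (maxDepth-- < 0) {
                // Introspective protection: shuffle the sub-range instead of
                // falling back to a slower median-of-medians pass.
                for (int i = to - 1; i > from; i--) {
                    int j = from + rnd.nextInt(i - from + 1);
                    int t = a[i]; a[i] = a[j]; a[j] = t;
                }
            }
            int pivot = a[from + (to - from) / 2];  // ninther in the real patch
            int lt = from, gt = to - 1, i = from;
            while (i <= gt) {                        // 3-way partition
                if (a[i] < pivot) { int t = a[lt]; a[lt++] = a[i]; a[i++] = t; }
                else if (a[i] > pivot) { int t = a[gt]; a[gt--] = a[i]; a[i] = t; }
                else i++;
            }
            if (k < lt) to = lt;                     // continue on the left part
            else if (k > gt) from = gt + 1;          // continue on the right part
            else return a[k];                        // k lands in the equal run
        }
    }
}
```

The equal run [lt, gt] is what makes low-cardinality inputs fast: as soon as k falls inside it, the selection terminates without further partitioning.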
[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437876#comment-17437876 ] Bruno Roustant commented on LUCENE-10196: - Thanks for sharing the benchmark Adrien. I'm not sure about IntroSelector, but I suppose yes. This is an exciting challenge :). I'll find some time to investigate. > Improve IntroSorter with 3-ways partitioning > > > Key: LUCENE-10196 > URL: https://issues.apache.org/jira/browse/LUCENE-10196 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Major > Fix For: 8.11 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > I added a SorterBenchmark to evaluate the performance of the various Sorter > implementations depending on the strategies defined in BaseSortTestCase > (random, random-low-cardinality, ascending, descending, etc). > By changing the implementation of the IntroSorter to use a 3-ways > partitioning, we can gain a significant performance improvement when sorting > low-cardinality lists, and with additional changes we can also improve the > performance for all the strategies. > Proposed changes: > - Sort small ranges with insertion sort (instead of binary sort). > - Select the quick sort pivot with medians. > - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm. > - Replace the tail recursion by a loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437247#comment-17437247 ] Bruno Roustant commented on LUCENE-10196: - Oh! Thank you [~jpountz]! I was wrong in the CHANGES file indeed. > Improve IntroSorter with 3-ways partitioning > > > Key: LUCENE-10196 > URL: https://issues.apache.org/jira/browse/LUCENE-10196 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Major > Fix For: 8.11 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > I added a SorterBenchmark to evaluate the performance of the various Sorter > implementations depending on the strategies defined in BaseSortTestCase > (random, random-low-cardinality, ascending, descending, etc). > By changing the implementation of the IntroSorter to use a 3-ways > partitioning, we can gain a significant performance improvement when sorting > low-cardinality lists, and with additional changes we can also improve the > performance for all the strategies. > Proposed changes: > - Sort small ranges with insertion sort (instead of binary sort). > - Select the quick sort pivot with medians. > - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm. > - Replace the tail recursion by a loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved LUCENE-10196. - Fix Version/s: 8.11 Resolution: Fixed Thanks reviewers! > Improve IntroSorter with 3-ways partitioning > > > Key: LUCENE-10196 > URL: https://issues.apache.org/jira/browse/LUCENE-10196 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Major > Fix For: 8.11 > > Time Spent: 2h > Remaining Estimate: 0h > > I added a SorterBenchmark to evaluate the performance of the various Sorter > implementations depending on the strategies defined in BaseSortTestCase > (random, random-low-cardinality, ascending, descending, etc). > By changing the implementation of the IntroSorter to use a 3-ways > partitioning, we can gain a significant performance improvement when sorting > low-cardinality lists, and with additional changes we can also improve the > performance for all the strategies. > Proposed changes: > - Sort small ranges with insertion sort (instead of binary sort). > - Select the quick sort pivot with medians. > - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm. > - Replace the tail recursion by a loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-10196: Description: I added a SorterBenchmark to evaluate the performance of the various Sorter implementations depending on the strategies defined in BaseSortTestCase (random, random-low-cardinality, ascending, descending, etc). By changing the implementation of the IntroSorter to use a 3-ways partitioning, we can gain a significant performance improvement when sorting low-cardinality lists, and with additional changes we can also improve the performance for all the strategies. Proposed changes: - Sort small ranges with insertion sort (instead of binary sort). - Select the quick sort pivot with medians. - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm. - Replace the tail recursion by a loop. was: I added a SorterBenchmark to evaluate the performance of the various Sorter implementations depending on the strategies defined in BaseSortTestCase (random, random-low-cardinality, ascending, descending, etc). By changing the implementation of the IntroSorter to use a 3-ways partitioning, we can gain a significant performance improvement when sorting low-cardinality lists, and we additional changes we can also improve the performance for all the strategies. Proposed changes: - Sort small ranges with insertion sort (instead of binary sort). - Select the quick sort pivot with medians. - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm. - Replace the tail recursion by a loop. 
> Improve IntroSorter with 3-ways partitioning > > > Key: LUCENE-10196 > URL: https://issues.apache.org/jira/browse/LUCENE-10196 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > I added a SorterBenchmark to evaluate the performance of the various Sorter > implementations depending on the strategies defined in BaseSortTestCase > (random, random-low-cardinality, ascending, descending, etc). > By changing the implementation of the IntroSorter to use a 3-ways > partitioning, we can gain a significant performance improvement when sorting > low-cardinality lists, and with additional changes we can also improve the > performance for all the strategies. > Proposed changes: > - Sort small ranges with insertion sort (instead of binary sort). > - Select the quick sort pivot with medians. > - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm. > - Replace the tail recursion by a loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Issue Comment Deleted] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-10196: Comment: was deleted (was: https://github.com/apache/lucene/pull/404) > Improve IntroSorter with 3-ways partitioning > > > Key: LUCENE-10196 > URL: https://issues.apache.org/jira/browse/LUCENE-10196 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > I added a SorterBenchmark to evaluate the performance of the various Sorter > implementations depending on the strategies defined in BaseSortTestCase > (random, random-low-cardinality, ascending, descending, etc). > By changing the implementation of the IntroSorter to use a 3-ways > partitioning, we can gain a significant performance improvement when sorting > low-cardinality lists, and we additional changes we can also improve the > performance for all the strategies. > Proposed changes: > - Sort small ranges with insertion sort (instead of binary sort). > - Select the quick sort pivot with medians. > - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm. > - Replace the tail recursion by a loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432685#comment-17432685 ] Bruno Roustant commented on LUCENE-10196: - https://github.com/apache/lucene/pull/404 > Improve IntroSorter with 3-ways partitioning > > > Key: LUCENE-10196 > URL: https://issues.apache.org/jira/browse/LUCENE-10196 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Major > > I added a SorterBenchmark to evaluate the performance of the various Sorter > implementations depending on the strategies defined in BaseSortTestCase > (random, random-low-cardinality, ascending, descending, etc). > By changing the implementation of the IntroSorter to use a 3-ways > partitioning, we can gain a significant performance improvement when sorting > low-cardinality lists, and we additional changes we can also improve the > performance for all the strategies. > Proposed changes: > - Sort small ranges with insertion sort (instead of binary sort). > - Select the quick sort pivot with medians. > - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm. > - Replace the tail recursion by a loop. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
[ https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432670#comment-17432670 ]

Bruno Roustant commented on LUCENE-10196:
-----------------------------------------

Benchmark run comparing the sorter implementations for different shapes of data. Each value is the time to complete a run. Values in the same column can be compared because the same data is provided as input to the various sorters; each column uses different random data. IntroSorter2 is the new modified version of IntroSorter. The benchmark runs with 20K comparable entries. Comparing IntroSorter and IntroSorter2, we mainly observe a 5x speedup for random low cardinality, and an improvement for every data shape.

{noformat}
RANDOM
IntroSorter  ...  445  445  459  458  453  458  460  465  452  451
IntroSorter2 ...  394  403  401  400  401  398  400  404  396  399
TimSorter    ... 1196 1203 1197 1206 1193 1195 1193 1204 1230 1207
MergeSorter  ... 1462 1470 1482 1466 1463 1475 1478 1475 1466 1471
RANDOM_LOW_CARDINALITY
IntroSorter  ...  505  513  504  490  527  499  510  512  509  525
IntroSorter2 ...   89   84   88   88   86   90   88   89   92   88
TimSorter    ...  511  513  508  508  513  512  521  511  524  516
MergeSorter  ...  725  725  725  762  737  723  727  724  736  733
RANDOM_MEDIUM_CARDINALITY
IntroSorter  ...  463  451  452  455  448  452  451  459  458  455
IntroSorter2 ...  370  381  378  373  375  376  376  372  370  370
TimSorter    ... 1192 1212 1197 1196 1201 1202 1196 1199 1196 1204
MergeSorter  ... 1493 1465 1470 1480 1460 1470 1483 1464 1506 1500
ASCENDING
IntroSorter  ...  211  205  215  213  207  206  208  214  212  211
IntroSorter2 ...  191  188  190  193  194  191  188  187  185  188
TimSorter    ...   17   18   18   18   19   19   18   17   18   19
MergeSorter  ...   73   71   72   75   72   73   73   77   72   71
DESCENDING
IntroSorter  ...  225  253  229  220  225  231  222  217  220  223
IntroSorter2 ...  220  213  214  220  205  211  208  210  208  212
TimSorter    ...  545  576  562  553  543  551  552  552  548  546
MergeSorter  ...  537  537  548  538  537  536  533  530  533  545
STRICTLY_DESCENDING
IntroSorter  ...  215  214  221  224  218  227  213  212  212  211
IntroSorter2 ...  202  203  202  205  202  204  206  204  202  204
TimSorter    ...   22   21   21   22   22   21   21   22   22   23
MergeSorter  ...  534  531  533  527  531  529  526  527  528  527
ASCENDING_SEQUENCES
IntroSorter  ...  370  366  361  376  367  369  358  364  379  376
IntroSorter2 ...  234  235  231  236  234  245  242  239  239  236
TimSorter    ...  686  679  745  673  694  685  673  719  682  685
MergeSorter  ...  894  911  932  907  923  907  918  917  920  916
MOSTLY_ASCENDING
IntroSorter  ...  284  282  282  283  285  282  278  284  283  287
IntroSorter2 ...  254  252  249  250  255  255  249  250  252  251
TimSorter    ...  233  233  230  235  232  234  234  233  228  238
MergeSorter  ...  399  385  390  398  398  392  380  377  377  387
{noformat}

> Improve IntroSorter with 3-ways partitioning
> --------------------------------------------
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Bruno Roustant
> Priority: Major
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter
> implementations depending on the strategies defined in BaseSortTestCase
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways partitioning,
> we can gain a significant performance improvement when sorting low-cardinality
> lists, and with additional changes we can also improve the performance for all
> the strategies.
> Proposed changes:
> - Sort small ranges with insertion sort (instead of binary sort).
> - Select the quick sort pivot with medians.
> - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
> - Replace the tail recursion by a loop.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning
Bruno Roustant created LUCENE-10196:
---------------------------------------

Summary: Improve IntroSorter with 3-ways partitioning
Key: LUCENE-10196
URL: https://issues.apache.org/jira/browse/LUCENE-10196
Project: Lucene - Core
Issue Type: Improvement
Reporter: Bruno Roustant

I added a SorterBenchmark to evaluate the performance of the various Sorter implementations depending on the strategies defined in BaseSortTestCase (random, random-low-cardinality, ascending, descending, etc).

By changing the implementation of the IntroSorter to use a 3-ways partitioning, we can gain a significant performance improvement when sorting low-cardinality lists, and with additional changes we can also improve the performance for all the strategies.

Proposed changes:
- Sort small ranges with insertion sort (instead of binary sort).
- Select the quick sort pivot with medians.
- Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
- Replace the tail recursion by a loop.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
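The four proposed changes fit together roughly as below. This is a hedged skeleton, not Lucene's implementation: the threshold value is assumed, a plain median-of-three replaces the median selection, and a simple Dutch-flag partition stands in for Bentley-McIlroy; the structure (small-range insertion sort, pivot by medians, 3-way partition, tail recursion turned into a loop) is what the sketch demonstrates.

```java
// Hedged skeleton of the proposed sort loop; names and the cutoff value are
// illustrative, not Lucene's actual ones.
class Sort3Way {
    static final int INSERTION_SORT_THRESHOLD = 16;  // assumed cutoff

    static void sort(int[] a, int from, int to) {
        while (to - from > INSERTION_SORT_THRESHOLD) {
            // Pivot by medians (median of three here; the patch uses more).
            int pivot = median(a[from], a[(from + to) >>> 1], a[to - 1]);
            int lt = from, gt = to - 1, i = from;
            while (i <= gt) {                         // 3-way partition
                if (a[i] < pivot) swap(a, lt++, i++);
                else if (a[i] > pivot) swap(a, i, gt--);
                else i++;
            }
            sort(a, from, lt);                        // recurse on the left part
            from = gt + 1;                            // tail recursion as a loop
        }
        // Small ranges: insertion sort instead of binary sort.
        for (int i = from + 1; i < to; i++) {
            for (int j = i; j > from && a[j - 1] > a[j]; j--) swap(a, j, j - 1);
        }
    }

    static int median(int x, int y, int z) {
        return Math.max(Math.min(x, y), Math.min(Math.max(x, y), z));
    }

    static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```

The equal-to-pivot run [lt, gt] is excluded from both remaining sub-ranges, which is where the low-cardinality gain comes from: duplicates are placed once and never re-partitioned.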
[jira] [Resolved] (LUCENE-10097) Replace TreeMap use by HashMap when unnecessary
[ https://issues.apache.org/jira/browse/LUCENE-10097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved LUCENE-10097. - Resolution: Won't Do > Replace TreeMap use by HashMap when unnecessary > --- > > Key: LUCENE-10097 > URL: https://issues.apache.org/jira/browse/LUCENE-10097 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > Time Spent: 4h 10m > Remaining Estimate: 0h > > There are a couple of places where TreeMap is used although it could easily > be replaced by a HashMap with potentially a single sort. Sometimes it would > bring perf improvement (e.g. when TreeMap.entrySet() is called), other times > it's more for consistency to use a simpler HashMap if there is no strong need > for a TreeMap. > I saw other places where we have TODOs to see whether we can replace the > TreeMap, but when it is more complex, I'll prefer to open separate Jira > issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10097) Replace TreeMap use by HashMap when unnecessary
[ https://issues.apache.org/jira/browse/LUCENE-10097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423780#comment-17423780 ]

Bruno Roustant commented on LUCENE-10097:
-

Based on remarks in the PRs, I'm closing this Jira issue, as HashMap + List is too complicated and TreeMap was a deliberate choice for keeping the memory usage low. Maybe I'll come back later with another proposal, as I think most of the time we use TreeMap whereas the intent is to build an immutable sorted map, which could be more memory efficient.

> Replace TreeMap use by HashMap when unnecessary
> ---
>
>                 Key: LUCENE-10097
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10097
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Bruno Roustant
>            Assignee: Bruno Roustant
>            Priority: Major
>          Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> There are a couple of places where TreeMap is used although it could easily
> be replaced by a HashMap with potentially a single sort. Sometimes it would
> bring perf improvement (e.g. when TreeMap.entrySet() is called), other times
> it's more for consistency to use a simpler HashMap if there is no strong need
> for a TreeMap.
> I saw other places where we have TODOs to see whether we can replace the
> TreeMap, but when it is more complex, I'll prefer to open separate Jira
> issues.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10097) Replace TreeMap use by HashMap when unnecessary
Bruno Roustant created LUCENE-10097: --- Summary: Replace TreeMap use by HashMap when unnecessary Key: LUCENE-10097 URL: https://issues.apache.org/jira/browse/LUCENE-10097 Project: Lucene - Core Issue Type: Improvement Reporter: Bruno Roustant Assignee: Bruno Roustant There are a couple of places where TreeMap is used although it could easily be replaced by a HashMap with potentially a single sort. Sometimes it would bring perf improvement (e.g. when TreeMap.entrySet() is called), other times it's more for consistency to use a simpler HashMap if there is no strong need for a TreeMap. I saw other places where we have TODOs to see whether we can replace the TreeMap, but when it is more complex, I'll prefer to open separate Jira issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
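The pattern the issue proposes can be sketched with plain JDK collections: accumulate into a HashMap with O(1) puts, then sort the entries once at the point where an ordered view is actually needed, instead of paying TreeMap's O(log n) per put. The class name and example data below are hypothetical, purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of "HashMap with potentially a single sort" replacing a TreeMap:
// inserts stay O(1); the ordering cost is paid once, only when reading.
class SortOnce {

  // Copies the map's entries and sorts them once by key.
  static List<Map.Entry<String, Integer>> sortedEntries(Map<String, Integer> m) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>(m.entrySet());
    entries.sort(Map.Entry.comparingByKey());
    return entries;
  }
}
```

The trade-off noted in the discussion: this uses a temporary copy of the entries at read time, whereas a TreeMap keeps a single always-sorted structure, which can be preferable when memory matters more than insert speed.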
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358418#comment-17358418 ] Bruno Roustant commented on LUCENE-9983: [~zhai7631] do you have some stats about the numbers of NFA / DFA states manipulated? > Stop sorting determinize powersets unnecessarily > > > Key: LUCENE-9983 > URL: https://issues.apache.org/jira/browse/LUCENE-9983 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 4h 20m > Remaining Estimate: 0h > > Spinoff from LUCENE-9981. > Today, our {{Operations.determinize}} implementation builds powersets of all > subsets of NFA states that "belong" in the same determinized state, using > [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction]. > To hold each powerset, we use a malleable {{SortedIntSet}} and periodically > freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep > these growing maps of int key, int value sorted by key, e.g. upgrading to a > {{TreeMap}} once the map is large enough (> 30 entries). > But I think sorting is entirely unnecessary here! Really all we need is the > ability to add/delete keys from the map, and hashCode / equals (by key only – > ignoring value!), and to freeze the map (a small optimization that we could > skip initially). We only use these maps to lookup in the (growing) > determinized automaton whether this powerset has already been seen. > Maybe we could simply poach the {{IntIntScatterMap}} implementation from > [HPPC|https://github.com/carrotsearch/hppc]? And then change its > {{hashCode}}/{{equals }}to only use keys (not values). > This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily
[ https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355600#comment-17355600 ]

Bruno Roustant commented on LUCENE-9983:
How many states are manipulated? If the states are numbered from 0 to N, and we keep most of the states during the computation, or N is not too high, then should we use an array instead of a map, where array[state] is the "reference count"? We wouldn't have to sort the set of states for the equality check, because they would already be in array order (skipping states with a 0 reference count).

> Stop sorting determinize powersets unnecessarily
>
>                 Key: LUCENE-9983
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9983
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Priority: Major
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all
> subsets of NFA states that "belong" in the same determinized state, using
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically
> freeze it to a {{FrozenIntSet}}, also sorted. We pay a high price to keep
> these growing maps of int key, int value sorted by key, e.g. upgrading to a
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here! Really all we need is the
> ability to add/delete keys from the map, and hashCode / equals (by key only –
> ignoring value!), and to freeze the map (a small optimization that we could
> skip initially). We only use these maps to lookup in the (growing)
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from
> [HPPC|https://github.com/carrotsearch/hppc]? And then change its
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) > regexps we saw on LUCENE-9981. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
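The array-based idea from the comment above can be sketched as follows, assuming NFA states are numbered 0..N. The class and method names are hypothetical (this is not Lucene code): reference counts live in an int array indexed by state number, so the set of live states is always produced in ascending state order without any sorting, and two powersets can be compared by their live-state arrays alone.

```java
import java.util.stream.IntStream;

// Sketch: track per-state reference counts in an array indexed by state number.
// The "powerset" is the set of states with a non-zero count; iterating the
// array yields it already sorted, so no TreeMap / sorted set is needed.
class StateCounts {
  final int[] refCount;

  StateCounts(int numStates) {
    refCount = new int[numStates];
  }

  void incr(int state) { refCount[state]++; }

  void decr(int state) { refCount[state]--; }

  // The "frozen" powerset: states with a non-zero count, naturally in order.
  int[] liveStates() {
    return IntStream.range(0, refCount.length)
        .filter(s -> refCount[s] > 0)
        .toArray();
  }
}
```

As the comment notes, this only pays off if N is not too high or most states stay live during the computation; for sparse, very large state spaces a hash map keyed by state is the better fit.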
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354275#comment-17354275 ]

Bruno Roustant commented on LUCENE-9379:
_RE AES-XTS vs AES-CTR:_ In the case of Lucene, we produce read-only files per index segment. And if we have a new random IV per file, we don't repeat the same (AES encrypted) blocks. So we are in a safe read-only-once case where AES-XTS and AES-CTR have the same strength [1][2]. Given that CTR is simpler, I chose it for this patch.
[1] https://crypto.stackexchange.com/questions/64556/aes-xts-vs-aes-ctr-for-write-once-storage
[2] https://crypto.stackexchange.com/questions/14628/why-do-we-use-xts-over-ctr-for-disk-encryption

> Directory based approach for index encryption
> -
>
>                 Key: LUCENE-9379
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9379
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Bruno Roustant
>            Assignee: Bruno Roustant
>            Priority: Major
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> +Important+: This Lucene Directory wrapper approach is to be considered only
> if an OS level encryption is not possible. OS level encryption better fits
> Lucene usage of OS cache, and thus is more performant.
> But there are some use-cases where OS level encryption is not possible. This
> Jira issue was created to address those.
>
> The goal is to provide optional encryption of the index, with a scope limited
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as
> possible.
> Determine how callers provide encryption keys. They must not be stored on
> disk.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
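The scheme described in the comment above (AES-CTR with a fresh random IV per read-only file) can be sketched with the plain JDK crypto API. This illustrates only the primitive, not the actual patch's Directory wrapper; the class and method names are hypothetical.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Sketch of per-file AES-CTR: each file gets its own random IV, so identical
// plaintext blocks in different files produce different ciphertext, which is
// what makes the write-once/read-only case safe with CTR.
class CtrPerFile {

  // A fresh 16-byte IV (the initial CTR counter block), generated per file.
  static byte[] newIv() {
    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);
    return iv;
  }

  // Encrypts or decrypts (CTR is symmetric) with AES/CTR and the given IV.
  static byte[] crypt(int mode, byte[] key, byte[] iv, byte[] data) throws Exception {
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher.doFinal(data);
  }
}
```

Note the critical constraint CTR imposes: an IV must never be reused with the same key for different data, which is exactly why the read-only, new-IV-per-file usage described in the comment is the safe case.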
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303494#comment-17303494 ] Bruno Roustant commented on LUCENE-9663: Ok, I backported to 8.x branch, and I updated CHANGES.txt in main to move to 8.9.0 section. > Adding compression to terms dict from SortedSet/Sorted DocValues > > > Key: LUCENE-9663 > URL: https://issues.apache.org/jira/browse/LUCENE-9663 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Jaison.Bi >Priority: Trivial > Fix For: 8.9 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > Elasticsearch keyword field uses SortedSet DocValues. In our applications, > “keyword” is the most frequently used field type. > LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do > better by replacing prefix-compression with LZ4. In one of our application, > the dvd files were ~41% smaller with this change(from 1.95 GB to 1.15 GB). > I've done simple tests based on the real application data, comparing the > write/merge time cost, and the on-disk *.dvd file size(after merge into 1 > segment). > || ||Before||After|| > |Write time cost(ms)|591972|618200| > |Merge time cost(ms)|270661|294663| > |*.dvd file size(GB)|1.95|1.15| > This feature is only for the high-cardinality fields. > I'm doing the benchmark test based on luceneutil. Will attach the report and > patch after the test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-9663: --- Fix Version/s: (was: main (9.0)) 8.9 > Adding compression to terms dict from SortedSet/Sorted DocValues > > > Key: LUCENE-9663 > URL: https://issues.apache.org/jira/browse/LUCENE-9663 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Jaison.Bi >Priority: Trivial > Fix For: 8.9 > > Time Spent: 11h 10m > Remaining Estimate: 0h > > Elasticsearch keyword field uses SortedSet DocValues. In our applications, > “keyword” is the most frequently used field type. > LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do > better by replacing prefix-compression with LZ4. In one of our application, > the dvd files were ~41% smaller with this change(from 1.95 GB to 1.15 GB). > I've done simple tests based on the real application data, comparing the > write/merge time cost, and the on-disk *.dvd file size(after merge into 1 > segment). > || ||Before||After|| > |Write time cost(ms)|591972|618200| > |Merge time cost(ms)|270661|294663| > |*.dvd file size(GB)|1.95|1.15| > This feature is only for the high-cardinality fields. > I'm doing the benchmark test based on luceneutil. Will attach the report and > patch after the test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9796) fix SortedDocValues to no longer extend BinaryDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301817#comment-17301817 ] Bruno Roustant commented on LUCENE-9796: Ok, I'll try to update Solr to use TermOrdValComparator, reusing the same issue. > fix SortedDocValues to no longer extend BinaryDocValues > --- > > Key: LUCENE-9796 > URL: https://issues.apache.org/jira/browse/LUCENE-9796 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Robert Muir >Priority: Major > Fix For: main (9.0) > > Attachments: LUCENE-9796.patch, LUCENE-9796_prototype.patch > > > SortedDocValues give ordinals and a way to derefence ordinal as a byte[] > But currently they *extend* BinaryDocValues, which allows directly calling > {{binaryValue()}}. > This allows them to act as a "slow" BinaryDocValues, but it is a performance > trap, especially now that terms bytes may be block-compressed (LUCENE-9663). > I think this should be detangled to prevent performance traps like > LUCENE-9795: SortedDocValues shouldn't have the trappy inherited > {{binaryValue()}} method that implicitly derefs the ord for the doc, then the > term bytes for the ord. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9796) fix SortedDocValues to no longer extend BinaryDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301800#comment-17301800 ]

Bruno Roustant commented on LUCENE-9796:
Solr main branch build broke after this change. I worked on SOLR-15261 to fix that. We'll have this kind of issue until Solr uses a Lucene snapshot.
While fixing it, I noticed that maybe o.a.l.search.FieldComparator$TermValComparator.getLeafComparator() needs to be adapted to create not only BinaryDocValues but also SortedDocValues?

> fix SortedDocValues to no longer extend BinaryDocValues
> ---
>
>                 Key: LUCENE-9796
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9796
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>             Fix For: main (9.0)
>
>         Attachments: LUCENE-9796.patch, LUCENE-9796_prototype.patch
>
> SortedDocValues give ordinals and a way to derefence ordinal as a byte[]
> But currently they *extend* BinaryDocValues, which allows directly calling
> {{binaryValue()}}.
> This allows them to act as a "slow" BinaryDocValues, but it is a performance
> trap, especially now that terms bytes may be block-compressed (LUCENE-9663).
> I think this should be detangled to prevent performance traps like
> LUCENE-9795: SortedDocValues shouldn't have the trappy inherited
> {{binaryValue()}} method that implicitly derefs the ord for the doc, then the
> term bytes for the ord.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15170) Elevation file in data dir not working in Solr Cloud
[ https://issues.apache.org/jira/browse/SOLR-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298918#comment-17298918 ] Bruno Roustant commented on SOLR-15170: --- PR for the fix added. I plan to backport the fix in branch 8.9 (not in 7.7.2 where the issue was detected). > Elevation file in data dir not working in Solr Cloud > > > Key: SOLR-15170 > URL: https://issues.apache.org/jira/browse/SOLR-15170 > Project: Solr > Issue Type: Bug >Affects Versions: 7.7.2 >Reporter: Monica Marrero >Assignee: Bruno Roustant >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > When using elevation, it is not possible to store the _elevate.xml_ file in > the data folder instead of in the configuration folder in Solr Cloud. It is > only possible in standalone mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality
[ https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295358#comment-17295358 ] Bruno Roustant commented on SOLR-15038: --- I reverted this specific line in both master and branch_8x. > Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to > elevation functionality > > > Key: SOLR-15038 > URL: https://issues.apache.org/jira/browse/SOLR-15038 > Project: Solr > Issue Type: Improvement > Components: query >Reporter: Tobias Kässmann >Priority: Minor > Fix For: 8.9 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > We've worked a lot with Query Elevation component in the last time and we > were missing two features: > * Elevate only documents that are part of the search result > * In combination with collapsing: Only show the representative if the > elevated documents does have the same collapse field value. > Because of this, we've added these two feature toggles > _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality
[ https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295203#comment-17295203 ] Bruno Roustant edited comment on SOLR-15038 at 3/4/21, 10:37 AM: - Ouch, yes I'll revert that. I played with this permission but didn't intend to commit it. When running the tests I noticed many Solr tests have warning about being unable to create some test resources. {code:java} java.security.AccessControlException: access denied ("java.io.FilePermission" "lucene-solr/solr/core/build/resources/test/solr/userfiles" "write") at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:?] at java.security.AccessController.checkPermission(AccessController.java:897) ~[?:?] at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?] at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?] at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?] at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377) ~[?:?] at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?] at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?] at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?] at org.apache.solr.core.CoreContainer.(CoreContainer.java:383) [main/:?] at org.apache.solr.core.CoreContainer.(CoreContainer.java:344) [main/:?]{code} I noticed they disappeared when I changed the permission for write-access in solr-tests.policy. [~dweiss] do you know how to get rid of these (many) warnings? was (Author: broustant): Ouch, yes I'll revert that. I played with this permission but didn't intend to commit it. When running the tests I noticed many Solr tests have warning about being unable to create some test resources. 
{code:java} java.security.AccessControlException: access denied ("java.io.FilePermission" "lucene-solr/solr/core/build/resources/test/solr/userfiles" "write") at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:?] at java.security.AccessController.checkPermission(AccessController.java:897) ~[?:?] at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?] at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?] at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?] at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377) ~[?:?] at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?] at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?] at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?] at org.apache.solr.core.CoreContainer.(CoreContainer.java:383) [main/:?] at org.apache.solr.core.CoreContainer.(CoreContainer.java:344) [main/:?]{code} I noticed they disappeared when I changed the permission for write-access in solr-tests.policy. Do you know how to get rid of these (many) warnings? > Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to > elevation functionality > > > Key: SOLR-15038 > URL: https://issues.apache.org/jira/browse/SOLR-15038 > Project: Solr > Issue Type: Improvement > Components: query >Reporter: Tobias Kässmann >Priority: Minor > Fix For: 8.9 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > We've worked a lot with Query Elevation component in the last time and we > were missing two features: > * Elevate only documents that are part of the search result > * In combination with collapsing: Only show the representative if the > elevated documents does have the same collapse field value. 
> Because of this, we've added these two feature toggles > _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality
[ https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295203#comment-17295203 ]

Bruno Roustant commented on SOLR-15038:
---

Ouch, yes I'll revert that. I played with this permission but didn't intend to commit it.
When running the tests I noticed many Solr tests have warnings about being unable to create some test resources.
{code:java}
java.security.AccessControlException: access denied ("java.io.FilePermission" "lucene-solr/solr/core/build/resources/test/solr/userfiles" "write")
  at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:?]
  at java.security.AccessController.checkPermission(AccessController.java:897) ~[?:?]
  at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?]
  at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?]
  at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?]
  at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377) ~[?:?]
  at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?]
  at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?]
  at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?]
  at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:383) [main/:?]
  at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:344) [main/:?]{code}
I noticed they disappeared when I changed the permission for write-access in solr-tests.policy. Do you know how to get rid of these (many) warnings?
> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to > elevation functionality > > > Key: SOLR-15038 > URL: https://issues.apache.org/jira/browse/SOLR-15038 > Project: Solr > Issue Type: Improvement > Components: query >Reporter: Tobias Kässmann >Priority: Minor > Fix For: 8.9 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > We've worked a lot with Query Elevation component in the last time and we > were missing two features: > * Elevate only documents that are part of the search result > * In combination with collapsing: Only show the representative if the > elevated documents does have the same collapse field value. > Because of this, we've added these two feature toggles > _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._ > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9796) fix SortedDocValues to no longer extend BinaryDocValues
[ https://issues.apache.org/jira/browse/LUCENE-9796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292769#comment-17292769 ] Bruno Roustant commented on LUCENE-9796: +1 > fix SortedDocValues to no longer extend BinaryDocValues > --- > > Key: LUCENE-9796 > URL: https://issues.apache.org/jira/browse/LUCENE-9796 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Robert Muir >Priority: Major > Attachments: LUCENE-9796.patch, LUCENE-9796_prototype.patch > > > SortedDocValues give ordinals and a way to derefence ordinal as a byte[] > But currently they *extend* BinaryDocValues, which allows directly calling > {{binaryValue()}}. > This allows them to act as a "slow" BinaryDocValues, but it is a performance > trap, especially now that terms bytes may be block-compressed (LUCENE-9663). > I think this should be detangled to prevent performance traps like > LUCENE-9795: SortedDocValues shouldn't have the trappy inherited > {{binaryValue()}} method that implicitly derefs the ord for the doc, then the > term bytes for the ord. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292768#comment-17292768 ] Bruno Roustant commented on LUCENE-9815: +1 on LUCENE-9796 I'll close this PR and try to find some cycles to help on LUCENE-9796. > PerField formats can select the format based on FieldInfo > - > > Key: LUCENE-9815 > URL: https://issues.apache.org/jira/browse/LUCENE-9815 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Minor > Attachments: Screen_Shot_2021-02-28_at_16.08.05.png > > > PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the > format based on the field name. > If we improve them to also support the selection based on the FieldInfo, it > will be possible to select based on some FieldInfo attribute, DocValuesType, > etc. > +Example use-case:+ > It will be possible to adapt the compression mode of doc values fields > easily based on the DocValuesType. E.g. compressing sorted and not binary doc > values. > > User creates a new custom codec which provides a custom DocValuesFormat > > which extends PerFieldDocValuesFormat and implements the method > DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo). > This method provides either a standard Lucene80DocValuesFormat (no > compression) or another new custom DocValuesFormat extending > Lucene80DocValuesFormat with BEST_COMPRESSION mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved LUCENE-9815. Resolution: Won't Do > PerField formats can select the format based on FieldInfo > - > > Key: LUCENE-9815 > URL: https://issues.apache.org/jira/browse/LUCENE-9815 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Minor > Attachments: Screen_Shot_2021-02-28_at_16.08.05.png > > > PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the > format based on the field name. > If we improve them to also support the selection based on the FieldInfo, it > will be possible to select based on some FieldInfo attribute, DocValuesType, > etc. > +Example use-case:+ > It will be possible to adapt the compression mode of doc values fields > easily based on the DocValuesType. E.g. compressing sorted and not binary doc > values. > > User creates a new custom codec which provides a custom DocValuesFormat > > which extends PerFieldDocValuesFormat and implements the method > DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo). > This method provides either a standard Lucene80DocValuesFormat (no > compression) or another new custom DocValuesFormat extending > Lucene80DocValuesFormat with BEST_COMPRESSION mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (SOLR-15170) Elevation file in data dir not working in Solr Cloud
[ https://issues.apache.org/jira/browse/SOLR-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant reassigned SOLR-15170: - Assignee: Bruno Roustant > Elevation file in data dir not working in Solr Cloud > > > Key: SOLR-15170 > URL: https://issues.apache.org/jira/browse/SOLR-15170 > Project: Solr > Issue Type: Bug >Affects Versions: 7.7.2 >Reporter: Monica Marrero >Assignee: Bruno Roustant >Priority: Major > > When using elevation, it is not possible to store the _elevate.xml_ file in > the data folder instead of in the configuration folder in Solr Cloud. It is > only possible in standalone mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15170) Elevation file in data dir not working in Solr Cloud
[ https://issues.apache.org/jira/browse/SOLR-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292730#comment-17292730 ] Bruno Roustant commented on SOLR-15170: --- I'll have a look soon > Elevation file in data dir not working in Solr Cloud > > > Key: SOLR-15170 > URL: https://issues.apache.org/jira/browse/SOLR-15170 > Project: Solr > Issue Type: Bug >Affects Versions: 7.7.2 >Reporter: Monica Marrero >Priority: Major > > When using elevation, it is not possible to store the _elevate.xml_ file in > the data folder instead of in the configuration folder in Solr Cloud. It is > only possible in standalone mode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292518#comment-17292518 ] Bruno Roustant commented on LUCENE-9815: [~rcmuir] do you mean always compressing sorted doc values and having an on/off mode for binary doc values? Based on LUCENE-9378, binary doc values compression causes a big performance impact, so the current on/off compression mode covering all doc values is not very useful: users do not want to take the performance hit on binary doc values, so they don't enable compression for sorted sets either.
[jira] [Updated] (LUCENE-9815) PerField formats can select the format based on FieldInfo
[ https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-9815: --- Description: PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the format based on the field name. If we improve them to also support the selection based on the FieldInfo, it will be possible to select based on some FieldInfo attribute, DocValuesType, etc. +Example use-case:+ It will be possible to adapt the compression mode of doc values fields easily based on the DocValuesType. E.g. compressing sorted and not binary doc values. > User creates a new custom codec which provides a custom DocValuesFormat which > extends PerFieldDocValuesFormat and implements the method DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo). This method provides either a standard Lucene80DocValuesFormat (no compression) or another new custom DocValuesFormat extending Lucene80DocValuesFormat with BEST_COMPRESSION mode. was: PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the format based on the field name. If we improve them to also support the selection based on the FieldInfo, it will be possible to select based on some FieldInfo attribute, DocValuesType, etc. +Use-case example:+ It will be possible for example to adapt the compression mode of doc values fields easily based on the DocValuesType. E.g. compressing sorted and not binary doc values. > User creates a new custom codec which provides a custom DocValuesFormat which > extends PerFieldDocValuesFormat and implements the method DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo). This method provides either a standard Lucene80DocValuesFormat (no compression) or another new custom DocValuesFormat extending Lucene80DocValuesFormat with BEST_COMPRESSION mode. 
[jira] [Created] (LUCENE-9815) PerField formats can select the format based on FieldInfo
Bruno Roustant created LUCENE-9815: -- Summary: PerField formats can select the format based on FieldInfo Key: LUCENE-9815 URL: https://issues.apache.org/jira/browse/LUCENE-9815 Project: Lucene - Core Issue Type: Improvement Reporter: Bruno Roustant
[jira] [Resolved] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality
[ https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved SOLR-15038. --- Fix Version/s: 8.9 Resolution: Fixed Thanks [~kaessmann] > Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to > elevation functionality > > > Key: SOLR-15038 > URL: https://issues.apache.org/jira/browse/SOLR-15038 > Project: Solr > Issue Type: Improvement > Components: query >Reporter: Tobias Kässmann >Priority: Minor > Fix For: 8.9 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > We've worked a lot with the Query Elevation component recently and were > missing two features: > * Elevate only documents that are part of the search result > * In combination with collapsing: only show the representative if the > elevated documents have the same collapse field value. > Because of this, we've added these two feature toggles > _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative_.
[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality
[ https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285321#comment-17285321 ] Bruno Roustant commented on SOLR-15038: --- The PR sounds good to me. I'll merge soon.
[jira] [Updated] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-9663: --- Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~Jaison], this is merged in master 9.0. I don't plan to port it to branch 8.x unless I'm advised otherwise. > Adding compression to terms dict from SortedSet/Sorted DocValues > > > Key: LUCENE-9663 > URL: https://issues.apache.org/jira/browse/LUCENE-9663 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Jaison.Bi >Priority: Trivial > Fix For: master (9.0) > > Time Spent: 11h 10m > Remaining Estimate: 0h > > The Elasticsearch keyword field uses SortedSet DocValues. In our applications, > “keyword” is the most frequently used field type. > LUCENE-7081 added prefix compression for the doc values terms dict. We can do > better by replacing prefix compression with LZ4. In one of our applications, > the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB). > I've done simple tests based on real application data, comparing the > write/merge time cost and the on-disk *.dvd file size (after merging into 1 > segment). > || ||Before||After|| > |Write time cost(ms)|591972|618200| > |Merge time cost(ms)|270661|294663| > |*.dvd file size(GB)|1.95|1.15| > This feature is only for high-cardinality fields. > I'm doing a benchmark test based on luceneutil. Will attach the report and > patch after the test.
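The idea behind the merged change — compressing whole blocks of the terms dictionary instead of only sharing prefixes term-by-term — can be sketched self-contained. Here `java.util.zip.Deflater` stands in for the LZ4 codec the patch actually uses, and the block layout (NUL-separated concatenation) is a simplification for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.Deflater;

class TermsBlockCompressor {
    // Compress a block of sorted terms in one shot. High-cardinality sorted
    // terms share a lot of structure, which block compression exploits far
    // beyond what per-term prefix sharing can capture.
    static byte[] compressBlock(String[] terms) {
        StringBuilder sb = new StringBuilder();
        for (String t : terms) {
            sb.append(t).append('\0'); // NUL-separated concatenation of the block
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length + 64]; // ample for compressible input
        int len = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, len);
    }
}
```

On repetitive term sets like the keyword fields mentioned in the issue, the compressed block comes out much smaller than the raw bytes, which is the effect behind the ~41% .dvd size reduction reported above.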
[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality
[ https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284798#comment-17284798 ] Bruno Roustant commented on SOLR-15038: --- Thanks for the PR. I have some comments/questions there.
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281779#comment-17281779 ] Bruno Roustant commented on LUCENE-9663: I'm ready to merge. I think it could go to the 8.9 branch but I'd like to have confirmation. This change adds compression to Lucene80DocValuesFormat if Mode.BEST_COMPRESSION is used and is backward compatible. [~jpountz] any suggestion? Thanks
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278966#comment-17278966 ] Bruno Roustant commented on LUCENE-9663: The latest PR looks good. I'm going to merge it in a couple of days if there is no objection. [~Jaison] you may want to open another Jira issue if you want to propose more configuration for the compression (and you can link it to this issue).
[jira] [Resolved] (LUCENE-9646) Set BM25Similarity discountOverlaps via the constructor
[ https://issues.apache.org/jira/browse/LUCENE-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved LUCENE-9646. Fix Version/s: master (9.0) Resolution: Fixed Thank you [~pmarty] > Set BM25Similarity discountOverlaps via the constructor > --- > > Key: LUCENE-9646 > URL: https://issues.apache.org/jira/browse/LUCENE-9646 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: master (9.0) >Reporter: Patrick Marty >Assignee: Bruno Roustant >Priority: Trivial > Fix For: master (9.0) > > Time Spent: 40m > Remaining Estimate: 0h > > The BM25Similarity discountOverlaps parameter is true by default. > It can be set with the > {{org.apache.lucene.search.similarities.BM25Similarity#setDiscountOverlaps}} > method, but this method makes BM25Similarity mutable. > > discountOverlaps should be set via the constructor, and the > {{setDiscountOverlaps}} method should be removed to make BM25Similarity > immutable. > > PR: https://github.com/apache/lucene-solr/pull/2161
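The resolved change is easy to model in miniature: move discountOverlaps from a setter to a constructor parameter so the similarity object is immutable. The stand-in class below only sketches the pattern (the defaults k1=1.2, b=0.75, discountOverlaps=true match BM25Similarity's documented defaults); it is not Lucene code:

```java
// Miniature model of an immutable BM25-style similarity: all parameters
// are final and provided at construction time, with no setters.
final class ImmutableBM25 {
    final float k1;
    final float b;
    final boolean discountOverlaps;

    ImmutableBM25(float k1, float b, boolean discountOverlaps) {
        if (k1 < 0 || b < 0 || b > 1) {
            throw new IllegalArgumentException("illegal BM25 parameters");
        }
        this.k1 = k1;
        this.b = b;
        this.discountOverlaps = discountOverlaps;
    }

    // Convenience constructor keeping the previous defaults,
    // including discountOverlaps = true.
    ImmutableBM25() {
        this(1.2f, 0.75f, true);
    }
}
```

Because every field is final, an instance can be shared freely across index writers and searchers without the concurrent-mutation hazard the setter allowed.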
[jira] [Assigned] (LUCENE-9646) Set BM25Similarity discountOverlaps via the constructor
[ https://issues.apache.org/jira/browse/LUCENE-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant reassigned LUCENE-9646: -- Assignee: Bruno Roustant
[jira] [Commented] (LUCENE-9646) Set BM25Similarity discountOverlaps via the constructor
[ https://issues.apache.org/jira/browse/LUCENE-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264033#comment-17264033 ] Bruno Roustant commented on LUCENE-9646: I'm going to merge it tomorrow, on master only.
[jira] [Resolved] (SOLR-15061) Fix NPE in SearchHandler when shards.info
[ https://issues.apache.org/jira/browse/SOLR-15061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved SOLR-15061. --- Fix Version/s: 8.8 Resolution: Fixed > Fix NPE in SearchHandler when shards.info > - > > Key: SOLR-15061 > URL: https://issues.apache.org/jira/browse/SOLR-15061 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Minor > Fix For: 8.8 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This NPE happens in a specific case > - Short-circuited distributed request > - With shards.info > - With no QueryComponent (e.g. only spellcheck) > One-liner fix in SearchHandler.handleRequestBody().
[jira] [Comment Edited] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied
[ https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258851#comment-17258851 ] Bruno Roustant edited comment on LUCENE-9570 at 1/5/21, 11:51 AM: -- [~dweiss] Yes, finally it's done. I pushed the commit directly to your branch. I have been working on spatial3d since yesterday: 123 classes, some of them giant. It took me more time than anticipated. Lots of (inconsistently) missing braces on single-line ifs; I added them and hope I didn't miss too many. Many multi-line comments where the lines were truncated with a newline inserted; I had to reformat the comments manually. Badly formatted commented-out code, either line-truncated or with no space between concatenated strings (I added spaces as often as I could). Lots of duplicated code to reformat again and again.
> Review code diffs after automatic formatting and correct problems before it > is applied > -- > > Key: LUCENE-9570 > URL: https://issues.apache.org/jira/browse/LUCENE-9570 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Blocker > Time Spent: 20m > Remaining Estimate: 0h > > Review and correct all the javadocs before they're messed up by automatic > formatting. Apply project-by-project, review diff, correct. Lots of diffs but > it should be relatively quick. > *Reviewing diffs manually* > * switch to branch jira/LUCENE-9570 which the PR is based on: > {code:java} > git remote add dweiss g...@github.com:dweiss/lucene-solr.git > git fetch dweiss > git checkout jira/LUCENE-9570 > {code} > * Open gradle/validation/spotless.gradle and locate the project/ package you > wish to review. Enable it in spotless.gradle by creating a corresponding > switch case block (refer to existing examples), for example: > {code:java} > case ":lucene:highlighter": > target "src/**" > targetExclude "**/resources/**", "**/overview.html" > break > {code} > * Reformat the code: > {code:java} > gradlew tidy && git diff -w > /tmp/diff.patch && git status > {code} > * Look at what has changed (git status) and review the differences manually > (/tmp/diff.patch). If everything looks ok, commit it directly to > jira/LUCENE-9570 or make a PR against that branch. > {code:java} > git commit -am ":lucene:core - src/**/org/apache/lucene/document/**" > {code} > *Packages remaining* (put your name next to a module you're working on to > avoid duplication). > * case ":lucene:spatial3d": (Bruno Roustant)
[jira] [Commented] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied
[ https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258851#comment-17258851 ] Bruno Roustant commented on LUCENE-9570: [~dweiss] Yes, finally it's done. I pushed the commit directly to your branch. I have been working on spatial3d since yesterday: 123 classes, some of them giant. It took me more time than anticipated. Lots of (inconsistently) missing braces on single-line ifs; I added them and hope I didn't miss too many. Many multi-line comments where the lines were truncated with a newline inserted; I had to reformat the comments manually. Badly formatted commented-out code, either line-truncated or with no space between concatenated strings (I added spaces as often as I could). Lots of duplicated code to reformat again and again.
[jira] [Commented] (LUCENE-9646) Set BM25Similarity discountOverlaps via the constructor
[ https://issues.apache.org/jira/browse/LUCENE-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258142#comment-17258142 ] Bruno Roustant commented on LUCENE-9646: I see that the ClassicSimilarity constructor has a comment "Sole constructor: parameter-free". I don't know why it was designed with this parameter-free constructor plus the same setDiscountOverlaps setter inherited from TFIDFSimilarity. Based on the usages of this setter, these similarities could indeed be immutable instead.
[jira] [Updated] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied
[ https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-9570: --- Description: Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick. *Reviewing diffs manually* * switch to branch jira/LUCENE-9570 which the PR is based on: {code:java} git remote add dweiss g...@github.com:dweiss/lucene-solr.git git fetch dweiss git checkout jira/LUCENE-9570 {code} * Open gradle/validation/spotless.gradle and locate the project/ package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example: {code:java} case ":lucene:highlighter": target "src/**" targetExclude "**/resources/**", "**/overview.html" break {code} * Reformat the code: {code:java} gradlew tidy && git diff -w > /tmp/diff.patch && git status {code} * Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch. {code:java} git commit -am ":lucene:core - src/**/org/apache/lucene/document/**" {code} *Packages remaining* (put your name next to a module you're working on to avoid duplication). * case ":lucene:luke": * case ":lucene:sandbox": (Erick Erickson) * case ":lucene:spatial3d": (Bruno Roustant) * case ":lucene:spatial-extras": * case ":lucene:suggest": * case ":lucene:test-framework": was: Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick. 
*Reviewing diffs manually* * switch to branch jira/LUCENE-9570 which the PR is based on: {code:java} git remote add dweiss g...@github.com:dweiss/lucene-solr.git git fetch dweiss git checkout jira/LUCENE-9570 {code} * Open gradle/validation/spotless.gradle and locate the project/ package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example: {code:java} case ":lucene:highlighter": target "src/**" targetExclude "**/resources/**", "**/overview.html" break {code} * Reformat the code: {code:java} gradlew tidy && git diff -w > /tmp/diff.patch && git status {code} * Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch. {code:java} git commit -am ":lucene:core - src/**/org/apache/lucene/document/**" {code} *Packages remaining* (put your name next to a module you're working on to avoid duplication). * case ":lucene:luke": * case ":lucene:sandbox": (Erick Erickson) * case ":lucene:spatial3d": * case ":lucene:spatial-extras": * case ":lucene:suggest": * case ":lucene:test-framework": > Review code diffs after automatic formatting and correct problems before it > is applied > -- > > Key: LUCENE-9570 > URL: https://issues.apache.org/jira/browse/LUCENE-9570 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Blocker > Time Spent: 10m > Remaining Estimate: 0h > > Review and correct all the javadocs before they're messed up by automatic > formatting. Apply project-by-project, review diff, correct. Lots of diffs but > it should be relatively quick. 
> *Reviewing diffs manually*
> * switch to branch jira/LUCENE-9570 which the PR is based on:
> {code:java}
> git remote add dweiss g...@github.com:dweiss/lucene-solr.git
> git fetch dweiss
> git checkout jira/LUCENE-9570
> {code}
> * Open gradle/validation/spotless.gradle and locate the project/package you
> wish to review. Enable it in spotless.gradle by creating a corresponding
> switch case block (refer to existing examples), for example:
> {code:java}
> case ":lucene:highlighter":
> target "src/**"
> targetExclude "**/resources/**", "**/overview.html"
> break
> {code}
> * Reformat the code:
> {code:java}
> gradlew tidy && git diff -w > /tmp/diff.patch && git status
> {code}
> * Look at what has changed (git status) and review the differences manually
> (/tmp/diff.patch). If everything looks ok, commit it directly to
> jira/LUCENE-9570 or make a PR against that branch.
> {code:java}
> git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
> {code}
> *Packages remaining* (put your name next to a module you're working on to
> avoid
[jira] [Created] (SOLR-15061) Fix NPE in SearchHandler when shards.info
Bruno Roustant created SOLR-15061: - Summary: Fix NPE in SearchHandler when shards.info Key: SOLR-15061 URL: https://issues.apache.org/jira/browse/SOLR-15061 Project: Solr Issue Type: Bug Security Level: Public (Default Security Level. Issues are Public) Reporter: Bruno Roustant Assignee: Bruno Roustant
This NPE happens in a specific case:
- Short-circuited distributed request
- With shards.info
- With no QueryComponent (e.g. only spellcheck)
One-liner fix in SearchHandler.handleRequestBody().
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
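The one-liner fix itself is not shown in the email above. A minimal sketch of the kind of null guard it implies (all names below are hypothetical stand-ins, not the actual SearchHandler code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: when no QueryComponent ran (e.g. spellcheck-only,
// short-circuited distributed request), the per-shard response info may be
// null, so guard before attaching shards.info to the response.
class ShardInfoGuard {
    static Map<String, Object> buildResponse(Map<String, Object> shardResponseInfo,
                                             boolean shardsInfoRequested) {
        Map<String, Object> out = new HashMap<>();
        // The guard: only attach shards.info when the info actually exists.
        if (shardsInfoRequested && shardResponseInfo != null) {
            out.put("shards.info", shardResponseInfo);
        }
        return out;
    }
}
```

With the guard in place, the null case simply yields a response without the shards.info section instead of an NPE.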
[jira] [Updated] (SOLR-15060) Introduce DelegatingDirectoryFactory
[ https://issues.apache.org/jira/browse/SOLR-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-15060: -- Description: FilterDirectory already exists to delegate to a Directory, but there is no delegating DirectoryFactory. +Use cases:+ A DelegatingDirectoryFactory could be used in SOLR-15051 to make BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be configured in solrconfig.xml then it allows any user to change the delegate DirectoryFactory without code. A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory that could encrypt/decrypt any delegate DirectoryFactory. Here again it should be configurable in solrconfig.xml. +Problem:+ But currently DirectoryFactory delegation does not work with a delegate CachingDirectoryFactory because the get() method creates internally a Directory instance of a type which cannot be controlled by the caller (the DelegatingDirectoryFactory). +Proposal:+ So here we propose to change DirectoryFactory.get() method by adding a fourth parameter Function that allows the caller to wrap the internal Directory with a custom FilterDirectory when it is created. Hence we would have a DelegatingDirectoryFactory that could delegate the creation of some FilterDirectory. was: FilterDirectory already exists to delegate to a Directory, but there is no delegating DirectoryFactory. +Use cases:+ A DelegatingDirectoryFactory could be used in SOLR-15051 to make BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be configured in solrconfig.xml then it allows any user to change the delegate DirectoryFactory without code. A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory that could encrypt/decrypt any delegate DirectoryFactory. Here again it should be configurable in solrconfig.xml. 
+Problem:+ But currently DirectoryFactory delegation does not work with a delegate CachingDirectoryFactory because the get() method creates internally a Directory instance of a type which cannot be controlled by the caller (the DelegatingDirectoryFactory). +Proposal:+ So here we propose to change DirectoryFactory.get() method by adding a fourth parameter Function that allows the caller to wrap the internal Directory with a custom FilterDirectory when it is created. +Benefit:+ Hence we would have a DelegatingDirectoryFactory that could delegate the creation of some FilterDirectory. > Introduce DelegatingDirectoryFactory > > > Key: SOLR-15060 > URL: https://issues.apache.org/jira/browse/SOLR-15060 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > FilterDirectory already exists to delegate to a Directory, but there is no > delegating DirectoryFactory. > +Use cases:+ > A DelegatingDirectoryFactory could be used in SOLR-15051 to make > BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. > MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be > configured in solrconfig.xml then it allows any user to change the delegate > DirectoryFactory without code. > A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory > that could encrypt/decrypt any delegate DirectoryFactory. Here again it > should be configurable in solrconfig.xml. > +Problem:+ > But currently DirectoryFactory delegation does not work with a delegate > CachingDirectoryFactory because the get() method creates internally a > Directory instance of a type which cannot be controlled by the caller (the > DelegatingDirectoryFactory). 
> +Proposal:+
> So here we propose to change DirectoryFactory.get() method by adding a
> fourth parameter Function that allows the caller to
> wrap the internal Directory with a custom FilterDirectory when it is created.
> Hence we would have a DelegatingDirectoryFactory that could delegate the
> creation of some FilterDirectory.
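The proposed wrapping hook can be sketched with minimal stand-in types. Everything below is hypothetical (the real Directory and DirectoryFactory classes live in Lucene/Solr and have different signatures); it only illustrates how a caller-supplied Function lets a delegating factory control the Directory that another factory creates internally:

```java
import java.util.function.Function;

// Minimal stand-ins for illustration; NOT the real Lucene/Solr classes.
class Directory {
    final String path;
    Directory(String path) { this.path = path; }
}

class FilterDirectory extends Directory {
    final Directory delegate;
    FilterDirectory(Directory delegate) { super(delegate.path); this.delegate = delegate; }
}

abstract class DirectoryFactory {
    // Proposed shape: the extra Function parameter lets the caller wrap the
    // Directory that the factory creates internally (e.g. inside a caching factory).
    abstract Directory get(String path, String lockType, String dirContext,
                           Function<Directory, Directory> wrapper);
}

class SimpleDirectoryFactory extends DirectoryFactory {
    @Override
    Directory get(String path, String lockType, String dirContext,
                  Function<Directory, Directory> wrapper) {
        Directory internal = new Directory(path); // created internally by the factory
        return wrapper.apply(internal);           // caller-controlled wrapping point
    }
}

class DelegatingDirectoryFactory extends DirectoryFactory {
    final DirectoryFactory delegate;
    DelegatingDirectoryFactory(DirectoryFactory delegate) { this.delegate = delegate; }

    @Override
    Directory get(String path, String lockType, String dirContext,
                  Function<Directory, Directory> wrapper) {
        // Delegate the creation, but wrap the internally created Directory in a
        // FilterDirectory, composing with any wrapper the caller supplied.
        return delegate.get(path, lockType, dirContext,
                dir -> new FilterDirectory(wrapper.apply(dir)));
    }
}
```

The key point of the design is that the wrapping happens at creation time, inside the delegate factory, which is exactly what a plain delegating factory cannot reach today.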
[jira] [Updated] (SOLR-15060) Introduce DelegatingDirectoryFactory
[ https://issues.apache.org/jira/browse/SOLR-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-15060: -- Description: FilterDirectory already exists to delegate to a Directory, but there is no delegating DirectoryFactory. +Use cases:+ A DelegatingDirectoryFactory could be used in SOLR-15051 to make BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be configured in solrconfig.xml then it allows any user to change the delegate DirectoryFactory without code. A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory that could encrypt/decrypt any delegate DirectoryFactory. Here again it should be configurable in solrconfig.xml. +Problem:+ But currently DirectoryFactory delegation does not work with a delegate CachingDirectoryFactory because the get() method creates internally a Directory instance of a type which cannot be controlled by the caller (the DelegatingDirectoryFactory). +Proposal:+ So here we propose to change DirectoryFactory.get() method by adding a fourth parameter Function that allows the caller to wrap the internal Directory with a custom FilterDirectory when it is created. +Benefit:+ Hence we would have a DelegatingDirectoryFactory that could delegate the creation of some FilterDirectory. was: FilterDirectory already exists to delegate to a Directory, but there is no delegating DirectoryFactory. +Use cases:+ A DelegatingDirectoryFactory could be used in SOLR-15051 to make BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be configured in solrconfig.xml then it allows any user to change the delegate DirectoryFactory without code. A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory that could encrypt/decrypt any delegate DirectoryFactory. 
Here again it should be configurable in solrconfig.xml. +Problem: +But currently DirectoryFactory delegation does not work with a delegate CachingDirectoryFactory because the get() method creates internally a Directory instance of a type which cannot be controlled by the caller (the DelegatingDirectoryFactory). +Proposal:+ So here we propose to change DirectoryFactory.get() method by adding a fourth parameter Function that allows the caller to wrap the internal Directory with a custom FilterDirectory when it is created. +Benefit: +Hence we would have a DelegatingDirectoryFactory that could delegate the creation of some FilterDirectory. > Introduce DelegatingDirectoryFactory > > > Key: SOLR-15060 > URL: https://issues.apache.org/jira/browse/SOLR-15060 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > > FilterDirectory already exists to delegate to a Directory, but there is no > delegating DirectoryFactory. > +Use cases:+ > A DelegatingDirectoryFactory could be used in SOLR-15051 to make > BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. > MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be > configured in solrconfig.xml then it allows any user to change the delegate > DirectoryFactory without code. > A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory > that could encrypt/decrypt any delegate DirectoryFactory. Here again it > should be configurable in solrconfig.xml. > +Problem:+ > But currently DirectoryFactory delegation does not work with a delegate > CachingDirectoryFactory because the get() method creates internally a > Directory instance of a type which cannot be controlled by the caller (the > DelegatingDirectoryFactory). 
> +Proposal:+
> So here we propose to change DirectoryFactory.get() method by adding a
> fourth parameter Function that allows the caller to
> wrap the internal Directory with a custom FilterDirectory when it is created.
> +Benefit:+
> Hence we would have a DelegatingDirectoryFactory that could delegate the
> creation of some FilterDirectory.
[jira] [Created] (SOLR-15060) Introduce DelegatingDirectoryFactory
Bruno Roustant created SOLR-15060: - Summary: Introduce DelegatingDirectoryFactory Key: SOLR-15060 URL: https://issues.apache.org/jira/browse/SOLR-15060 Project: Solr Issue Type: Improvement Security Level: Public (Default Security Level. Issues are Public) Reporter: Bruno Roustant Assignee: Bruno Roustant FilterDirectory already exists to delegate to a Directory, but there is no delegating DirectoryFactory. +Use cases:+ A DelegatingDirectoryFactory could be used in SOLR-15051 to make BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be configured in solrconfig.xml then it allows any user to change the delegate DirectoryFactory without code. A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory that could encrypt/decrypt any delegate DirectoryFactory. Here again it should be configurable in solrconfig.xml. +Problem: +But currently DirectoryFactory delegation does not work with a delegate CachingDirectoryFactory because the get() method creates internally a Directory instance of a type which cannot be controlled by the caller (the DelegatingDirectoryFactory). +Proposal:+ So here we propose to change DirectoryFactory.get() method by adding a fourth parameter Function that allows the caller to wrap the internal Directory with a custom FilterDirectory when it is created. +Benefit: +Hence we would have a DelegatingDirectoryFactory that could delegate the creation of some FilterDirectory.
[jira] [Resolved] (SOLR-14975) Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames
[ https://issues.apache.org/jira/browse/SOLR-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved SOLR-14975. --- Resolution: Fixed Thanks Erick and David for the review! > Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames > -- > > Key: SOLR-14975 > URL: https://issues.apache.org/jira/browse/SOLR-14975 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Priority: Major > Time Spent: 4h 40m > Remaining Estimate: 0h > > The methods CoreContainer.getAllCoreNames and getLoadedCoreNames hold a lock > while they grab core names to put into a TreeSet. When there are *many* > cores, this delay is noticeable. Holding this lock effectively blocks > queries since queries lookup a core; so it's critically important that these > methods are *fast*. The tragedy here is that some callers merely want to > know if a particular name is in the set, or what the aggregated size is. > Some callers want to iterate the names but don't really care what the > iteration order is. > I propose that some callers of these two methods find suitable alternatives, > like getCoreDescriptor to check for null. And I propose that these methods > return a HashSet -- no order. If the caller wants it sorted, it can do so > itself.
[jira] [Commented] (SOLR-14975) Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames
[ https://issues.apache.org/jira/browse/SOLR-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227546#comment-17227546 ] Bruno Roustant commented on SOLR-14975: --- I added a PR. I believe descriptors are a superset of loaded cores. And also permanent cores are distinct from transient cores. I simplified the logic to create lists based on distinct sets. And I added assertions to verify the sets are indeed distinct. All tests passed. > Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames > -- > > Key: SOLR-14975 > URL: https://issues.apache.org/jira/browse/SOLR-14975 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) >Reporter: David Smiley >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > The methods CoreContainer.getAllCoreNames and getLoadedCoreNames hold a lock > while they grab core names to put into a TreeSet. When there are *many* > cores, this delay is noticeable. Holding this lock effectively blocks > queries since queries lookup a core; so it's critically important that these > methods are *fast*. The tragedy here is that some callers merely want to > know if a particular name is in the set, or what the aggregated size is. > Some callers want to iterate the names but don't really care what the > iteration order is. > I propose that some callers of these two methods find suitable alternatives, > like getCoreDescriptor to check for null. And I propose that these methods > return a HashSet -- no order. If the caller wants it sorted, it can do so > itself.
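The change discussed here (build the result from distinct name sets under a minimal lock, return it unordered, and let callers sort on their own) can be sketched with a stand-in class. CoreNames below is hypothetical, not the actual CoreContainer code:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the idea: keep the synchronized section to a cheap set copy and
// return an unordered set; callers that need ordering sort outside the lock.
class CoreNames {
    private final Set<String> permanentCores = new HashSet<>();
    private final Set<String> transientCores = new HashSet<>();
    private final Object lock = new Object();

    void addPermanent(String name) { synchronized (lock) { permanentCores.add(name); } }
    void addTransient(String name) { synchronized (lock) { transientCores.add(name); } }

    // Unordered union; no TreeSet, no sorting while holding the lock.
    Set<String> getAllCoreNames() {
        synchronized (lock) {
            // The two sets are expected to be distinct, as the comment above notes.
            assert Collections.disjoint(permanentCores, transientCores);
            Set<String> all = new HashSet<>(permanentCores);
            all.addAll(transientCores);
            return all;
        }
    }

    // A caller that wants order pays for the sort itself, outside the lock.
    List<String> getAllCoreNamesSorted() {
        List<String> names = new ArrayList<>(getAllCoreNames());
        names.sort(null);
        return names;
    }
}
```

A caller that only needs a membership check would not call either method; it would look the single name up directly (the getCoreDescriptor null-check mentioned in the issue).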
[jira] [Resolved] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
[ https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved LUCENE-9455. Fix Version/s: 8.8 Resolution: Fixed Thanks [~zacharymorn] > ExitableTermsEnum (in ExitableDirectoryReader) should sample next() > --- > > Key: LUCENE-9455 > URL: https://issues.apache.org/jira/browse/LUCENE-9455 > Project: Lucene - Core > Issue Type: Improvement > Components: core/other >Reporter: David Smiley >Priority: Major > Labels: newdev > Fix For: 8.8 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > ExitableTermsEnum calls "checkAndThrow" on *every* call to next(). This is > too expensive; it should sample. I observed ElasticSearch uses the same > approach; I think Lucene would benefit from this: > https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151 > CC [~jimczi]
[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
[ https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219291#comment-17219291 ] Bruno Roustant commented on LUCENE-9455: I plan to merge tomorrow. > ExitableTermsEnum (in ExitableDirectoryReader) should sample next() > --- > > Key: LUCENE-9455 > URL: https://issues.apache.org/jira/browse/LUCENE-9455 > Project: Lucene - Core > Issue Type: Improvement > Components: core/other >Reporter: David Smiley >Priority: Major > Labels: newdev > Time Spent: 2h > Remaining Estimate: 0h > > ExitableTermsEnum calls "checkAndThrow" on *every* call to next(). This is > too expensive; it should sample. I observed ElasticSearch uses the same > approach; I think Lucene would benefit from this: > https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151 > CC [~jimczi]
[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
[ https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218810#comment-17218810 ] Bruno Roustant commented on LUCENE-9455: Your investigation revealed something I overlooked: MultiTermQueryConstantScoreWrapper BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD prevents creating too many TermQuery instances. So there won't be too many calls to the ExitableTermsEnum constructor. We actually don't need to sample it. > ExitableTermsEnum (in ExitableDirectoryReader) should sample next() > --- > > Key: LUCENE-9455 > URL: https://issues.apache.org/jira/browse/LUCENE-9455 > Project: Lucene - Core > Issue Type: Improvement > Components: core/other >Reporter: David Smiley >Priority: Major > Labels: newdev > Time Spent: 1h 50m > Remaining Estimate: 0h > > ExitableTermsEnum calls "checkAndThrow" on *every* call to next(). This is > too expensive; it should sample. I observed ElasticSearch uses the same > approach; I think Lucene would benefit from this: > https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151 > CC [~jimczi]
[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
[ https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218367#comment-17218367 ] Bruno Roustant commented on LUCENE-9455: I propose to also sample the call to QueryTimeout.shouldExit() in the ExitableTermsEnum constructor with (System.identityHashCode(this) & TIMEOUT_CHECK_SAMPLING) == 0 > ExitableTermsEnum (in ExitableDirectoryReader) should sample next() > --- > > Key: LUCENE-9455 > URL: https://issues.apache.org/jira/browse/LUCENE-9455 > Project: Lucene - Core > Issue Type: Improvement > Components: core/other >Reporter: David Smiley >Priority: Major > Labels: newdev > Time Spent: 50m > Remaining Estimate: 0h > > ExitableTermsEnum calls "checkAndThrow" on *every* call to next(). This is > too expensive; it should sample. I observed ElasticSearch uses the same > approach; I think Lucene would benefit from this: > https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151 > CC [~jimczi]
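The sampling idea discussed in these comments can be sketched with a stand-in class: check the expensive QueryTimeout.shouldExit() only every Nth next() call via a bitmask, and sample the constructor-side check using the identity hash, as the comment above proposes. The class, the 0xFF mask value, and the checks counter below are hypothetical; only the two sampling expressions come from the discussion:

```java
import java.util.function.BooleanSupplier;

// Sketch of bitmask sampling for timeout checks: the expensive shouldExit()
// call runs only once every (TIMEOUT_CHECK_SAMPLING + 1) invocations of next().
class SampledTimeoutCheck {
    // Power-of-two minus one, so (counter & mask) cycles cheaply. Value assumed.
    private static final int TIMEOUT_CHECK_SAMPLING = 0xFF; // 1 check per 256 calls
    private int counter;
    private final BooleanSupplier shouldExit; // stand-in for QueryTimeout.shouldExit()
    int checks; // how many real timeout checks actually ran (for illustration)

    SampledTimeoutCheck(BooleanSupplier shouldExit) {
        this.shouldExit = shouldExit;
        // Constructor-side sampling, as proposed above: only some instances
        // (selected by identity hash) pay the check at construction time.
        if ((System.identityHashCode(this) & TIMEOUT_CHECK_SAMPLING) == 0) {
            checkTimeout();
        }
    }

    void next() {
        // Per-call sampling: one real check every 256 calls.
        if ((++counter & TIMEOUT_CHECK_SAMPLING) == 0) {
            checkTimeout();
        }
    }

    private void checkTimeout() {
        checks++;
        if (shouldExit.getAsBoolean()) {
            throw new RuntimeException("query timeout exceeded");
        }
    }
}
```

The trade-off is latency of detection: a timeout is noticed up to 255 calls late, which is acceptable when next() itself is cheap.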
[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
[ https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218292#comment-17218292 ] Bruno Roustant commented on LUCENE-9455: Thanks [~zacharymorn]. I added comments to the review. Fyi watchers, I have a broader question in the PR, I repeat it here: Overall I wonder if we can do better with the sampling. The goal is to avoid doing numerous repetitive calls to QueryTimeout.shouldExit(). This is essentially the case for multi-term queries. But actually for multi-term queries, a new TermsEnum is created for each matching term (in TermQuery.getTermsEnum(), to get doc ids). So we end up only sampling half of the calls to QueryTimeout.shouldExit() since the other half is done by the ExitableTermsEnum constructor which is not sampled. It would be better to also sample the ExitableTermsEnum constructor, but I don't know yet how to do that. > ExitableTermsEnum (in ExitableDirectoryReader) should sample next() > --- > > Key: LUCENE-9455 > URL: https://issues.apache.org/jira/browse/LUCENE-9455 > Project: Lucene - Core > Issue Type: Improvement > Components: core/other >Reporter: David Smiley >Priority: Major > Labels: newdev > Time Spent: 40m > Remaining Estimate: 0h > > ExitableTermsEnum calls "checkAndThrow" on *every* call to next(). This is > too expensive; it should sample. I observed ElasticSearch uses the same > approach; I think Lucene would benefit from this: > https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151 > CC [~jimczi]
[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
[ https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214491#comment-17214491 ] Bruno Roustant commented on LUCENE-9455: I planned to work on this (I still plan) but I'm actually too busy on other stuff. So if you want to try it out, yes, please share a PR here. I'll be glad to participate in the review. > ExitableTermsEnum (in ExitableDirectoryReader) should sample next() > --- > > Key: LUCENE-9455 > URL: https://issues.apache.org/jira/browse/LUCENE-9455 > Project: Lucene - Core > Issue Type: Improvement > Components: core/other >Reporter: David Smiley >Priority: Major > > ExitableTermsEnum calls "checkAndThrow" on *every* call to next(). This is > too expensive; it should sample. I observed ElasticSearch uses the same > approach; I think Lucene would benefit from this: > https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151 > CC [~jimczi]
[jira] [Resolved] (SOLR-14905) Update commons-io version to 2.8.0 due to security vulnerability
[ https://issues.apache.org/jira/browse/SOLR-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved SOLR-14905. --- Fix Version/s: 8.7 Resolution: Fixed Thanks [~nazerke], this is in. > Update commons-io version to 2.8.0 due to security vulnerability > > > Key: SOLR-14905 > URL: https://issues.apache.org/jira/browse/SOLR-14905 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: security >Affects Versions: 8.6.2 >Reporter: Nazerke Seidan >Priority: Minor > Fix For: 8.7 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > The {{commons-io}} (version 2.6) package is vulnerable to Path Traversal. The > {{getPrefixLength}} method in {{FilenameUtils.class}} improperly verifies the > hostname value received from user input before processing client requests. > The issue has been fixed in 2.7 onward: > (https://issues.apache.org/jira/browse/IO-556, > https://issues.apache.org/jira/browse/IO-559)
[jira] [Commented] (SOLR-14905) Update commons-io version to 2.8.0 due to security vulnerability
[ https://issues.apache.org/jira/browse/SOLR-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204715#comment-17204715 ] Bruno Roustant commented on SOLR-14905: --- Thanks Nazerke. I'm testing your PR on my side also. > Update commons-io version to 2.8.0 due to security vulnerability > > > Key: SOLR-14905 > URL: https://issues.apache.org/jira/browse/SOLR-14905 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: security >Affects Versions: 8.6.2 >Reporter: Nazerke Seidan >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > The {{commons-io}} (version 2.6) package is vulnerable to Path Traversal. The > {{getPrefixLength}} method in {{FilenameUtils.class}} improperly verifies the > hostname value received from user input before processing client requests. > The issue has been fixed in 2.7 onward: > (https://issues.apache.org/jira/browse/IO-556, > https://issues.apache.org/jira/browse/IO-559)
[jira] [Resolved] (SOLR-14819) SOLR has linear log operations that could trivially be linear
[ https://issues.apache.org/jira/browse/SOLR-14819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved SOLR-14819. --- Fix Version/s: 8.7 Resolution: Fixed Thanks Thomas for this fix. Please share the info about this detection tool in the Lucene/Solr dev list ([https://lucene.apache.org/solr/community.html#mailing-lists-irc]) for a wider audience and discussion. Personally I find the report verbose, so I wonder how verbose it would be on a specific Jira issue. Clearly the difficulty is to avoid too many false detections. > SOLR has linear log operations that could trivially be linear > - > > Key: SOLR-14819 > URL: https://issues.apache.org/jira/browse/SOLR-14819 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Thomas DuBuisson >Priority: Trivial > Fix For: 8.7 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The SOLR code has a few linear log operations that could be linear. That is, > operations of > > ``` > for(key in hashmap) doThing(hashmap.get(key)); > > ``` > > vs just `for(value in hashmap) doThing(value)` > > I have a PR incoming on GitHub to fix a couple of these issues as [found by > Infer on > Muse|https://console.muse.dev/result/TomMD/lucene-solr/01EH5WXS6C1RH1NFYHP6ATXTZ9?search=JsonSchemaValidator=results]. >
[jira] [Resolved] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved SOLR-14782. --- Fix Version/s: 8.7 Resolution: Fixed No code change. Fixed by adding an example of how to unescape for QueryElevationComponent in the doc. Thanks Thomas for pointing this out. > QueryElevationComponent does not handle escaped query terms > --- > > Key: SOLR-14782 > URL: https://issues.apache.org/jira/browse/SOLR-14782 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: query parsers >Affects Versions: 8.2 >Reporter: Thomas Schmiereck >Assignee: Bruno Roustant >Priority: Major > Labels: elevation > Fix For: 8.7 > > Attachments: SOLR-14782.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > h1. Description > if the elevate.xml contains an entry with spaces: > <query text="aaa bbb"><doc id="core2docId2" /> > and the Solr query term is escaped: > {{?q=aaa+bbb}} > the Solr search itself handles this correctly, but the elevate component > "QueryElevationComponent" does not unescape the query term before the lookup > in the elevate.xml. > Result is that the entry is not elevated. > A valid (not escaped) query like: > {{?q=aaa%20bbb}} > is working. > h1. Technical Notes > see: > org.apache.solr.handler.component.QueryElevationComponent.MapElevationProvider#getElevationForQuery >
[jira] [Commented] (SOLR-14819) SOLR has linear log operations that could trivially be linear
[ https://issues.apache.org/jira/browse/SOLR-14819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189211#comment-17189211 ] Bruno Roustant commented on SOLR-14819: --- [~tommd] were there other issues found by Infer on Muse? > SOLR has linear log operations that could trivially be linear > - > > Key: SOLR-14819 > URL: https://issues.apache.org/jira/browse/SOLR-14819 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Thomas DuBuisson >Priority: Trivial > Time Spent: 0.5h > Remaining Estimate: 0h > > The SOLR code has a few linear log operations that could be linear. That is, > operations of > > ``` > for(key in hashmap) doThing(hashmap.get(key)); > > ``` > > vs just `for(value in hashmap) doThing(value)` > > I have a PR incoming on GitHub to fix a couple of these issues as [found by > Infer on > Muse|https://console.muse.dev/result/TomMD/lucene-solr/01EH5WXS6C1RH1NFYHP6ATXTZ9?search=JsonSchemaValidator=results]. >
[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189123#comment-17189123 ] Bruno Roustant commented on SOLR-14782: --- I added a PR for the doc improvement. I think it will be sufficient (no need for an UnescapeCharFilterFactory). Thomas, you should use a StandardTokenizerFactory because with 8.3 it becomes possible to match a subset of the query terms (not always a full match anymore), see SOLR-11866. If you use a KeywordTokenizerFactory, it produces only the full query as a single token, so no subset matching is possible. > QueryElevationComponent does not handle escaped query terms > --- > > Key: SOLR-14782 > URL: https://issues.apache.org/jira/browse/SOLR-14782 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: query parsers >Affects Versions: 8.2 >Reporter: Thomas Schmiereck >Assignee: Bruno Roustant >Priority: Major > Labels: elevation > Attachments: SOLR-14782.patch > > Time Spent: 10m > Remaining Estimate: 0h > > h1. Description > if the elevate.xml contains an entry with spaces: > <query text="aaa bbb"><doc id="core2docId2" /> > and the Solr query term is escaped: > {{?q=aaa+bbb}} > the Solr search itself handles this correctly, but the elevate component > "QueryElevationComponent" does not unescape the query term before the lookup > in the elevate.xml. > Result is that the entry is not elevated. > A valid (not escaped) query like: > {{?q=aaa%20bbb}} > is working. > h1. Technical Notes > see: > org.apache.solr.handler.component.QueryElevationComponent.MapElevationProvider#getElevationForQuery >
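The char-filter approach suggested in this thread boils down to unescaping the query text before it is looked up in elevate.xml. A standalone sketch of that normalization step (assuming Lucene-style backslash escaping; the actual fieldType/CharFilter definition from the comment did not survive in this archive, so the class and pattern below are illustrative only):

```java
import java.util.regex.Pattern;

// Sketch: strip query-syntax backslash escapes (e.g. "aaa\ bbb" -> "aaa bbb")
// before looking a query up in elevate.xml. The real fix was documentation-only;
// this only illustrates the kind of normalization a pattern-replace char filter
// can perform in the elevation analysis chain.
class ElevationUnescape {
    // Matches a backslash followed by any character, capturing the character.
    private static final Pattern ESCAPE = Pattern.compile("\\\\(.)");

    static String unescape(String query) {
        // Replace each backslash-escaped character with the character itself.
        return ESCAPE.matcher(query).replaceAll("$1");
    }
}
```

With this normalization, an escaped query such as `aaa\ bbb` maps to the same elevation key as the unescaped `aaa bbb`, which is the behavior the reporter expected.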
[jira] [Comment Edited] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189098#comment-17189098 ] Bruno Roustant edited comment on SOLR-14782 at 9/2/20, 9:17 AM: [~smk] you should add a CharFilter to unescape for query elevation. Instead of using lowercase for the queryFieldType you could use unescapelowercase with the following definition: {code:java} {code} [~dsmiley] this regex-pattern-replacement CharFilter is not easy to write. Should we have a new, simpler and equivalent UnescapeCharFilterFactory? In any case, I think it would be nice to add a paragraph explaining this in the QueryElevationComponent doc.
[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189098#comment-17189098 ] Bruno Roustant commented on SOLR-14782: --- [~smk] you should add a CharFilter to unescape for query elevation. Instead of using lowercase for the queryFieldType you could use unescapelowercase with the following definition: {code:java} {code} [~dsmiley] this regex-pattern-replacement CharFilter is not easy to write. Should we have a new, simpler and equivalent UnescapeCharFilterFactory? In any case, I think it would be nice to add a paragraph explaining this in the QueryElevationComponent doc.
[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188353#comment-17188353 ] Bruno Roustant commented on SOLR-14782: --- Ok, so I understand the expectation of unescaping in your use-case, but that's not always what is wanted. For example, someone else with a custom analyzer could handle this differently: a different way to escape/unescape, or special logic when escaped characters are encountered. Maybe we should have a simple way to configure the escaping for simple use-cases. We could enhance the elevation config file (elevate.xml) to support an additional tag. This unescaping would be false by default (for back-compat) and could be enabled simply. What is your opinion [~dsmiley]?
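The unescaping discussed above can be sketched in plain Java. This is an illustrative sketch, not Solr code: the class and method names are hypothetical, and a real implementation would be a CharFilter in the analysis chain rather than a static helper.

```java
// Sketch of the unescaping a hypothetical UnescapeCharFilterFactory would
// perform: drop the backslash that Solr/Lucene query syntax uses to escape
// special characters, keeping the escaped character itself.
public class QueryUnescape {

    static String unescape(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                c = input.charAt(++i); // keep only the escaped character
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // An escaped space, as in q=aaa\+bbb, unescapes back to "aaa bbb",
        // which then matches the elevate.xml entry text="aaa bbb".
        System.out.println(unescape("aaa\\ bbb")); // aaa bbb
        System.out.println(unescape("1\\+1"));     // 1+1
    }
}
```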
[jira] [Comment Edited] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188251#comment-17188251 ] Bruno Roustant edited comment on SOLR-14782 at 9/1/20, 8:15 AM: [~smk] what is the analyzer defined in your schema for the fieldType corresponding to the "elevate" searchComponent (the value defined for queryFieldType)? This analyzer is used to tokenize/filter both the elevation rules and the search query in the QueryElevationComponent, so this analyzer could unescape characters.
[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188251#comment-17188251 ] Bruno Roustant commented on SOLR-14782: --- [~smk] what is the analyzer defined in your schema for the fieldType corresponding to the "elevate" searchComponent (the value defined for queryFieldType)? This analyzer is used to tokenize/filter both the elevation rules and the search query in the QueryElevationComponent, so this analyzer could unescape characters.
[jira] [Comment Edited] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187563#comment-17187563 ] Bruno Roustant edited comment on SOLR-14782 at 8/31/20, 8:45 AM: [~smk] could you retry to attach the patch? The current one seems empty. You can also create a Git PR.
[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187563#comment-17187563 ] Bruno Roustant commented on SOLR-14782: --- [~smk] could you retry to attach the patch? The current one seems empty.
[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187479#comment-17187479 ] Bruno Roustant commented on SOLR-14782: --- Looking into this.
[jira] [Assigned] (SOLR-14782) QueryElevationComponent does not handle escaped query terms
[ https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant reassigned SOLR-14782: - Assignee: Bruno Roustant
[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
[ https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176316#comment-17176316 ] Bruno Roustant commented on LUCENE-9455: +1 > ExitableTermsEnum (in ExitableDirectoryReader) should sample next() > --- > > Key: LUCENE-9455 > URL: https://issues.apache.org/jira/browse/LUCENE-9455 > Project: Lucene - Core > Issue Type: Improvement > Components: core/other > Reporter: David Smiley > Priority: Major > > ExitableTermsEnum calls "checkAndThrow" on *every* call to next(). This is > too expensive; it should sample. I observed Elasticsearch uses the same > approach; I think Lucene would benefit from this: > https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151 > CC [~jimczi]
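The sampling idea behind LUCENE-9455 can be sketched as follows. The class and method names are illustrative, not Lucene's actual code: the point is checking the clock only every N calls instead of on every next(), with a power-of-two interval so the modulo is a cheap bit mask.

```java
// Sketch of sampled timeout checking for a hot per-term loop: instead of
// consulting the clock on every call, only check every SAMPLE_INTERVAL calls.
public class SampledTimeoutCheck {
    private static final int SAMPLE_INTERVAL = 256; // must be a power of two
    private static final int SAMPLE_MASK = SAMPLE_INTERVAL - 1;

    private final long deadlineMillis;
    private int calls;
    int clockChecks; // exposed only for the demo below

    SampledTimeoutCheck(long timeoutMillis) {
        this.deadlineMillis = System.currentTimeMillis() + timeoutMillis;
    }

    /** Called from each next(); throws if the (sampled) deadline has passed. */
    void checkSampled() {
        if ((calls++ & SAMPLE_MASK) == 0) {
            clockChecks++;
            if (System.currentTimeMillis() > deadlineMillis) {
                throw new RuntimeException("Time limit exceeded");
            }
        }
    }

    public static void main(String[] args) {
        SampledTimeoutCheck check = new SampledTimeoutCheck(60_000);
        for (int i = 0; i < 10_000; i++) {
            check.checkSampled();
        }
        // 10,000 calls, but only ceil(10,000 / 256) = 40 clock reads.
        System.out.println(check.clockChecks); // 40
    }
}
```

The trade-off is that a timeout is detected up to SAMPLE_INTERVAL - 1 calls late, which is acceptable since the checks guard long scans, not single calls.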
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175583#comment-17175583 ] Bruno Roustant commented on LUCENE-9379: [~Raji] maybe a better approach would be to have one tenant per collection, but you might have so many tenants that the performance with many collections is poor? If this is the case, then I think the root problem is the performance with many collections. Without the composite id router you could use OS encryption per collection. > Directory based approach for index encryption > - > > Key: LUCENE-9379 > URL: https://issues.apache.org/jira/browse/LUCENE-9379 > Project: Lucene - Core > Issue Type: New Feature > Reporter: Bruno Roustant > Assignee: Bruno Roustant > Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > +Important+: This Lucene Directory wrapper approach is to be considered only > if an OS level encryption is not possible. OS level encryption better fits > Lucene usage of the OS cache, and thus is more performant. > But there are some use-cases where OS level encryption is not possible. This > Jira issue was created to address those. > > > The goal is to provide optional encryption of the index, with a scope limited > to an encryptable Lucene Directory wrapper. > Encryption is at rest on disk, not in memory. > This simple approach should fit any Codec as it would be orthogonal, without > modifying APIs as much as possible. > Use a standard encryption method. Limit perf/memory impact as much as > possible. > Determine how callers provide encryption keys. They must not be stored on > disk.
[jira] [Updated] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-9379: --- Description: +Important+: This Lucene Directory wrapper approach is to be considered only if an OS level encryption is not possible. OS level encryption better fits Lucene usage of OS cache, and thus is more performant. But there are some use-case where OS level encryption is not possible. This Jira issue was created to address those. The goal is to provide optional encryption of the index, with a scope limited to an encryptable Lucene Directory wrapper. Encryption is at rest on disk, not in memory. This simple approach should fit any Codec as it would be orthogonal, without modifying APIs as much as possible. Use a standard encryption method. Limit perf/memory impact as much as possible. Determine how callers provide encryption keys. They must not be stored on disk. was: The goal is to provide optional encryption of the index, with a scope limited to an encryptable Lucene Directory wrapper. Encryption is at rest on disk, not in memory. This simple approach should fit any Codec as it would be orthogonal, without modifying APIs as much as possible. Use a standard encryption method. Limit perf/memory impact as much as possible. Determine how callers provide encryption keys. They must not be stored on disk.
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156566#comment-17156566 ] Bruno Roustant commented on LUCENE-9379: I'm going to pause my work on this for some time, until comments are added here that share use-cases where OS level encryption is not possible. If you can use OS level encryption, do so; it will be faster. If not, share your use-case here.
[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips
[ https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151951#comment-17151951 ] Bruno Roustant commented on LUCENE-9356: [8.6 release manager] Is this issue resolved [~jpountz]? I'm checking to prepare the 8.6 RC tomorrow. > Add tests for corruptions caused by byte flips > -- > > Key: LUCENE-9356 > URL: https://issues.apache.org/jira/browse/LUCENE-9356 > Project: Lucene - Core > Issue Type: Test > Reporter: Adrien Grand > Priority: Minor > Fix For: 8.6 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > We already have tests that check that file truncation and modification of the index > headers are caught correctly. I'd like to add another test that flipping a > byte in a way that modifies the checksum of the file is always caught > gracefully by Lucene.
[jira] [Resolved] (LUCENE-9191) Fix linefiledocs compression or replace in tests
[ https://issues.apache.org/jira/browse/LUCENE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved LUCENE-9191. Resolution: Fixed [8.6 release manager] It seems to me this issue is resolved. I'm changing its status for the 8.6 RC. > Fix linefiledocs compression or replace in tests > > > Key: LUCENE-9191 > URL: https://issues.apache.org/jira/browse/LUCENE-9191 > Project: Lucene - Core > Issue Type: Task > Reporter: Robert Muir > Assignee: Michael McCandless > Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-9191.patch, LUCENE-9191.patch, LUCENE-9191.patch > > > LineFileDocs(random) is very slow, even to open. It does a very slow "random > skip" through a gzip-compressed file. > For the analyzers tests, in LUCENE-9186 I simply removed its usage, since > TestUtil.randomAnalysisString is superior, and fast. But we should address > other tests using it, since LineFileDocs(random) is slow! > I think it is also the case that every Lucene test has probably tested every > LineFileDocs line many times now, whereas randomAnalysisString will invent > new ones. > Alternatively, we could "fix" LineFileDocs(random), e.g. special compression > options (in blocks)... deflate supports such stuff. But it would make it even > hairier than it is now.
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151920#comment-17151920 ] Bruno Roustant commented on LUCENE-9379: I tested with FST ON-HEAP: we gain +15% to +20% perf on all queries. I tested my Light version of javax.crypto.Cipher. It is indeed much faster for construction and cloning, but not for the core encryption. The reason is that two internal classes in com.sun.crypto have an @HotSpotIntrinsicCandidate annotation that makes the encryption extremely fast. I tested with a hack version that takes the best of the two versions. It brings a cumulative +10% perf improvement. So as a conclusion for the perf benchmark: * An OS level encryption is best and fastest. * If it's really not possible, expect an average of -20% perf impact on most queries, -60% on multiterm queries. * If you need more, you can make the FST on-heap and expect +15% perf. * If you need more, you can use a Cipher hack to get +10% perf.
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151918#comment-17151918 ] Bruno Roustant commented on LUCENE-9379:
Task  QPS Lucene86 (StdDev)  QPS EncryptionTim (StdDev)  Pct diff
Respell  41.55 (2.7%)  10.76 (0.9%)  -74.1% ( -75% - -72%)
Fuzzy2  44.81 (9.0%)  12.00 (1.1%)  -73.2% ( -76% - -69%)
Fuzzy1  41.03 (7.3%)  16.24 (1.9%)  -60.4% ( -64% - -55%)
Wildcard  28.02 (4.0%)  14.94 (2.0%)  -46.7% ( -50% - -42%)
OrHighNotLow  747.43 (4.2%)  485.90 (3.5%)  -35.0% ( -40% - -28%)
OrNotHighMed  524.60 (4.2%)  344.06 (2.9%)  -34.4% ( -39% - -28%)
OrHighNotHigh  576.32 (5.0%)  382.60 (4.0%)  -33.6% ( -40% - -25%)
OrHighNotMed  553.85 (4.1%)  371.73 (3.4%)  -32.9% ( -38% - -26%)
MedTerm  1116.53 (3.6%)  766.39 (2.6%)  -31.4% ( -36% - -26%)
LowTerm  1376.31 (4.2%)  947.48 (3.0%)  -31.2% ( -36% - -25%)
OrNotHighLow  492.68 (4.7%)  342.05 (4.7%)  -30.6% ( -38% - -22%)
AndHighLow  482.97 (3.8%)  342.18 (3.4%)  -29.2% ( -34% - -22%)
OrHighLow  410.23 (3.7%)  294.38 (3.8%)  -28.2% ( -34% - -21%)
HighTerm  971.63 (5.3%)  701.77 (3.2%)  -27.8% ( -34% - -20%)
OrNotHighHigh  493.99 (5.1%)  358.95 (3.9%)  -27.3% ( -34% - -19%)
LowPhrase  286.03 (2.9%)  246.04 (2.8%)  -14.0% ( -19% - -8%)
HighPhrase  290.25 (3.3%)  252.54 (3.4%)  -13.0% ( -18% - -6%)
Prefix3  51.36 (4.8%)  45.20 (4.1%)  -12.0% ( -19% - -3%)
AndHighMed  113.34 (4.0%)  105.77 (4.0%)  -6.7% ( -14% - 1%)
MedSloppyPhrase  79.83 (3.5%)  74.78 (3.6%)  -6.3% ( -13% - 0%)
HighTermDayOfYearSort  63.32 (13.3%)  59.34 (14.6%)  -6.3% ( -30% - 24%)
HighTermTitleBDVSort  86.16 (10.3%)  81.63 (10.0%)  -5.3% ( -23% - 16%)
LowSpanNear  58.07 (3.1%)  55.13 (3.2%)  -5.1% ( -10% - 1%)
AndHighHigh  44.58 (4.1%)  42.92 (4.2%)  -3.7% ( -11% - 4%)
OrHighMed  56.53 (4.4%)  54.65 (4.1%)  -3.3% ( -11% - 5%)
BrowseDateTaxoFacets  1.54 (4.6%)  1.50 (5.2%)  -2.5% ( -11% - 7%)
HighTermMonthSort  18.51 (10.5%)  18.06 (10.1%)  -2.4% ( -20% - 20%)
BrowseDayOfYearTaxoFacets  1.53 (4.7%)  1.49 (5.3%)  -2.3% ( -11% - 8%)
BrowseMonthTaxoFacets  1.77 (3.5%)  1.74 (4.2%)  -2.1% ( -9% - 5%)
HighSpanNear  12.75 (3.6%)  12.50 (4.1%)  -2.0% ( -9% - 5%)
MedPhrase  107.89 (3.2%)  106.01 (3.9%)  -1.7% ( -8% - 5%)
HighSloppyPhrase  12.86 (4.0%)  12.71 (4.7%)  -1.2% ( -9% - 7%)
MedSpanNear  11.76 (3.1%)  11.62 (3.4%)  -1.1% ( -7% - 5%)
HighIntervalsOrdered  13.61 (3.2%)  13.46 (3.3%)  -1.1% ( -7% - 5%)
OrHighHigh  11.12 (3.7%)  11.12 (4.1%)  -0.1% ( -7% - 8%)
BrowseMonthSSDVFacets  4.28 (3.9%)  4.29 (3.9%)  0.2% ( -7% - 8%)
BrowseDayOfYearSSDVFacets  3.82 (3.7%)  3.84 (3.4%)  0.3% ( -6% - 7%)
IntNRQ  25.54 (3.1%)  26.34 (3.4%)  3.1% ( -3% - 9%)
PKLookup  174.98 (3.0%)  183.78 (4.5%)  5.0% ( -2% - 12%)
LowSloppyPhrase  6.29 (3.5%)  6.89 (4.5%)  9.6% ( 1% - 18%)
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151915#comment-17151915 ] Bruno Roustant commented on LUCENE-9379:

I ran the benchmarks to measure the perf impact of this IndexInput-level encryption on the PostingsFormat (luceneutil on wikimediumall). When encrypting only the terms file, FST file and metadata file (.tim .tip .tmd), but not doc ids nor postings:
* Most queries run between -0% and -35%.
* Wildcard: -47%.
* Fuzzy/Respell: between -60% and -74%.

It is possible to encrypt all files, but the perf drops considerably: -60% for most queries, -90% for fuzzy queries.
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151914#comment-17151914 ] Bruno Roustant commented on LUCENE-9379:

[~rcmuir] makes an important callout in the PR. A better approach is to leverage OS encryption at the filesystem level, because it works with the OS filesystem cache: the cached pages are decrypted in the cache. So whenever it is possible, we should use OS-level encryption. OS filesystem encryption allows encrypting differently per directory/file, and some implementations allow managing multiple keys. But OS-level encryption is not always possible. The example I can think of is running on compute engines in a public cloud. In this case we don't have access to OS-level encryption (there is one, but we cannot manage the keys). So this Jira issue proposes a solution for the case where we cannot use OS-level encryption and we need to manage multiple keys. This should be stated clearly in the doc/javadoc. It is sub-optimal because it has to decrypt each time it accesses a cached IO page, so expect more performance impact.
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149694#comment-17149694 ] Bruno Roustant commented on LUCENE-9379:

Watchers, I need your help. I need to know how you would use the encryption, and more precisely how you would provide the keys. Is my approach of using either an EncryptingDirectory (in the PR, look at SimpleEncryptingDirectory) or a custom Codec (in the PR, look at EncryptingCodec) appropriate for your use-case? Note that both SimpleEncryptingDirectory and EncryptingCodec are only in test packages, as I expect users to write some custom code to use encryption. If you have an idea for standard code that could be added to make encryption easy, please share it here.
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149691#comment-17149691 ] Bruno Roustant commented on LUCENE-9379:

I updated the PR. It is now functional and complete, with javadoc. There should be no perf issue anymore, because I replaced javax.crypto.Cipher with much lighter code that is strictly equivalent: encryption/decryption produces the same output (verified randomly by 3 different tests). For reviewers: there are 33 changed files in the PR but only 10 source classes; the others are for tests. Look for the classes in the store package (e.g. EncryptingDirectory, EncryptingIndexOutput, EncryptingIndexInput) and the new util.crypto package (e.g. AesCtrEncrypter). All tests now pass when enabling the encryption with a test codec or a test directory.

Next step:
* Run the luceneutil benchmark to evaluate the perf impact.
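The key property that makes an AES/CTR-based EncryptingIndexInput able to seek and slice cheaply is that the counter block for any file position can be computed directly: it is the IV plus the 128-bit block offset. The sketch below (hypothetical demo class, not the PR's AesCtrEncrypter, using only the JDK's javax.crypto) illustrates decrypting from an arbitrary block-aligned offset without reading the preceding bytes:

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Demo of the AES/CTR property enabling random-access decryption:
 *  the counter block for file offset p is iv + (p / 16), mod 2^128. */
public class CtrRandomAccessDemo {

  static final int AES_BLOCK = 16;

  /** Computes the CTR counter block for a block offset: big-endian iv + blockOffset (mod 2^128). */
  static byte[] counterAt(byte[] iv, long blockOffset) {
    BigInteger c = new BigInteger(1, iv).add(BigInteger.valueOf(blockOffset));
    byte[] raw = c.toByteArray();
    byte[] out = new byte[AES_BLOCK];
    // Right-align raw into the 16-byte counter (BigInteger may emit fewer, or one extra, bytes).
    int srcPos = Math.max(0, raw.length - AES_BLOCK);
    int len = Math.min(raw.length, AES_BLOCK);
    System.arraycopy(raw, srcPos, out, AES_BLOCK - len, len);
    return out;
  }

  public static void main(String[] args) throws Exception {
    SecureRandom random = new SecureRandom();
    byte[] key = new byte[16];
    byte[] iv = new byte[16];
    random.nextBytes(key);
    random.nextBytes(iv);
    byte[] clear = new byte[10 * AES_BLOCK];
    random.nextBytes(clear);
    SecretKeySpec keySpec = new SecretKeySpec(key, "AES");

    // Encrypt the whole "file" sequentially.
    Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
    enc.init(Cipher.ENCRYPT_MODE, keySpec, new IvParameterSpec(iv));
    byte[] encrypted = enc.doFinal(clear);

    // Random access: decrypt starting at block 4 only, without touching blocks 0-3.
    long blockOffset = 4;
    Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
    dec.init(Cipher.DECRYPT_MODE, keySpec, new IvParameterSpec(counterAt(iv, blockOffset)));
    byte[] tail = dec.doFinal(
        Arrays.copyOfRange(encrypted, (int) (blockOffset * AES_BLOCK), encrypted.length));

    System.out.println(Arrays.equals(tail,
        Arrays.copyOfRange(clear, (int) (blockOffset * AES_BLOCK), clear.length)));
  }
}
```

This also shows why Cipher creation cost matters here: every seek or slice needs a cipher re-initialized at a new counter, which is what motivated the lighter AesCtrEncrypter.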
[jira] [Updated] (SOLR-14537) Improve performance of ExportWriter
[ https://issues.apache.org/jira/browse/SOLR-14537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-14537: -- Fix Version/s: (was: 8.6) > Improve performance of ExportWriter > --- > > Key: SOLR-14537 > URL: https://issues.apache.org/jira/browse/SOLR-14537 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Export Writer >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Retrieving, sorting and writing out documents in {{ExportWriter}} are three > aspects of the /export handler that can be further optimized. > SOLR-14470 introduced some level of caching in {{StringValue}}. Further > options for caching and speedups should be explored. > Currently the sort/retrieve and write operations are done sequentially, but > they could be parallelized, considering that they block on different channels > - the first is index reading & CPU bound, the other is bound by the receiving > end because it uses blocking IO. The sorting and retrieving of values could > be done in parallel with the operation of writing out the current batch of > results. > One possible approach here would be to use "double buffering" where one > buffered batch that is ready (already sorted and retrieved) is being written > out, while the other batch is being prepared in a background thread, and when > both are done the buffers are swapped. This wouldn't complicate the current > code too much but it should instantly give up to 2x higher throughput. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
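The double-buffering idea described above can be sketched in a few lines. This is a minimal illustration with hypothetical names, not Solr's actual ExportWriter API: one batch is prepared (sorted/retrieved) in a background thread while the previously prepared batch is written out, and the roles swap each iteration.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Minimal double-buffering sketch: prepare the next batch in a background
 *  thread while the current batch is being written out. */
public class DoubleBufferedWriter<T> {

  private final ExecutorService preparer = Executors.newSingleThreadExecutor();

  /** Streams batches from prepareNext (returns null when exhausted) into write. */
  public void run(Supplier<List<T>> prepareNext, Consumer<List<T>> write) throws Exception {
    try {
      Future<List<T>> pending = preparer.submit(prepareNext::get);
      while (true) {
        List<T> ready = pending.get();                 // wait for the prepared batch
        if (ready == null) break;                      // no more batches
        pending = preparer.submit(prepareNext::get);   // prepare the next batch...
        write.accept(ready);                           // ...while writing the current one
      }
    } finally {
      preparer.shutdown();
    }
  }
}
```

Since preparing is index-reading/CPU bound and writing blocks on the receiving end, overlapping the two can approach the 2x throughput mentioned above when the two phases take comparable time.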
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143886#comment-17143886 ] Bruno Roustant commented on LUCENE-9379:

First PR, functional but incomplete. The idea of using a pool of Ciphers does not work in Lucene. To run the tests, two options:
* test -Dtests.codec=Encrypting
Executes the tests with the EncryptingCodec in test-framework. Currently it encrypts a delegate PostingsFormat. This option shows how to provide the encryption key depending on the SegmentInfo.
* test -Dtests.directory=org.apache.lucene.codecs.encrypting.SimpleEncryptingDirectory
Executes the tests with the SimpleEncryptingDirectory in test-framework. This option is the simplest; it shows how to provide the encryption key as a constant (could be a property) or depending only on the name of the file to encrypt (no SegmentInfo).

There is a performance issue because of too many new Ciphers when slicing IndexInput. javax.crypto.Cipher is heavyweight to create and is stateful. I tried a CipherPool, but there are many cases where we need lots of slices of the IndexInput, so we have to create lots of new stateful Ciphers. The pool turns out to be a no-go: there are too many Ciphers in it.

TODO:
* Find a lighter alternative to Cipher, if it exists.
* Fix a couple of tests still failing because of unclosed IndexOutput.
[jira] [Comment Edited] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136783#comment-17136783 ] Bruno Roustant edited comment on LUCENE-9379 at 6/23/20, 12:51 PM: --- So I plan to implement an EncryptingDirectory extending FilterDirectory. +Encryption method:+ AES CTR (counter) * This mode is approved by NIST. ([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29]) * AES encryption has the same size as the original clear text (no padding). So we can use the same file pointers. * CTR mode allows random access to encrypted blocks (128 bits blocks). * IV (initialisation vector) must be random, and is stored at the beginning of the encrypted file because it can be public. No need to repeat the IV for each block (less disk impact compared to CBC mode). * It is appropriate to encrypt streams. +API:+ I don’t anticipate any API change. +How to provide encryption keys:+ EncryptingDirectory would require a delegate Directory, an encryption key supplier, and a Cipher pool (for performance). For the callers to pass the encryption keys, I see two ways: 1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates EncryptingDirectory. This factory is able to determine the encryption key per file based on the path. It is the responsibility of this factory to access the keys (e.g. stored in safe DB, received with an admin handler, read from properties, etc). The Cipher pool is hold by the DirectoryFactory. 2- More generally the EncryptingDirectory can be created to wrap a Directory when opening a segment (e.g. in PostingsFormat/DocValuesFormat fieldsConsumer()/fieldsProducer(), in StoredFieldFormat fieldsReader()/fieldsWriter(), etc). In this case the PostingsFormat/DocValuesFormat/StoredFieldFormat extension determines the encryption key based on the SegmentInfo. A custom Codec can be created to handle encrypting formats. The Cipher pool is hold either in the Codec or in the Format. 
+Code:+ I will inspire from Apache commons-crypto CtrCryptoOutputStream, although not directly using it because it is an OutputStream while we need an IndexOutput. And we can probably simplify since we have a specific use-case compared to this lib wide usage. was (Author: broustant): So I plan to implement an EncryptingDirectory extending FilterDirectory. +Encryption method:+ AES CTR (counter) * This mode is approved by NIST. ([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29]) * AES encryption has the same size as the original clear text (though the last block is padded to 128 bits). So we can use the same file pointers. * CTR mode allows random access to encrypted blocks (128 bits blocks). * IV (initialisation vector) must be random, and is stored at the beginning of the encrypted file because it can be public. No need to repeat the IV for each block (less disk impact compared to CBC mode). * It is appropriate to encrypt streams. +API:+ I don’t anticipate any API change. +How to provide encryption keys:+ EncryptingDirectory would require a delegate Directory, an encryption key supplier, and a Cipher pool (for performance). For the callers to pass the encryption keys, I see two ways: 1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates EncryptingDirectory. This factory is able to determine the encryption key per file based on the path. It is the responsibility of this factory to access the keys (e.g. stored in safe DB, received with an admin handler, read from properties, etc). The Cipher pool is hold by the DirectoryFactory. 2- More generally the EncryptingDirectory can be created to wrap a Directory when opening a segment (e.g. in PostingsFormat/DocValuesFormat fieldsConsumer()/fieldsProducer(), in StoredFieldFormat fieldsReader()/fieldsWriter(), etc). In this case the PostingsFormat/DocValuesFormat/StoredFieldFormat extension determines the encryption key based on the SegmentInfo. 
A custom Codec can be created to handle encrypting formats. The Cipher pool is held either in the Codec or in the Format. +Code:+ I will draw inspiration from Apache commons-crypto CtrCryptoOutputStream, although not use it directly because it is an OutputStream while we need an IndexOutput. And we can probably simplify, since we have a specific use-case compared to this library's wide usage.
[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use
[ https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140505#comment-17140505 ] Bruno Roustant commented on LUCENE-9286: Maybe we miss a benchmark on FSTEnum traversal speed? Although I don't know where we could put it. > FST arc.copyOf clones BitTables and this can lead to excessive memory use > - > > Key: LUCENE-9286 > URL: https://issues.apache.org/jira/browse/LUCENE-9286 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.5 >Reporter: Dawid Weiss >Assignee: Bruno Roustant >Priority: Major > Fix For: 8.6 > > Attachments: screen-[1].png > > Time Spent: 1h 50m > Remaining Estimate: 0h > > I see a dramatic increase in the amount of memory required for construction > of (arguably large) automata. It currently OOMs with 8GB of memory consumed > for bit tables. I am pretty sure this didn't require so much memory before > (the automaton is ~50MB after construction). > Something bad happened in between. Thoughts, [~broustant], [~sokolov]? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136783#comment-17136783 ] Bruno Roustant edited comment on LUCENE-9379 at 6/16/20, 4:25 PM: -- So I plan to implement an EncryptingDirectory extending FilterDirectory. +Encryption method:+ AES CTR (counter) * This mode is approved by NIST. ([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29]) * AES encryption has the same size as the original clear text (though the last block is padded to 128 bits). So we can use the same file pointers. * CTR mode allows random access to encrypted blocks (128 bits blocks). * IV (initialisation vector) must be random, and is stored at the beginning of the encrypted file because it can be public. No need to repeat the IV for each block (less disk impact compared to CBC mode). * It is appropriate to encrypt streams. +API:+ I don’t anticipate any API change. +How to provide encryption keys:+ EncryptingDirectory would require a delegate Directory, an encryption key supplier, and a Cipher pool (for performance). For the callers to pass the encryption keys, I see two ways: 1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates EncryptingDirectory. This factory is able to determine the encryption key per file based on the path. It is the responsibility of this factory to access the keys (e.g. stored in safe DB, received with an admin handler, read from properties, etc). The Cipher pool is hold by the DirectoryFactory. 2- More generally the EncryptingDirectory can be created to wrap a Directory when opening a segment (e.g. in PostingsFormat/DocValuesFormat fieldsConsumer()/fieldsProducer(), in StoredFieldFormat fieldsReader()/fieldsWriter(), etc). In this case the PostingsFormat/DocValuesFormat/StoredFieldFormat extension determines the encryption key based on the SegmentInfo. A custom Codec can be created to handle encrypting formats. 
The Cipher pool is hold either in the Codec or in the Format. +Code:+ I will inspire from Apache commons-crypto CtrCryptoOutputStream, although not directly using it because it is an OutputStream while we need an IndexOutput. And we can probably simplify since we have a specific use-case compared to this lib wide usage. was (Author: broustant): So I plan to implement an EncryptingDirectory extending FilterDirectory. +Encryption method:+ AES CTR (counter) * This mode is approved by NIST. ([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29]) * AES encryption has the same size as the original clear text (though the last block is padded to 128 bits). So we can use the same file pointers. * CTR mode allows random access to encrypted blocks (128 bits blocks). * IV (initialisation vector) must be random, and is stored at the beginning of the encrypted file because it can be public. * It is appropriate to encrypt streams. +API:+ I don’t anticipate any API change. +How to provide encryption keys:+ EncryptingDirectory would require a delegate Directory, an encryption key supplier, and a Cipher pool (for performance). For the callers to pass the encryption keys, I see two ways: 1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates EncryptingDirectory. This factory is able to determine the encryption key per file based on the path. It is the responsibility of this factory to access the keys (e.g. stored in safe DB, received with an admin handler, read from properties, etc). The Cipher pool is hold by the DirectoryFactory. 2- More generally the EncryptingDirectory can be created to wrap a Directory when opening a segment (e.g. in PostingsFormat/DocValuesFormat fieldsConsumer()/fieldsProducer(), in StoredFieldFormat fieldsReader()/fieldsWriter(), etc). In this case the PostingsFormat/DocValuesFormat/StoredFieldFormat extension determines the encryption key based on the SegmentInfo. 
A custom Codec can be created to handle encrypting formats. The Cipher pool is held either in the Codec or in the Format. +Code:+ I will draw inspiration from Apache commons-crypto CtrCryptoOutputStream, although not use it directly because it is an OutputStream while we need an IndexOutput. And we can probably simplify, since we have a specific use-case compared to this library's wide usage.
[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption
[ https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136783#comment-17136783 ] Bruno Roustant commented on LUCENE-9379:

So I plan to implement an EncryptingDirectory extending FilterDirectory.

+Encryption method:+ AES CTR (counter)
* This mode is approved by NIST. ([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29])
* The AES/CTR ciphertext has the same size as the original clear text (though the last block is padded to 128 bits), so we can use the same file pointers.
* CTR mode allows random access to encrypted blocks (128-bit blocks).
* The IV (initialisation vector) must be random, and is stored at the beginning of the encrypted file because it can be public.
* It is appropriate for encrypting streams.

+API:+ I don't anticipate any API change.

+How to provide encryption keys:+ EncryptingDirectory would require a delegate Directory, an encryption key supplier, and a Cipher pool (for performance). For the callers to pass the encryption keys, I see two ways:
1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates the EncryptingDirectory. This factory is able to determine the encryption key per file based on the path. It is the responsibility of this factory to access the keys (e.g. stored in a safe DB, received with an admin handler, read from properties, etc). The Cipher pool is held by the DirectoryFactory.
2- More generally, the EncryptingDirectory can be created to wrap a Directory when opening a segment (e.g. in PostingsFormat/DocValuesFormat fieldsConsumer()/fieldsProducer(), in StoredFieldsFormat fieldsReader()/fieldsWriter(), etc). In this case the PostingsFormat/DocValuesFormat/StoredFieldsFormat extension determines the encryption key based on the SegmentInfo. A custom Codec can be created to handle encrypting formats. The Cipher pool is held either in the Codec or in the Format.

+Code:+ I will draw inspiration from Apache commons-crypto CtrCryptoOutputStream, although not use it directly because it is an OutputStream while we need an IndexOutput. And we can probably simplify, since we have a specific use-case compared to this library's wide usage.
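The file layout described in the plan above (a public random IV stored in clear at the start of the file, followed by an AES/CTR ciphertext of exactly the clear text's length) can be sketched with the JDK alone. This is a hypothetical demo, not the PR's EncryptingIndexOutput:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Sketch of the encrypted-file layout: [16-byte clear IV][AES/CTR ciphertext].
 *  The ciphertext has the same length as the clear text, so file pointers
 *  only shift by the fixed 16-byte header. */
public class CtrFileLayoutDemo {

  public static byte[] encrypt(byte[] key, byte[] clear) throws Exception {
    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);          // IV must be random per file
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    ByteArrayOutputStream file = new ByteArrayOutputStream();
    file.write(iv);                            // IV is public: stored in clear at the file start
    file.write(cipher.doFinal(clear));         // CTR is a stream mode: no padding appended
    return file.toByteArray();
  }

  public static byte[] decrypt(byte[] key, byte[] file) throws Exception {
    byte[] iv = Arrays.copyOfRange(file, 0, 16); // read the IV back from the header
    Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
    cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
    return cipher.doFinal(Arrays.copyOfRange(file, 16, file.length));
  }

  public static void main(String[] args) throws Exception {
    byte[] key = new byte[16];
    new SecureRandom().nextBytes(key);
    byte[] clear = "hello lucene".getBytes(StandardCharsets.UTF_8);
    byte[] file = encrypt(key, clear);
    System.out.println(file.length == clear.length + 16); // only the IV adds bytes
    System.out.println(Arrays.equals(decrypt(key, file), clear));
  }
}
```

Storing the IV per file (rather than per block, as CBC would require) keeps the disk overhead to a single 16-byte header, which is why the plan prefers CTR.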
[jira] [Resolved] (LUCENE-9397) UniformSplit supports encodable fields metadata
[ https://issues.apache.org/jira/browse/LUCENE-9397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant resolved LUCENE-9397. Fix Version/s: 8.6 Resolution: Fixed Thanks [~dsmiley] for the review. > UniformSplit supports encodable fields metadata > --- > > Key: LUCENE-9397 > URL: https://issues.apache.org/jira/browse/LUCENE-9397 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > Fix For: 8.6 > > Time Spent: 20m > Remaining Estimate: 0h > > UniformSplit already supports custom encoding for term blocks. This is an > extension to also support encodable fields metadata. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9397) UniformSplit supports encodable fields metadata
[ https://issues.apache.org/jira/browse/LUCENE-9397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133028#comment-17133028 ] Bruno Roustant commented on LUCENE-9397:

Currently we use the encoder interface to encrypt term blocks, FST and fields metadata. We don't attach more data. However, I'm going to work on LUCENE-9379 for a directory-based approach to encryption that would not be tied to a postings format. Eventually we would like to move to that solution.