[jira] [Commented] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible

2022-04-25 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527666#comment-17527666
 ] 

Bruno Roustant commented on LUCENE-8836:


Thanks [~jpountz] for this simplified improvement!

I agree to mark this issue as resolved.

> Optimize DocValues TermsDict to continue scanning from the last position when 
> possible
> --
>
> Key: LUCENE-8836
> URL: https://issues.apache.org/jira/browse/LUCENE-8836
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Labels: docValues, optimization
> Fix For: 9.2
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Lucene80DocValuesProducer.TermsDict is used to look up either a term or a 
> term ordinal.
> Currently it lacks the optimization that FSTEnum has: the ability to continue 
> a sequential scan from where the last lookup left off in the IndexInput. 
> For sparse lookups (searching only a few terms or ordinals) this is not an 
> issue, but for multiple lookups in a row this optimization can avoid 
> re-scanning all the terms from the block start (since they are delta encoded).
> This patch proposes the optimization.
> To estimate the gain, we ran 3 Lucene tests while counting the seeks and the 
> term reads in the IndexInput, with and without the optimization:
> TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term 
> reads.
> TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
> TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 
> 82% term reads.
> In some cases, when scanning many terms in lexicographical order, the 
> optimization saves a lot. In other cases, when only looking up a few sparse 
> terms, it brings no improvement, but it does not penalize lookups either. It 
> seems worthwhile to always have it.
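The idea can be illustrated with a toy model of a delta-encoded terms dictionary. This is a hedged sketch, not Lucene's Lucene80DocValuesProducer.TermsDict: the class and member names (ToyTermsDict, seekExact, decodeOps) are illustrative, and real Lucene blocks use byte-level prefix compression rather than Java strings.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy sketch of the LUCENE-8836 idea: terms in a block are prefix/delta
 * encoded, so decoding the term at ordinal N normally requires scanning from
 * the block start. Caching the last decoded position lets a later lookup for a
 * larger ordinal continue scanning forward instead of restarting.
 */
class ToyTermsDict {
  private final List<String> suffixes;  // suffix of each term after the shared prefix
  private final int[] prefixLens;       // length of the prefix shared with the previous term
  private long lastOrd = -1;            // state of the last lookup (the optimization)
  private String lastTerm = null;
  int decodeOps = 0;                    // counts term decodes, mirroring the "term reads" metric

  ToyTermsDict(List<String> sortedTerms) {
    suffixes = new ArrayList<>();
    prefixLens = new int[sortedTerms.size()];
    String prev = "";
    for (int i = 0; i < sortedTerms.size(); i++) {
      String t = sortedTerms.get(i);
      int p = sharedPrefix(prev, t);
      prefixLens[i] = p;
      suffixes.add(t.substring(p)); // store only the suffix (delta encoding)
      prev = t;
    }
  }

  /** Decodes the term at the given ordinal, resuming from the last position when possible. */
  String seekExact(long ord) {
    long start;
    String term;
    if (lastOrd >= 0 && ord >= lastOrd) {
      start = lastOrd; term = lastTerm;  // resume the scan: the optimization
    } else {
      start = -1; term = "";             // restart from the block start
    }
    for (long i = start + 1; i <= ord; i++) {
      int idx = (int) i;
      term = term.substring(0, prefixLens[idx]) + suffixes.get(idx);
      decodeOps++;
    }
    lastOrd = ord;
    lastTerm = term;
    return term;
  }

  private static int sharedPrefix(String a, String b) {
    int n = Math.min(a.length(), b.length()), i = 0;
    while (i < n && a.charAt(i) == b.charAt(i)) i++;
    return i;
  }
}
```

With terms {"aa", "ab", "abc", "b"}, looking up ordinal 1 then ordinal 3 decodes four terms in total; without the cached position, the second lookup would restart from ordinal 0 and decode six.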



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10225) Improve IntroSelector with 3-way partitioning

2021-11-17 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-10225.
-
Fix Version/s: 9.1
   Resolution: Fixed

Thanks Dawid and Adrien!

> Improve IntroSelector with 3-way partitioning
> -
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: 9.1
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with 
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the 
> same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), 
> so there is no real gain in falling back to a slower median-of-medians technique 
> as an introspective protection (like the existing implementation does). 
> Instead we can simply shuffle the sub-range if we exceed the recursive max 
> depth (this does not change the speed as this intro-protective mechanism 
> almost never happens - maybe only in adversarial cases).
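The shape of such a selector can be sketched as follows. This is a hedged illustration, not Lucene's IntroSelector: it uses a plain Dutch-national-flag 3-way partition and a median-of-3 pivot instead of the full Bentley-McIlroy scheme and Tukey's ninther, and the recursion budget is an assumption.

```java
import java.util.Random;

/** Sketch of a 3-way quickselect with shuffle-based intro-protection. */
class ToySelector {
  private static final Random RND = new Random(42);

  /** Rearranges a[from..to) so that a[k] holds the value it would have after a full sort. */
  static void select(int[] a, int from, int to, int k) {
    // recursion budget ~ 2*log2(n); past it we shuffle rather than switch algorithms
    int budget = 2 * (32 - Integer.numberOfLeadingZeros(Math.max(1, to - from)));
    while (to - from > 1) {
      if (budget-- < 0) shuffle(a, from, to); // intro-protection against adversarial inputs
      int mid = (from + to) >>> 1;
      int pivot = median3(a[from], a[mid], a[to - 1]);
      // 3-way partition: [from,lt) < pivot, [lt,i) == pivot, [gt,to) > pivot
      int lt = from, i = from, gt = to;
      while (i < gt) {
        if (a[i] < pivot) swap(a, lt++, i++);
        else if (a[i] > pivot) swap(a, i, --gt);
        else i++;
      }
      if (k < lt) to = lt;          // k falls in the "< pivot" partition
      else if (k >= gt) from = gt;  // k falls in the "> pivot" partition
      else return;                  // a[k] == pivot: the 3-way payoff, done immediately
    }
  }

  static int median3(int x, int y, int z) {
    return Math.max(Math.min(x, y), Math.min(Math.max(x, y), z));
  }

  static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

  static void shuffle(int[] a, int from, int to) {
    for (int i = to - 1; i > from; i--) swap(a, i, from + RND.nextInt(i - from + 1));
  }
}
```

The early return when k lands in the equal-to-pivot partition is where 3-way partitioning pays off on low-cardinality data: whole runs of duplicates are settled in one pass.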






[jira] [Commented] (LUCENE-10225) Improve IntroSelector with 3-way partitioning

2021-11-17 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445045#comment-17445045
 ] 

Bruno Roustant commented on LUCENE-10225:
-

I'm a bit confused. I put this change in the 9.1 section in CHANGES, but 
actually should it be in the 9.0 section [~jpountz]?

> Improve IntroSelector with 3-way partitioning
> -
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with 
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the 
> same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), 
> so there is no real gain in falling back to a slower median-of-medians technique 
> as an introspective protection (like the existing implementation does). 
> Instead we can simply shuffle the sub-range if we exceed the recursive max 
> depth (this does not change the speed as this intro-protective mechanism 
> almost never happens - maybe only in adversarial cases).






[jira] [Comment Edited] (LUCENE-10225) Improve IntroSelector with 3-way partitioning

2021-11-13 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443195#comment-17443195
 ] 

Bruno Roustant edited comment on LUCENE-10225 at 11/13/21, 10:25 PM:
-

With the addition of the adaptive top-k algorithm triggered when k is close to 
from or last, we get significant perf gain (x3-x4) when (k - from <= 20) or 
(last - k <= 20):
{noformat}
RANDOM
  IntroSelector  ...  485  317  371  449  299  454  333  345  190  311
  IntroSelector2 ...   85   86   86   97   90   87   87   86   87   91
RANDOM_LOW_CARDINALITY
  IntroSelector  ...  475  483  239  445  234  236  455  460  365  508
  IntroSelector2 ...  113  114  115  114  112  113  114  114  113  114
RANDOM_MEDIUM_CARDINALITY
  IntroSelector  ...  529  287  448  374  347  392  356  412  182  408
  IntroSelector2 ...   88   85   86   87   87   87   86   86   88   85
ASCENDING
  IntroSelector  ...  157  150  148  146  146  147  146  147  146  146
  IntroSelector2 ...   80   80   82   82   82   83   81   82   81   82
DESCENDING
  IntroSelector  ...  250  250  246  246  254  251  255  247  247  249
  IntroSelector2 ...   84   84   84   85   83   82   80   81   81   82
STRICTLY_DESCENDING
  IntroSelector  ...  239  237  239  238  238  250  240  239  240  240
  IntroSelector2 ...   82   83   82   81   82   80   82   85   84   83
ASCENDING_SEQUENCES
  IntroSelector  ...  209  217  145  125  185  142  131  172  240  146
  IntroSelector2 ...   85   86   86   84   84   82   83   83   83   81
MOSTLY_ASCENDING
  IntroSelector  ...  151  155  150  150  154  147  150  154  154  154
  IntroSelector2 ...   82   82   81   81   81   81   82   82  104   85
{noformat}
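The boundary trigger described in this comment can be sketched as follows. This is a hedged illustration: the 20-element threshold mirrors the (k - from <= 20) condition above, but the partial-selection strategy, names, and the general-case stand-in are assumptions, not Lucene's implementation.

```java
import java.util.Arrays;

/** Sketch of an adaptive top-k selection: when k is near either end of the
 * range, a direct partial selection beats a full quickselect. */
class ToyAdaptiveSelect {
  static final int SMALL = 20; // mirrors the (k - from <= 20) trigger

  /** Moves the k-th smallest element of a[from..to) into a[k]. */
  static void select(int[] a, int from, int to, int k) {
    if (k - from < SMALL) {
      // k is near the low end: settle positions from..k by repeated minimum selection
      for (int i = from; i <= k; i++) {
        int min = i;
        for (int j = i + 1; j < to; j++) if (a[j] < a[min]) min = j;
        int t = a[i]; a[i] = a[min]; a[min] = t;
      }
    } else if ((to - 1) - k < SMALL) {
      // k is near the high end: settle positions to-1 down to k by repeated maximum selection
      for (int i = to - 1; i >= k; i--) {
        int max = i;
        for (int j = from; j < i; j++) if (a[j] > a[max]) max = j;
        int t = a[i]; a[i] = a[max]; a[max] = t;
      }
    } else {
      // general case: a full sort stands in for the quickselect path in this sketch
      Arrays.sort(a, from, to);
    }
  }
}
```

The boundary paths do O((k - from) * n) work, which for a constant-bounded distance is linear and avoids partitioning entirely, matching the reported gains when k hugs either end.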


was (Author: broustant):
With the addition of the adaptive top-k algorithm triggered when k is close to 
from or last, we get significant perf gain when (k - from <= 20) or (last - k 
<= 20):
{noformat}
RANDOM
  IntroSelector  ...  485  317  371  449  299  454  333  345  190  311
  IntroSelector2 ...   85   86   86   97   90   87   87   86   87   91
RANDOM_LOW_CARDINALITY
  IntroSelector  ...  475  483  239  445  234  236  455  460  365  508
  IntroSelector2 ...  113  114  115  114  112  113  114  114  113  114
RANDOM_MEDIUM_CARDINALITY
  IntroSelector  ...  529  287  448  374  347  392  356  412  182  408
  IntroSelector2 ...   88   85   86   87   87   87   86   86   88   85
ASCENDING
  IntroSelector  ...  157  150  148  146  146  147  146  147  146  146
  IntroSelector2 ...   80   80   82   82   82   83   81   82   81   82
DESCENDING
  IntroSelector  ...  250  250  246  246  254  251  255  247  247  249
  IntroSelector2 ...   84   84   84   85   83   82   80   81   81   82
STRICTLY_DESCENDING
  IntroSelector  ...  239  237  239  238  238  250  240  239  240  240
  IntroSelector2 ...   82   83   82   81   82   80   82   85   84   83
ASCENDING_SEQUENCES
  IntroSelector  ...  209  217  145  125  185  142  131  172  240  146
  IntroSelector2 ...   85   86   86   84   84   82   83   83   83   81
MOSTLY_ASCENDING
  IntroSelector  ...  151  155  150  150  154  147  150  154  154  154
  IntroSelector2 ...   82   82   81   81   81   81   82   82  104   85
{noformat}

> Improve IntroSelector with 3-way partitioning
> -
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with 
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the 
> same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), 
> so there is no real gain in falling back to a slower median-of-medians technique 
> as an introspective protection (like the existing implementation does). 
> Instead we can simply shuffle the sub-range if we exceed the recursive max 
> depth (this does not change the speed as this intro-protective mechanism 
> almost never happens - maybe only in adversarial cases).






[jira] [Commented] (LUCENE-10225) Improve IntroSelector with 3-way partitioning

2021-11-13 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17443195#comment-17443195
 ] 

Bruno Roustant commented on LUCENE-10225:
-

With the addition of the adaptive top-k algorithm triggered when k is close to 
from or last, we get significant perf gain when (k - from <= 20) or (last - k 
<= 20):
{noformat}
RANDOM
  IntroSelector  ...  485  317  371  449  299  454  333  345  190  311
  IntroSelector2 ...   85   86   86   97   90   87   87   86   87   91
RANDOM_LOW_CARDINALITY
  IntroSelector  ...  475  483  239  445  234  236  455  460  365  508
  IntroSelector2 ...  113  114  115  114  112  113  114  114  113  114
RANDOM_MEDIUM_CARDINALITY
  IntroSelector  ...  529  287  448  374  347  392  356  412  182  408
  IntroSelector2 ...   88   85   86   87   87   87   86   86   88   85
ASCENDING
  IntroSelector  ...  157  150  148  146  146  147  146  147  146  146
  IntroSelector2 ...   80   80   82   82   82   83   81   82   81   82
DESCENDING
  IntroSelector  ...  250  250  246  246  254  251  255  247  247  249
  IntroSelector2 ...   84   84   84   85   83   82   80   81   81   82
STRICTLY_DESCENDING
  IntroSelector  ...  239  237  239  238  238  250  240  239  240  240
  IntroSelector2 ...   82   83   82   81   82   80   82   85   84   83
ASCENDING_SEQUENCES
  IntroSelector  ...  209  217  145  125  185  142  131  172  240  146
  IntroSelector2 ...   85   86   86   84   84   82   83   83   83   81
MOSTLY_ASCENDING
  IntroSelector  ...  151  155  150  150  154  147  150  154  154  154
  IntroSelector2 ...   82   82   81   81   81   81   82   82  104   85
{noformat}

> Improve IntroSelector with 3-way partitioning
> -
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with 
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the 
> same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), 
> so there is no real gain in falling back to a slower median-of-medians technique 
> as an introspective protection (like the existing implementation does). 
> Instead we can simply shuffle the sub-range if we exceed the recursive max 
> depth (this does not change the speed as this intro-protective mechanism 
> almost never happens - maybe only in adversarial cases).






[jira] (LUCENE-10225) Improve IntroSelector with 3-way partitioning

2021-11-09 Thread Bruno Roustant (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10225 ]


Bruno Roustant deleted comment on LUCENE-10225:
-

was (Author: broustant):
PR: https://github.com/apache/lucene/pull/430

> Improve IntroSelector with 3-way partitioning
> -
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The same way we improved IntroSorter, we can improve IntroSelector with 
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the 
> same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), 
> so there is no real gain in falling back to a slower median-of-medians technique 
> as an introspective protection (like the existing implementation does). 
> Instead we can simply shuffle the sub-range if we exceed the recursive max 
> depth (this does not change the speed as this intro-protective mechanism 
> almost never happens - maybe only in adversarial cases).






[jira] [Commented] (LUCENE-10225) Improve IntroSelector with 3-way partitioning

2021-11-08 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440556#comment-17440556
 ] 

Bruno Roustant commented on LUCENE-10225:
-

PR: https://github.com/apache/lucene/pull/430

> Improve IntroSelector with 3-way partitioning
> -
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>
> The same way we improved IntroSorter, we can improve IntroSelector with 
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the 
> same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), 
> so there is no real gain in falling back to a slower median-of-medians technique 
> as an introspective protection (like the existing implementation does). 
> Instead we can simply shuffle the sub-range if we exceed the recursive max 
> depth (this does not change the speed as this intro-protective mechanism 
> almost never happens - maybe only in adversarial cases).






[jira] [Commented] (LUCENE-10225) Improve IntroSelector with 3-way partitioning

2021-11-08 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17440475#comment-17440475
 ] 

Bruno Roustant commented on LUCENE-10225:
-

{code:java}
RANDOM
  IntroSelector  ...  397  422  407  448  395  394  417  390  394  406
  IntroSelector2 ...  361  357  372  368  366  363  362  361  361  371
RANDOM_LOW_CARDINALITY
  IntroSelector  ...  661  724  743  707  830  696  734  711  745  779
  IntroSelector2 ...  360  393  355  369  373  359  367  369  344  360
RANDOM_MEDIUM_CARDINALITY
  IntroSelector  ...  423  394  465  387  398  418  396  393  415  399
  IntroSelector2 ...  378  364  371  373  361  360  368  362  369  365
ASCENDING
  IntroSelector  ...  127  127  128  126  130  127  126  131  127  130
  IntroSelector2 ...  137  134  135  133  134  134  135  134  135  137
DESCENDING
  IntroSelector  ...  209  221  205  212  203  205  210  208  206  211
  IntroSelector2 ...  185  184  183  183  184  186  187  184  184  183
STRICTLY_DESCENDING
  IntroSelector  ...  213  210  208  214  207  213  213  206  209  206
  IntroSelector2 ...  184  188  183  186  184  183  182  188  184  184
ASCENDING_SEQUENCES
  IntroSelector  ...  308  320  493  460  287  374  372  423  391  380
  IntroSelector2 ...  201  216  234  218  218  211  211  203  207  236
MOSTLY_ASCENDING
  IntroSelector  ...  256  298  264  351  255  448  196  353  256  397
  IntroSelector2 ...  134  138  139  137  140  138  135  137  138  140
{code}

> Improve IntroSelector with 3-way partitioning
> -
>
> Key: LUCENE-10225
> URL: https://issues.apache.org/jira/browse/LUCENE-10225
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>
> The same way we improved IntroSorter, we can improve IntroSelector with 
> Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the 
> same.
> With this new approach, we always use medians-of-medians (Tukey's Ninther), 
> so there is no real gain in falling back to a slower median-of-medians technique 
> as an introspective protection (like the existing implementation does). 
> Instead we can simply shuffle the sub-range if we exceed the recursive max 
> depth (this does not change the speed as this intro-protective mechanism 
> almost never happens - maybe only in adversarial cases).






[jira] [Created] (LUCENE-10225) Improve IntroSelector with 3-way partitioning

2021-11-08 Thread Bruno Roustant (Jira)
Bruno Roustant created LUCENE-10225:
---

 Summary: Improve IntroSelector with 3-way partitioning
 Key: LUCENE-10225
 URL: https://issues.apache.org/jira/browse/LUCENE-10225
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Bruno Roustant


The same way we improved IntroSorter, we can improve IntroSelector with 
Bentley-McIlroy 3-way partitioning. The gain in performance is roughly the same.

With this new approach, we always use medians-of-medians (Tukey's Ninther), so 
there is no real gain in falling back to a slower median-of-medians technique as 
an introspective protection (like the existing implementation does). Instead we 
can simply shuffle the sub-range if we exceed the recursive max depth (this 
does not change the speed as this intro-protective mechanism almost never 
happens - maybe only in adversarial cases).






[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-11-03 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437876#comment-17437876
 ] 

Bruno Roustant commented on LUCENE-10196:
-

Thanks for sharing the benchmark Adrien.

I'm not sure about IntroSelector, but I suppose so. This is an exciting 
challenge :). I'll find some time to investigate.

> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: 8.11
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations depending on the strategies defined in BaseSortTestCase 
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways 
> partitioning, we can gain a significant performance improvement when sorting 
> low-cardinality lists, and with additional changes we can also improve the 
> performance for all the strategies.
> Proposed changes:
>  - Sort small ranges with insertion sort (instead of binary sort).
>  - Select the quick sort pivot with medians.
>  - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
>  - Replace the tail recursion by a loop.






[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-11-02 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437247#comment-17437247
 ] 

Bruno Roustant commented on LUCENE-10196:
-

Oh! Thank you [~jpountz]! I was wrong in the CHANGES file indeed.

> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: 8.11
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations depending on the strategies defined in BaseSortTestCase 
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways 
> partitioning, we can gain a significant performance improvement when sorting 
> low-cardinality lists, and with additional changes we can also improve the 
> performance for all the strategies.
> Proposed changes:
>  - Sort small ranges with insertion sort (instead of binary sort).
>  - Select the quick sort pivot with medians.
>  - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
>  - Replace the tail recursion by a loop.






[jira] [Resolved] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-11-01 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-10196.
-
Fix Version/s: 8.11
   Resolution: Fixed

Thanks reviewers!

> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: 8.11
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations depending on the strategies defined in BaseSortTestCase 
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways 
> partitioning, we can gain a significant performance improvement when sorting 
> low-cardinality lists, and with additional changes we can also improve the 
> performance for all the strategies.
> Proposed changes:
>  - Sort small ranges with insertion sort (instead of binary sort).
>  - Select the quick sort pivot with medians.
>  - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
>  - Replace the tail recursion by a loop.






[jira] [Updated] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-10-21 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-10196:

Description: 
I added a SorterBenchmark to evaluate the performance of the various Sorter 
implementations depending on the strategies defined in BaseSortTestCase 
(random, random-low-cardinality, ascending, descending, etc).

By changing the implementation of the IntroSorter to use a 3-ways partitioning, 
we can gain a significant performance improvement when sorting low-cardinality 
lists, and with additional changes we can also improve the performance for all 
the strategies.

Proposed changes:
 - Sort small ranges with insertion sort (instead of binary sort).
 - Select the quick sort pivot with medians.
 - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
 - Replace the tail recursion by a loop.
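The four proposed changes can be sketched together. This is a hedged illustration only: it uses a plain Dutch-national-flag partition and a median-of-3 pivot instead of Lucene's Bentley-McIlroy partitioning and ninther, and the cutoff value is an assumption, not IntroSorter's.

```java
/** Sketch of the four IntroSorter changes: insertion sort for small ranges,
 * median pivot selection, 3-way partitioning, and a loop replacing one
 * recursive call. */
class ToyIntroSorter {
  static final int INSERTION_SORT_THRESHOLD = 16; // illustrative cutoff

  static void sort(int[] a, int from, int to) {
    while (to - from > INSERTION_SORT_THRESHOLD) {
      int mid = (from + to) >>> 1;
      int pivot = median3(a[from], a[mid], a[to - 1]); // change 2: median pivot selection
      // change 3: 3-way partition into [from,lt) < pivot, [lt,gt) == pivot, [gt,to) > pivot
      int lt = from, i = from, gt = to;
      while (i < gt) {
        if (a[i] < pivot) swap(a, lt++, i++);
        else if (a[i] > pivot) swap(a, i, --gt);
        else i++;
      }
      sort(a, from, lt); // recurse on the "<" side; the "==" middle is already in place
      from = gt;         // change 4: loop on the ">" side instead of tail-recursing
    }
    insertionSort(a, from, to); // change 1: insertion sort for small ranges
  }

  static void insertionSort(int[] a, int from, int to) {
    for (int i = from + 1; i < to; i++) {
      int v = a[i], j = i - 1;
      while (j >= from && a[j] > v) { a[j + 1] = a[j]; j--; }
      a[j + 1] = v;
    }
  }

  static int median3(int x, int y, int z) {
    return Math.max(Math.min(x, y), Math.min(Math.max(x, y), z));
  }

  static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}
```

Because every element equal to the pivot lands in its final position during partitioning, low-cardinality inputs shrink very quickly, which is where the benchmark below shows the largest gain.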

  was:
I added a SorterBenchmark to evaluate the performance of the various Sorter 
implementations depending on the strategies defined in BaseSortTestCase 
(random, random-low-cardinality, ascending, descending, etc).

By changing the implementation of the IntroSorter to use a 3-ways partitioning, 
we can gain a significant performance improvement when sorting low-cardinality 
lists, and we additional changes we can also improve the performance for all 
the strategies.

Proposed changes:
- Sort small ranges with insertion sort (instead of binary sort).
- Select the quick sort pivot with medians.
- Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
- Replace the tail recursion by a loop.


> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations depending on the strategies defined in BaseSortTestCase 
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways 
> partitioning, we can gain a significant performance improvement when sorting 
> low-cardinality lists, and with additional changes we can also improve the 
> performance for all the strategies.
> Proposed changes:
>  - Sort small ranges with insertion sort (instead of binary sort).
>  - Select the quick sort pivot with medians.
>  - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
>  - Replace the tail recursion by a loop.






[jira] [Issue Comment Deleted] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-10-21 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-10196:

Comment: was deleted

(was: https://github.com/apache/lucene/pull/404)

> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations depending on the strategies defined in BaseSortTestCase 
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways 
> partitioning, we can gain a significant performance improvement when sorting 
> low-cardinality lists, and with additional changes we can also improve the 
> performance for all the strategies.
> Proposed changes:
> - Sort small ranges with insertion sort (instead of binary sort).
> - Select the quick sort pivot with medians.
> - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
> - Replace the tail recursion by a loop.






[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-10-21 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432685#comment-17432685
 ] 

Bruno Roustant commented on LUCENE-10196:
-

https://github.com/apache/lucene/pull/404

> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations depending on the strategies defined in BaseSortTestCase 
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use a 3-ways 
> partitioning, we can gain a significant performance improvement when sorting 
> low-cardinality lists, and with additional changes we can also improve the 
> performance for all the strategies.
> Proposed changes:
> - Sort small ranges with insertion sort (instead of binary sort).
> - Select the quick sort pivot with medians.
> - Partition with the fast Bentley-McIlroy 3-ways partitioning algorithm.
> - Replace the tail recursion by a loop.






[jira] [Commented] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-10-21 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432670#comment-17432670
 ] 

Bruno Roustant commented on LUCENE-10196:
-

Benchmark run to compare sorter implementations on different shapes of data:

Each value is the time to complete a run. Values in the same column can be 
compared because the same data is provided as input to the various sorters. 
Each column uses different random data. IntroSorter2 is the new modified 
version of IntroSorter.

The benchmark runs with 20K comparable entries. Comparing IntroSorter and 
IntroSorter2, we mainly observe a 5x speedup for random low cardinality, and an 
improvement for every data shape.
{noformat}
RANDOM
  IntroSorter  ...  445  445  459  458  453  458  460  465  452  451
  IntroSorter2 ...  394  403  401  400  401  398  400  404  396  399
  TimSorter... 1196 1203 1197 1206 1193 1195 1193 1204 1230 1207
  MergeSorter  ... 1462 1470 1482 1466 1463 1475 1478 1475 1466 1471
RANDOM_LOW_CARDINALITY
  IntroSorter  ...  505  513  504  490  527  499  510  512  509  525
  IntroSorter2 ...   89   84   88   88   86   90   88   89   92   88
  TimSorter...  511  513  508  508  513  512  521  511  524  516
  MergeSorter  ...  725  725  725  762  737  723  727  724  736  733
RANDOM_MEDIUM_CARDINALITY
  IntroSorter  ...  463  451  452  455  448  452  451  459  458  455
  IntroSorter2 ...  370  381  378  373  375  376  376  372  370  370
  TimSorter... 1192 1212 1197 1196 1201 1202 1196 1199 1196 1204
  MergeSorter  ... 1493 1465 1470 1480 1460 1470 1483 1464 1506 1500
ASCENDING
  IntroSorter  ...  211  205  215  213  207  206  208  214  212  211
  IntroSorter2 ...  191  188  190  193  194  191  188  187  185  188
  TimSorter...   17   18   18   18   19   19   18   17   18   19
  MergeSorter  ...   73   71   72   75   72   73   73   77   72   71
DESCENDING
  IntroSorter  ...  225  253  229  220  225  231  222  217  220  223
  IntroSorter2 ...  220  213  214  220  205  211  208  210  208  212
  TimSorter...  545  576  562  553  543  551  552  552  548  546
  MergeSorter  ...  537  537  548  538  537  536  533  530  533  545
STRICTLY_DESCENDING
  IntroSorter  ...  215  214  221  224  218  227  213  212  212  211
  IntroSorter2 ...  202  203  202  205  202  204  206  204  202  204
  TimSorter...   22   21   21   22   22   21   21   22   22   23
  MergeSorter  ...  534  531  533  527  531  529  526  527  528  527
ASCENDING_SEQUENCES
  IntroSorter  ...  370  366  361  376  367  369  358  364  379  376
  IntroSorter2 ...  234  235  231  236  234  245  242  239  239  236
  TimSorter...  686  679  745  673  694  685  673  719  682  685
  MergeSorter  ...  894  911  932  907  923  907  918  917  920  916
MOSTLY_ASCENDING
  IntroSorter  ...  284  282  282  283  285  282  278  284  283  287
  IntroSorter2 ...  254  252  249  250  255  255  249  250  252  251
  TimSorter...  233  233  230  235  232  234  234  233  228  238
  MergeSorter  ...  399  385  390  398  398  392  380  377  377  387
{noformat}

> Improve IntroSorter with 3-ways partitioning
> 
>
> Key: LUCENE-10196
> URL: https://issues.apache.org/jira/browse/LUCENE-10196
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>
> I added a SorterBenchmark to evaluate the performance of the various Sorter 
> implementations depending on the strategies defined in BaseSortTestCase 
> (random, random-low-cardinality, ascending, descending, etc).
> By changing the implementation of the IntroSorter to use 3-way 
> partitioning, we can gain a significant performance improvement when sorting 
> low-cardinality lists, and with additional changes we can also improve the 
> performance for all the strategies.
> Proposed changes:
> - Sort small ranges with insertion sort (instead of binary sort).
> - Select the quick sort pivot with medians.
> - Partition with the fast Bentley-McIlroy 3-way partitioning algorithm.
> - Replace the tail recursion with a loop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10196) Improve IntroSorter with 3-ways partitioning

2021-10-21 Thread Bruno Roustant (Jira)
Bruno Roustant created LUCENE-10196:
---

 Summary: Improve IntroSorter with 3-ways partitioning
 Key: LUCENE-10196
 URL: https://issues.apache.org/jira/browse/LUCENE-10196
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Bruno Roustant


I added a SorterBenchmark to evaluate the performance of the various Sorter 
implementations depending on the strategies defined in BaseSortTestCase 
(random, random-low-cardinality, ascending, descending, etc).

By changing the implementation of the IntroSorter to use 3-way partitioning, 
we can gain a significant performance improvement when sorting low-cardinality 
lists, and with additional changes we can also improve the performance for all 
the strategies.

Proposed changes:
- Sort small ranges with insertion sort (instead of binary sort).
- Select the quick sort pivot with medians.
- Partition with the fast Bentley-McIlroy 3-way partitioning algorithm.
- Replace the tail recursion with a loop.
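
The core of the proposal can be sketched as follows. This is a stand-alone illustration of 3-way (Dutch national flag) partitioning with the tail recursion turned into a loop, not the actual IntroSorter patch; class and method names are mine:

```java
import java.util.Arrays;

// Stand-alone sketch of 3-way partitioning, the idea behind the
// Bentley-McIlroy scheme proposed for IntroSorter. Elements equal to the
// pivot end up grouped in the middle and are excluded from further
// recursion, which is why low-cardinality inputs get the biggest win.
public class ThreeWayQuickSort {

  public static void sort(int[] a) {
    sort(a, 0, a.length - 1);
  }

  private static void sort(int[] a, int lo, int hi) {
    while (lo < hi) {              // tail recursion replaced by a loop
      int pivot = a[lo + (hi - lo) / 2];
      int lt = lo, i = lo, gt = hi;
      while (i <= gt) {            // partition into [<pivot][==pivot][>pivot]
        if (a[i] < pivot) swap(a, lt++, i++);
        else if (a[i] > pivot) swap(a, i, gt--);
        else i++;
      }
      sort(a, lo, lt - 1);         // recurse on the smaller-than-pivot range
      lo = gt + 1;                 // loop on the greater-than-pivot range
    }
  }

  private static void swap(int[] a, int i, int j) {
    int t = a[i]; a[i] = a[j]; a[j] = t;
  }

  public static void main(String[] args) {
    int[] data = {5, 1, 5, 3, 5, 2, 5, 4, 5};   // low cardinality: many duplicates
    sort(data);
    System.out.println(Arrays.toString(data));  // prints [1, 2, 3, 4, 5, 5, 5, 5, 5]
  }
}
```

With many duplicate keys, each pass removes a whole run of equal elements from further work, which matches the RANDOM_LOW_CARDINALITY numbers above.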



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10097) Replace TreeMap use by HashMap when unnecessary

2021-10-04 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-10097.
-
Resolution: Won't Do

> Replace TreeMap use by HashMap when unnecessary
> ---
>
> Key: LUCENE-10097
> URL: https://issues.apache.org/jira/browse/LUCENE-10097
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> There are a couple of places where TreeMap is used although it could easily 
> be replaced by a HashMap with potentially a single sort. Sometimes it would 
> bring perf improvement (e.g. when TreeMap.entrySet() is called), other times 
> it's more for consistency to use a simpler HashMap if there is no strong need 
> for a TreeMap.
> I saw other places where we have TODOs to see whether we can replace the 
> TreeMap, but when it is more complex, I'd prefer to open separate Jira 
> issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10097) Replace TreeMap use by HashMap when unnecessary

2021-10-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423780#comment-17423780
 ] 

Bruno Roustant commented on LUCENE-10097:
-

Based on remarks in the PRs, I'm closing this Jira issue: HashMap + List is 
too complicated, and TreeMap was a deliberate choice to keep the memory usage 
low.
Maybe I'll come back later with another proposal, as I think most of the time 
we use TreeMap when the intent is to build an immutable sorted map, which could 
be more memory efficient.
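
Such an immutable sorted map could be sketched as parallel sorted arrays; this is an illustrative design sketch, not an existing Lucene class:

```java
import java.util.Arrays;

// Hypothetical sketch of an "immutable sorted map" built once from parallel
// arrays: after a single sort at construction time, lookups are a binary
// search and the footprint is just two arrays (no per-entry tree nodes).
public class ImmutableSortedMapSketch {
  private final String[] keys;   // must be sorted; keys[i] maps to values[i]
  private final int[] values;

  ImmutableSortedMapSketch(String[] sortedKeys, int[] values) {
    this.keys = sortedKeys;
    this.values = values;
  }

  int get(String key) {
    int i = Arrays.binarySearch(keys, key);
    return i >= 0 ? values[i] : -1;   // -1 as a "not found" marker
  }

  public static void main(String[] args) {
    ImmutableSortedMapSketch map =
        new ImmutableSortedMapSketch(new String[] {"a", "b", "c"}, new int[] {2, 1, 1});
    System.out.println(map.get("b"));  // prints 1
  }
}
```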

> Replace TreeMap use by HashMap when unnecessary
> ---
>
> Key: LUCENE-10097
> URL: https://issues.apache.org/jira/browse/LUCENE-10097
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 4h 10m
>  Remaining Estimate: 0h
>
> There are a couple of places where TreeMap is used although it could easily 
> be replaced by a HashMap with potentially a single sort. Sometimes it would 
> bring perf improvement (e.g. when TreeMap.entrySet() is called), other times 
> it's more for consistency to use a simpler HashMap if there is no strong need 
> for a TreeMap.
> I saw other places where we have TODOs to see whether we can replace the 
> TreeMap, but when it is more complex, I'd prefer to open separate Jira 
> issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10097) Replace TreeMap use by HashMap when unnecessary

2021-09-10 Thread Bruno Roustant (Jira)
Bruno Roustant created LUCENE-10097:
---

 Summary: Replace TreeMap use by HashMap when unnecessary
 Key: LUCENE-10097
 URL: https://issues.apache.org/jira/browse/LUCENE-10097
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Bruno Roustant
Assignee: Bruno Roustant


There are a couple of places where TreeMap is used although it could easily be 
replaced by a HashMap with potentially a single sort. Sometimes it would bring 
perf improvement (e.g. when TreeMap.entrySet() is called), other times it's 
more for consistency to use a simpler HashMap if there is no strong need for a 
TreeMap.

I saw other places where we have TODOs to see whether we can replace the 
TreeMap, but when it is more complex, I'd prefer to open separate Jira issues.
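
The replacement pattern can be sketched as follows; this is an illustrative stand-alone example, not code from Lucene:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the proposed pattern: accumulate counts in a
// HashMap (O(1) puts, no per-insert tree rebalancing) and sort the entries
// once, only at the point where ordered iteration is actually required.
public class HashMapSingleSort {

  static List<Map.Entry<String, Integer>> sortedCounts(String[] terms) {
    Map<String, Integer> counts = new HashMap<>();
    for (String term : terms) {
      counts.merge(term, 1, Integer::sum);
    }
    // Single sort when (and only when) a sorted view is needed.
    List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
    sorted.sort(Map.Entry.comparingByKey());
    return sorted;
  }

  public static void main(String[] args) {
    System.out.println(sortedCounts(new String[] {"b", "a", "c", "a"}));  // prints [a=2, b=1, c=1]
  }
}
```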



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-07 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358418#comment-17358418
 ] 

Bruno Roustant commented on LUCENE-9983:


[~zhai7631] do you have some stats about the number of NFA / DFA states 
manipulated?

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9983) Stop sorting determinize powersets unnecessarily

2021-06-02 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355600#comment-17355600
 ] 

Bruno Roustant commented on LUCENE-9983:


How many states are manipulated?
If the states are numbered from 0 to N, and we keep most of the states during 
the computation, or N is not too high, then should we use an array instead of a 
map, where array[state] is the "reference count"? We wouldn't have to sort the 
set of states for the equality check because comparison would directly follow 
the array order (skipping states with a 0 reference count).
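
A minimal sketch of that array-based idea, with hypothetical names (not a proposed patch):

```java
// Hypothetical sketch of the array-based powerset representation suggested
// above: counts[state] holds the reference count for each NFA state, so two
// powersets can be compared in natural array order, skipping zero entries,
// without ever sorting a set of state ids.
public class StateCountSet {
  private final int[] counts;

  StateCountSet(int numStates) { this.counts = new int[numStates]; }

  void incRef(int state) { counts[state]++; }
  void decRef(int state) { counts[state]--; }

  /** True if both sets contain exactly the same states (count > 0), ignoring the count values. */
  boolean sameStates(StateCountSet other) {
    for (int s = 0; s < counts.length; s++) {
      if ((counts[s] > 0) != (other.counts[s] > 0)) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    StateCountSet a = new StateCountSet(8);
    StateCountSet b = new StateCountSet(8);
    a.incRef(3); a.incRef(5); a.incRef(3);   // state 3 referenced twice
    b.incRef(5); b.incRef(3);
    System.out.println(a.sameStates(b));     // prints true
  }
}
```

The trade-off is O(N) equality checks and O(N) memory per set even when few states are referenced, which is why it only pays off if N is not too high.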

> Stop sorting determinize powersets unnecessarily
> 
>
> Key: LUCENE-9983
> URL: https://issues.apache.org/jira/browse/LUCENE-9983
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Spinoff from LUCENE-9981.
> Today, our {{Operations.determinize}} implementation builds powersets of all 
> subsets of NFA states that "belong" in the same determinized state, using 
> [this algorithm|https://en.wikipedia.org/wiki/Powerset_construction].
> To hold each powerset, we use a malleable {{SortedIntSet}} and periodically 
> freeze it to a {{FrozenIntSet}}, also sorted.  We pay a high price to keep 
> these growing maps of int key, int value sorted by key, e.g. upgrading to a 
> {{TreeMap}} once the map is large enough (> 30 entries).
> But I think sorting is entirely unnecessary here!  Really all we need is the 
> ability to add/delete keys from the map, and hashCode / equals (by key only – 
> ignoring value!), and to freeze the map (a small optimization that we could 
> skip initially).  We only use these maps to lookup in the (growing) 
> determinized automaton whether this powerset has already been seen.
> Maybe we could simply poach the {{IntIntScatterMap}} implementation from 
> [HPPC|https://github.com/carrotsearch/hppc]?  And then change its 
> {{hashCode}}/{{equals}} to only use keys (not values).
> This change should be a big speedup for the kinds of (admittedly adversarial) 
> regexps we saw on LUCENE-9981.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2021-05-31 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354275#comment-17354275
 ] 

Bruno Roustant commented on LUCENE-9379:


_RE AES-XTS vs AES-CTR:_
In the case of Lucene, we produce read-only files per index segment. And if we 
have a new random IV per file, we don't repeat the same (AES encrypted) blocks. 
So we are in the safe write-once case where AES-XTS and AES-CTR have the same 
strength [1][2]. Given that CTR is simpler, that's why I chose it for this 
patch.

[1] 
https://crypto.stackexchange.com/questions/64556/aes-xts-vs-aes-ctr-for-write-once-storage
[2] 
https://crypto.stackexchange.com/questions/14628/why-do-we-use-xts-over-ctr-for-disk-encryption
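
The scheme can be sketched with only the JDK's javax.crypto: AES/CTR with a fresh random IV per write-once file. Key management and actual file I/O are omitted, and the class and method names are illustrative, not Lucene's encryption code:

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// JDK-only sketch: AES in CTR mode with a fresh random IV per file, so the
// same key never encrypts the same counter block twice across files.
public class CtrPerFileDemo {

  /** Encrypts with a fresh random key/IV, decrypts, and checks the round trip. */
  static boolean roundTrips(byte[] plain) {
    try {
      SecureRandom random = new SecureRandom();
      byte[] key = new byte[32];   // AES-256 key; provided by the caller, never stored on disk
      random.nextBytes(key);
      byte[] iv = new byte[16];    // new random IV for each file

      random.nextBytes(iv);
      Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
      cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
      byte[] encrypted = cipher.doFinal(plain);

      cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
      return Arrays.equals(plain, cipher.doFinal(encrypted));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrips("segment file bytes".getBytes(StandardCharsets.UTF_8)));  // prints true
  }
}
```

Because CTR is a stream mode, the ciphertext has the same length as the plaintext, which keeps random-access reads of index files simple.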

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> +Important+: This Lucene Directory wrapper approach is to be considered only 
> if an OS level encryption is not possible. OS level encryption better fits 
> Lucene usage of OS cache, and thus is more performant.
> But there are some use-cases where OS level encryption is not possible. This 
> Jira issue was created to address those.
> 
>  
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-03-17 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303494#comment-17303494
 ] 

Bruno Roustant commented on LUCENE-9663:


Ok, I backported to the 8.x branch, and I updated CHANGES.txt in main to move 
the entry to the 8.9.0 section.

> Adding compression to terms dict from SortedSet/Sorted DocValues
> 
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Jaison.Bi
>Priority: Trivial
> Fix For: 8.9
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> Elasticsearch keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do 
> better by replacing prefix-compression with LZ4. In one of our applications, 
> the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on the real application data, comparing the 
> write/merge time cost, and the on-disk *.dvd file size(after merge into 1 
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for the high-cardinality fields. 
>  I'm doing the benchmark test based on luceneutil. Will attach the report and 
> patch after the test.
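
The effect of compressing a whole block of terms can be illustrated with the JDK alone. Note the JDK has no LZ4, so Deflater stands in here purely to show why block compression beats per-term prefix sharing when terms share long substrings; Lucene's actual change uses LZ4:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative stand-in: compress one whole block of similar terms and
// compare the compressed size to the raw size. High-cardinality keyword
// fields typically produce terms with long shared substrings, which block
// compression exploits far better than prefix sharing alone.
public class TermsBlockCompressionDemo {

  /** Returns the compressed size of the given bytes. */
  static int compressedSize(byte[] raw) {
    Deflater deflater = new Deflater();
    deflater.setInput(raw);
    deflater.finish();
    byte[] out = new byte[raw.length];
    int n = deflater.deflate(out);
    deflater.end();
    return n;
  }

  public static void main(String[] args) {
    StringBuilder block = new StringBuilder();
    for (int i = 0; i < 64; i++) {
      block.append("keyword_field_value_").append(i).append('\n');  // one block of terms
    }
    byte[] raw = block.toString().getBytes(StandardCharsets.UTF_8);
    System.out.println(compressedSize(raw) < raw.length);  // prints true: the block compresses well
  }
}
```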



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-03-17 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-9663:
---
Fix Version/s: (was: main (9.0))
   8.9

> Adding compression to terms dict from SortedSet/Sorted DocValues
> 
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Jaison.Bi
>Priority: Trivial
> Fix For: 8.9
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> Elasticsearch keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do 
> better by replacing prefix-compression with LZ4. In one of our applications, 
> the dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on the real application data, comparing the 
> write/merge time cost, and the on-disk *.dvd file size(after merge into 1 
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for the high-cardinality fields. 
>  I'm doing the benchmark test based on luceneutil. Will attach the report and 
> patch after the test.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9796) fix SortedDocValues to no longer extend BinaryDocValues

2021-03-15 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301817#comment-17301817
 ] 

Bruno Roustant commented on LUCENE-9796:


Ok, I'll try to update Solr to use TermOrdValComparator, reusing the same issue.

> fix SortedDocValues to no longer extend BinaryDocValues
> ---
>
> Key: LUCENE-9796
> URL: https://issues.apache.org/jira/browse/LUCENE-9796
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Fix For: main (9.0)
>
> Attachments: LUCENE-9796.patch, LUCENE-9796_prototype.patch
>
>
> SortedDocValues give ordinals and a way to dereference an ordinal as a byte[]
> But currently they *extend* BinaryDocValues, which allows directly calling 
> {{binaryValue()}}.
> This allows them to act as a "slow" BinaryDocValues, but it is a performance 
> trap, especially now that terms bytes may be block-compressed (LUCENE-9663).
> I think this should be detangled to prevent performance traps like 
> LUCENE-9795: SortedDocValues shouldn't have the trappy inherited 
> {{binaryValue()}} method that implicitly derefs the ord for the doc, then the 
> term bytes for the ord.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9796) fix SortedDocValues to no longer extend BinaryDocValues

2021-03-15 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17301800#comment-17301800
 ] 

Bruno Roustant commented on LUCENE-9796:


Solr main branch build broke after this change. I worked on SOLR-15261 to fix 
that. We'll have this kind of issue until Solr uses a Lucene snapshot.

While fixing this, I noticed that 
o.a.l.search.FieldComparator$TermValComparator.getLeafComparator() may need to 
be adapted to create not only BinaryDocValues but also SortedDocValues?

> fix SortedDocValues to no longer extend BinaryDocValues
> ---
>
> Key: LUCENE-9796
> URL: https://issues.apache.org/jira/browse/LUCENE-9796
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Fix For: main (9.0)
>
> Attachments: LUCENE-9796.patch, LUCENE-9796_prototype.patch
>
>
> SortedDocValues give ordinals and a way to dereference an ordinal as a byte[]
> But currently they *extend* BinaryDocValues, which allows directly calling 
> {{binaryValue()}}.
> This allows them to act as a "slow" BinaryDocValues, but it is a performance 
> trap, especially now that terms bytes may be block-compressed (LUCENE-9663).
> I think this should be detangled to prevent performance traps like 
> LUCENE-9795: SortedDocValues shouldn't have the trappy inherited 
> {{binaryValue()}} method that implicitly derefs the ord for the doc, then the 
> term bytes for the ord.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-15170) Elevation file in data dir not working in Solr Cloud

2021-03-10 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298918#comment-17298918
 ] 

Bruno Roustant commented on SOLR-15170:
---

PR for the fix added. I plan to backport the fix in branch 8.9 (not in 7.7.2 
where the issue was detected).

> Elevation file in data dir not working in Solr Cloud
> 
>
> Key: SOLR-15170
> URL: https://issues.apache.org/jira/browse/SOLR-15170
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.7.2
>Reporter: Monica Marrero
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using elevation, it is not possible to store the _elevate.xml_ file in 
> the data folder instead of in the configuration folder in Solr Cloud. It is 
> only possible in standalone mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295358#comment-17295358
 ] 

Bruno Roustant commented on SOLR-15038:
---

I reverted this specific line in both master and branch_8x.

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with the Query Elevation component lately and we were 
> missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: Only show the representative if the 
> elevated documents have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295203#comment-17295203
 ] 

Bruno Roustant edited comment on SOLR-15038 at 3/4/21, 10:37 AM:
-

Ouch, yes I'll revert that. I played with this permission but didn't intend to 
commit it.

When running the tests I noticed that many Solr tests log warnings about being 
unable to create some test resources.
{code:java}
java.security.AccessControlException: access denied ("java.io.FilePermission" 
"lucene-solr/solr/core/build/resources/test/solr/userfiles" "write")
at 
java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
 ~[?:?]
at java.security.AccessController.checkPermission(AccessController.java:897) 
~[?:?]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?]
at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?]
at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?]
at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377)
 ~[?:?]
at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?]
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?]
at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:383) [main/:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:344) 
[main/:?]{code}
I noticed they disappeared when I changed the permission for write-access in 
solr-tests.policy.

[~dweiss] do you know how to get rid of these (many) warnings?


was (Author: broustant):
Ouch, yes I'll revert that. I played with this permission but didn't intend to 
commit it.

When running the tests I noticed that many Solr tests log warnings about being 
unable to create some test resources.
{code:java}
java.security.AccessControlException: access denied ("java.io.FilePermission" 
"lucene-solr/solr/core/build/resources/test/solr/userfiles" "write")
at 
java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
 ~[?:?]
at java.security.AccessController.checkPermission(AccessController.java:897) 
~[?:?]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?]
at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?]
at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?]
at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377)
 ~[?:?]
at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?]
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?]
at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:383) [main/:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:344) 
[main/:?]{code}
I noticed they disappeared when I changed the permission for write-access in 
solr-tests.policy.

Do you know how to get rid of these (many) warnings?

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with the Query Elevation component lately and we were 
> missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: Only show the representative if the 
> elevated documents have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-03-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295203#comment-17295203
 ] 

Bruno Roustant commented on SOLR-15038:
---

Ouch, yes I'll revert that. I played with this permission but didn't intend to 
commit it.

When running the tests I noticed that many Solr tests log warnings about being 
unable to create some test resources.
{code:java}
java.security.AccessControlException: access denied ("java.io.FilePermission" 
"lucene-solr/solr/core/build/resources/test/solr/userfiles" "write")
at 
java.security.AccessControlContext.checkPermission(AccessControlContext.java:472)
 ~[?:?]
at java.security.AccessController.checkPermission(AccessController.java:897) 
~[?:?]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:322) ~[?:?]
at java.lang.SecurityManager.checkWrite(SecurityManager.java:752) ~[?:?]
at sun.nio.fs.UnixPath.checkWrite(UnixPath.java:824) ~[?:?]
at 
sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:377)
 ~[?:?]
at java.nio.file.Files.createDirectory(Files.java:689) ~[?:?]
at java.nio.file.Files.createAndCheckIsDirectory(Files.java:796) ~[?:?]
at java.nio.file.Files.createDirectories(Files.java:742) ~[?:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:383) [main/:?]
at org.apache.solr.core.CoreContainer.<init>(CoreContainer.java:344) 
[main/:?]{code}
I noticed they disappeared when I changed the permission for write-access in 
solr-tests.policy.

Do you know how to get rid of these (many) warnings?

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with the Query Elevation component lately and we were 
> missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: Only show the representative if the 
> elevated documents have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9796) fix SortedDocValues to no longer extend BinaryDocValues

2021-03-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292769#comment-17292769
 ] 

Bruno Roustant commented on LUCENE-9796:


+1

> fix SortedDocValues to no longer extend BinaryDocValues
> ---
>
> Key: LUCENE-9796
> URL: https://issues.apache.org/jira/browse/LUCENE-9796
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
> Attachments: LUCENE-9796.patch, LUCENE-9796_prototype.patch
>
>
> SortedDocValues give ordinals and a way to dereference an ordinal as a byte[]
> But currently they *extend* BinaryDocValues, which allows directly calling 
> {{binaryValue()}}.
> This allows them to act as a "slow" BinaryDocValues, but it is a performance 
> trap, especially now that terms bytes may be block-compressed (LUCENE-9663).
> I think this should be detangled to prevent performance traps like 
> LUCENE-9795: SortedDocValues shouldn't have the trappy inherited 
> {{binaryValue()}} method that implicitly derefs the ord for the doc, then the 
> term bytes for the ord.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-03-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292768#comment-17292768
 ] 

Bruno Roustant commented on LUCENE-9815:


+1 on LUCENE-9796
I'll close this PR and try to find some cycles to help on LUCENE-9796.

> PerField formats can select the format based on FieldInfo
> -
>
> Key: LUCENE-9815
> URL: https://issues.apache.org/jira/browse/LUCENE-9815
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
> Attachments: Screen_Shot_2021-02-28_at_16.08.05.png
>
>
> PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the 
> format based on the field name.
> If we improve them to also support the selection based on the FieldInfo, it 
> will be possible to select based on some FieldInfo attribute, DocValuesType, 
> etc.
> +Example use-case:+
>  It will be possible to easily adapt the compression mode of doc values 
> fields based on the DocValuesType, e.g. compressing sorted doc values but 
> not binary doc values.
> > User creates a new custom codec which provides a custom DocValuesFormat 
> > which extends PerFieldDocValuesFormat and implements the method
>  DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
>  This method provides either a standard Lucene80DocValuesFormat (no 
> compression) or another new custom DocValuesFormat extending 
> Lucene80DocValuesFormat with BEST_COMPRESSION mode.
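A minimal sketch of the proposed selection, using invented stand-in types (DvType, FieldInfoSketch) rather than the real Lucene classes — the real hook would return a DocValuesFormat instance, not a String:

```java
// Stand-in for org.apache.lucene.index.DocValuesType (illustrative only).
enum DvType { NONE, NUMERIC, BINARY, SORTED, SORTED_SET }

// Stand-in for FieldInfo: carries more than just the field name.
record FieldInfoSketch(String name, DvType docValuesType) {}

class PerFieldSelectorSketch {
  // The proposed hook: select from the whole FieldInfo, not just the name.
  String getDocValuesFormatForField(FieldInfoSketch fieldInfo) {
    switch (fieldInfo.docValuesType()) {
      case SORTED:
      case SORTED_SET:
        return "Lucene80(BEST_COMPRESSION)"; // compress sorted terms dicts
      default:
        return "Lucene80(BEST_SPEED)"; // leave binary/numeric uncompressed
    }
  }

  public static void main(String[] args) {
    PerFieldSelectorSketch s = new PerFieldSelectorSketch();
    System.out.println(s.getDocValuesFormatForField(
        new FieldInfoSketch("keyword", DvType.SORTED_SET)));
    System.out.println(s.getDocValuesFormatForField(
        new FieldInfoSketch("payload", DvType.BINARY)));
  }
}
```

With only the current name-based hook, the same routing requires the caller to encode the doc-values type into field naming conventions, which is what this issue set out to avoid.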






[jira] [Resolved] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-03-01 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-9815.

Resolution: Won't Do

> PerField formats can select the format based on FieldInfo
> -
>
> Key: LUCENE-9815
> URL: https://issues.apache.org/jira/browse/LUCENE-9815
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
> Attachments: Screen_Shot_2021-02-28_at_16.08.05.png
>
>
> PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the 
> format based on the field name.
> If we improve them to also support the selection based on the FieldInfo, it 
> will be possible to select based on some FieldInfo attribute, DocValuesType, 
> etc.
> +Example use-case:+
>  It will be possible to easily adapt the compression mode of doc values 
> fields based on the DocValuesType, e.g. compressing sorted doc values but 
> not binary doc values.
> > User creates a new custom codec which provides a custom DocValuesFormat 
> > which extends PerFieldDocValuesFormat and implements the method
>  DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
>  This method provides either a standard Lucene80DocValuesFormat (no 
> compression) or another new custom DocValuesFormat extending 
> Lucene80DocValuesFormat with BEST_COMPRESSION mode.






[jira] [Assigned] (SOLR-15170) Elevation file in data dir not working in Solr Cloud

2021-03-01 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant reassigned SOLR-15170:
-

Assignee: Bruno Roustant

> Elevation file in data dir not working in Solr Cloud
> 
>
> Key: SOLR-15170
> URL: https://issues.apache.org/jira/browse/SOLR-15170
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.7.2
>Reporter: Monica Marrero
>Assignee: Bruno Roustant
>Priority: Major
>
> When using elevation, it is not possible to store the _elevate.xml_ file in 
> the data folder instead of in the configuration folder in Solr Cloud. It is 
> only possible in standalone mode.






[jira] [Commented] (SOLR-15170) Elevation file in data dir not working in Solr Cloud

2021-03-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292730#comment-17292730
 ] 

Bruno Roustant commented on SOLR-15170:
---

I'll have a look soon

> Elevation file in data dir not working in Solr Cloud
> 
>
> Key: SOLR-15170
> URL: https://issues.apache.org/jira/browse/SOLR-15170
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.7.2
>Reporter: Monica Marrero
>Priority: Major
>
> When using elevation, it is not possible to store the _elevate.xml_ file in 
> the data folder instead of in the configuration folder in Solr Cloud. It is 
> only possible in standalone mode.






[jira] [Commented] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-02-28 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292518#comment-17292518
 ] 

Bruno Roustant commented on LUCENE-9815:


[~rcmuir] do you mean always compressing sorted docvalues and having an on/off 
mode for binary docvalues?
Based on LUCENE-9378, binary docvalues compression causes a big performance 
impact, so the current on/off compression mode for all docvalues is not so 
useful: users who don't want to take the performance hit for binary docvalues 
end up not enabling compression for sorted sets either.

> PerField formats can select the format based on FieldInfo
> -
>
> Key: LUCENE-9815
> URL: https://issues.apache.org/jira/browse/LUCENE-9815
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>
> PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the 
> format based on the field name.
> If we improve them to also support the selection based on the FieldInfo, it 
> will be possible to select based on some FieldInfo attribute, DocValuesType, 
> etc.
> +Example use-case:+
>  It will be possible to easily adapt the compression mode of doc values 
> fields based on the DocValuesType, e.g. compressing sorted doc values but 
> not binary doc values.
> > User creates a new custom codec which provides a custom DocValuesFormat 
> > which extends PerFieldDocValuesFormat and implements the method
>  DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
>  This method provides either a standard Lucene80DocValuesFormat (no 
> compression) or another new custom DocValuesFormat extending 
> Lucene80DocValuesFormat with BEST_COMPRESSION mode.






[jira] [Updated] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-02-28 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-9815:
---
Description: 
PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the 
format based on the field name.

If we improve them to also support the selection based on the FieldInfo, it 
will be possible to select based on some FieldInfo attribute, DocValuesType, 
etc.

+Example use-case:+
 It will be possible to easily adapt the compression mode of doc values fields 
based on the DocValuesType, e.g. compressing sorted doc values but not binary 
doc values.

> User creates a new custom codec which provides a custom DocValuesFormat which 
> extends PerFieldDocValuesFormat and implements the method
 DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
 This method provides either a standard Lucene80DocValuesFormat (no 
compression) or another new custom DocValuesFormat extending 
Lucene80DocValuesFormat with BEST_COMPRESSION mode.

  was:
PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the 
format based on the field name.

If we improve them to also support the selection based on the FieldInfo, it 
will be possible to select based on some FieldInfo attribute, DocValuesType, 
etc.

+Use-case example:+
It will be possible for example to adapt the compression mode of doc values 
fields easily based on the DocValuesType. E.g. compressing sorted and not 
binary doc values.

> User creates a new custom codec which provides a custom DocValuesFormat which 
> extends PerFieldDocValuesFormat and implements the method
DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
This method provides either a standard Lucene80DocValuesFormat (no compression) 
or another new custom DocValuesFormat extending Lucene80DocValuesFormat with 
BEST_COMPRESSION mode.


> PerField formats can select the format based on FieldInfo
> -
>
> Key: LUCENE-9815
> URL: https://issues.apache.org/jira/browse/LUCENE-9815
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Minor
>
> PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the 
> format based on the field name.
> If we improve them to also support the selection based on the FieldInfo, it 
> will be possible to select based on some FieldInfo attribute, DocValuesType, 
> etc.
> +Example use-case:+
>  It will be possible to easily adapt the compression mode of doc values 
> fields based on the DocValuesType, e.g. compressing sorted doc values but 
> not binary doc values.
> > User creates a new custom codec which provides a custom DocValuesFormat 
> > which extends PerFieldDocValuesFormat and implements the method
>  DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
>  This method provides either a standard Lucene80DocValuesFormat (no 
> compression) or another new custom DocValuesFormat extending 
> Lucene80DocValuesFormat with BEST_COMPRESSION mode.






[jira] [Created] (LUCENE-9815) PerField formats can select the format based on FieldInfo

2021-02-28 Thread Bruno Roustant (Jira)
Bruno Roustant created LUCENE-9815:
--

 Summary: PerField formats can select the format based on FieldInfo
 Key: LUCENE-9815
 URL: https://issues.apache.org/jira/browse/LUCENE-9815
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Bruno Roustant


PerFieldDocValuesFormat and PerFieldPostingsFormat currently only select the 
format based on the field name.

If we improve them to also support the selection based on the FieldInfo, it 
will be possible to select based on some FieldInfo attribute, DocValuesType, 
etc.

+Use-case example:+
It will be possible, for example, to easily adapt the compression mode of doc 
values fields based on the DocValuesType, e.g. compressing sorted doc values 
but not binary doc values.

> User creates a new custom codec which provides a custom DocValuesFormat which 
> extends PerFieldDocValuesFormat and implements the method
DocValuesFormat getDocValuesFormatForField(FieldInfo fieldInfo).
This method provides either a standard Lucene80DocValuesFormat (no compression) 
or another new custom DocValuesFormat extending Lucene80DocValuesFormat with 
BEST_COMPRESSION mode.






[jira] [Resolved] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-02-17 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved SOLR-15038.
---
Fix Version/s: 8.9
   Resolution: Fixed

Thanks [~kaessmann]

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
> Fix For: 8.9
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with the Query Elevation component recently and were 
> missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: only show the representative if the 
> elevated documents have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  






[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-02-16 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285321#comment-17285321
 ] 

Bruno Roustant commented on SOLR-15038:
---

The PR sounds good to me. I'll merge soon.

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> We've worked a lot with the Query Elevation component recently and were 
> missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: only show the representative if the 
> elevated documents have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  






[jira] [Updated] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-02-15 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-9663:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Thanks [~Jaison], this is merged in master 9.0.

I don't plan to port it to branch 8.x unless I'm advised otherwise.

> Adding compression to terms dict from SortedSet/Sorted DocValues
> 
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Jaison.Bi
>Priority: Trivial
> Fix For: master (9.0)
>
>  Time Spent: 11h 10m
>  Remaining Estimate: 0h
>
> Elasticsearch's keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 added prefix compression for the docvalues terms dict. We can do 
> better by replacing prefix compression with LZ4. In one of our applications, 
> the .dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on real application data, comparing the 
> write/merge time cost and the on-disk *.dvd file size (after merging into one 
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for high-cardinality fields.
>  I'm running a benchmark test based on luceneutil and will attach the report 
> and patch after the test.
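For illustration, here is a self-contained sketch of the prefix-compression scheme this issue replaces with LZ4 (invented classes, not the LUCENE-7081 code): each term stores only the length of the prefix it shares with the previous term, plus its suffix, which is why decoding a term requires scanning the block from its start.

```java
import java.util.ArrayList;
import java.util.List;

class PrefixCompressSketch {
  // Encode sorted terms as (sharedPrefixLen, suffix) pairs.
  static List<Object[]> encode(List<String> sortedTerms) {
    List<Object[]> out = new ArrayList<>();
    String prev = "";
    for (String term : sortedTerms) {
      int p = 0;
      int max = Math.min(prev.length(), term.length());
      while (p < max && prev.charAt(p) == term.charAt(p)) p++;
      out.add(new Object[] {p, term.substring(p)});
      prev = term;
    }
    return out;
  }

  // Decoding must scan from the block start: term i depends on term i-1.
  static String decode(List<Object[]> encoded, int index) {
    String prev = "";
    for (int i = 0; i <= index; i++) {
      int p = (Integer) encoded.get(i)[0];
      prev = prev.substring(0, p) + encoded.get(i)[1];
    }
    return prev;
  }

  public static void main(String[] args) {
    List<String> terms = List.of("search", "searching", "seed", "segment");
    List<Object[]> enc = encode(terms);
    System.out.println(decode(enc, 3)); // -> segment
  }
}
```

Block-level LZ4, in contrast, compresses whole runs of term bytes at once, which pays off most on high-cardinality fields where many terms share structure beyond a common prefix.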






[jira] [Commented] (SOLR-15038) Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to elevation functionality

2021-02-15 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-15038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17284798#comment-17284798
 ] 

Bruno Roustant commented on SOLR-15038:
---

Thanks for the PR. I have some comments/questions there.

> Add elevateDocsWithoutMatchingQ and onlyElevatedRepresentative parameters to 
> elevation functionality
> 
>
> Key: SOLR-15038
> URL: https://issues.apache.org/jira/browse/SOLR-15038
> Project: Solr
>  Issue Type: Improvement
>  Components: query
>Reporter: Tobias Kässmann
>Priority: Minor
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We've worked a lot with the Query Elevation component recently and were 
> missing two features:
>  * Elevate only documents that are part of the search result
>  * In combination with collapsing: only show the representative if the 
> elevated documents have the same collapse field value.
> Because of this, we've added these two feature toggles 
> _elevateDocsWithoutMatchingQ_ and _onlyElevatedRepresentative._
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-02-09 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281779#comment-17281779
 ] 

Bruno Roustant commented on LUCENE-9663:


I'm ready to merge. I think it could go to the 8.9 branch, but I'd like to have 
confirmation.

This change adds compression to Lucene80DocValuesFormat when 
Mode.BEST_COMPRESSION is used, and it is backward compatible.

[~jpountz] any suggestion? Thanks

> Adding compression to terms dict from SortedSet/Sorted DocValues
> 
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Jaison.Bi
>Priority: Trivial
> Fix For: master (9.0)
>
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> Elasticsearch's keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 added prefix compression for the docvalues terms dict. We can do 
> better by replacing prefix compression with LZ4. In one of our applications, 
> the .dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on real application data, comparing the 
> write/merge time cost and the on-disk *.dvd file size (after merging into one 
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for high-cardinality fields.
>  I'm running a benchmark test based on luceneutil and will attach the report 
> and patch after the test.






[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues

2021-02-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278966#comment-17278966
 ] 

Bruno Roustant commented on LUCENE-9663:


The latest PR looks good. I'm going to merge it in a couple of days if there is 
no objection.

[~Jaison] you may want to open another Jira issue if you want to propose more 
configuration for the compression (and you can link it to this issue).

> Adding compression to terms dict from SortedSet/Sorted DocValues
> 
>
> Key: LUCENE-9663
> URL: https://issues.apache.org/jira/browse/LUCENE-9663
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Jaison.Bi
>Priority: Trivial
> Fix For: master (9.0)
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Elasticsearch's keyword field uses SortedSet DocValues. In our applications, 
> “keyword” is the most frequently used field type.
>  LUCENE-7081 added prefix compression for the docvalues terms dict. We can do 
> better by replacing prefix compression with LZ4. In one of our applications, 
> the .dvd files were ~41% smaller with this change (from 1.95 GB to 1.15 GB).
>  I've done simple tests based on real application data, comparing the 
> write/merge time cost and the on-disk *.dvd file size (after merging into one 
> segment).
> || ||Before||After||
> |Write time cost(ms)|591972|618200|
> |Merge time cost(ms)|270661|294663|
> |*.dvd file size(GB)|1.95|1.15|
> This feature is only for high-cardinality fields.
>  I'm running a benchmark test based on luceneutil and will attach the report 
> and patch after the test.






[jira] [Resolved] (LUCENE-9646) Set BM25Similarity discountOverlaps via the constructor

2021-01-19 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-9646.

Fix Version/s: master (9.0)
   Resolution: Fixed

Thank you [~pmarty]

> Set BM25Similarity discountOverlaps via the constructor
> ---
>
> Key: LUCENE-9646
> URL: https://issues.apache.org/jira/browse/LUCENE-9646
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (9.0)
>Reporter: Patrick Marty
>Assignee: Bruno Roustant
>Priority: Trivial
> Fix For: master (9.0)
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The BM25Similarity discountOverlaps parameter is true by default.
> It can be set with the 
> {{org.apache.lucene.search.similarities.BM25Similarity#setDiscountOverlaps}} 
> method, but this method makes BM25Similarity mutable.
>  
> discountOverlaps should instead be set via the constructor, and the 
> {{setDiscountOverlaps}} method should be removed to make BM25Similarity 
> immutable.
>  
> PR https://github.com/apache/lucene-solr/pull/2161
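A minimal sketch of the proposed immutable design (an invented stand-in class, not the real org.apache.lucene.search.similarities.BM25Similarity; the effectiveFieldLength helper is illustrative): discountOverlaps becomes a final constructor parameter and the setter disappears.

```java
class BM25SimilaritySketch {
  private final float k1;
  private final float b;
  private final boolean discountOverlaps; // final: no setter, instance is immutable

  BM25SimilaritySketch(float k1, float b, boolean discountOverlaps) {
    this.k1 = k1;
    this.b = b;
    this.discountOverlaps = discountOverlaps;
  }

  // Convenience constructor keeping the historical default (true).
  BM25SimilaritySketch(float k1, float b) { this(k1, b, true); }

  boolean getDiscountOverlaps() { return discountOverlaps; }

  // Field length used for length normalization: overlap tokens (e.g. stacked
  // synonyms at the same position) are excluded when discountOverlaps is on.
  int effectiveFieldLength(int numTerms, int numOverlaps) {
    return discountOverlaps ? numTerms - numOverlaps : numTerms;
  }

  public static void main(String[] args) {
    BM25SimilaritySketch sim = new BM25SimilaritySketch(1.2f, 0.75f, true);
    System.out.println(sim.effectiveFieldLength(10, 3)); // prints 7
  }
}
```

An immutable similarity can be shared safely across IndexSearcher instances without worrying that a setter call changes scoring mid-flight.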






[jira] [Assigned] (LUCENE-9646) Set BM25Similarity discountOverlaps via the constructor

2021-01-13 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant reassigned LUCENE-9646:
--

Assignee: Bruno Roustant

> Set BM25Similarity discountOverlaps via the constructor
> ---
>
> Key: LUCENE-9646
> URL: https://issues.apache.org/jira/browse/LUCENE-9646
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (9.0)
>Reporter: Patrick Marty
>Assignee: Bruno Roustant
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The BM25Similarity discountOverlaps parameter is true by default.
> It can be set with the 
> {{org.apache.lucene.search.similarities.BM25Similarity#setDiscountOverlaps}} 
> method, but this method makes BM25Similarity mutable.
>  
> discountOverlaps should instead be set via the constructor, and the 
> {{setDiscountOverlaps}} method should be removed to make BM25Similarity 
> immutable.
>  
> PR https://github.com/apache/lucene-solr/pull/2161






[jira] [Commented] (LUCENE-9646) Set BM25Similarity discountOverlaps via the constructor

2021-01-13 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17264033#comment-17264033
 ] 

Bruno Roustant commented on LUCENE-9646:


I'm going to merge it tomorrow, on master only.

> Set BM25Similarity discountOverlaps via the constructor
> ---
>
> Key: LUCENE-9646
> URL: https://issues.apache.org/jira/browse/LUCENE-9646
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (9.0)
>Reporter: Patrick Marty
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The BM25Similarity discountOverlaps parameter is true by default.
> It can be set with the 
> {{org.apache.lucene.search.similarities.BM25Similarity#setDiscountOverlaps}} 
> method, but this method makes BM25Similarity mutable.
>  
> discountOverlaps should instead be set via the constructor, and the 
> {{setDiscountOverlaps}} method should be removed to make BM25Similarity 
> immutable.
>  
> PR https://github.com/apache/lucene-solr/pull/2161






[jira] [Resolved] (SOLR-15061) Fix NPE in SearchHandler when shards.info

2021-01-05 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved SOLR-15061.
---
Fix Version/s: 8.8
   Resolution: Fixed

> Fix NPE in SearchHandler when shards.info
> -
>
> Key: SOLR-15061
> URL: https://issues.apache.org/jira/browse/SOLR-15061
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Minor
> Fix For: 8.8
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This NPE happens in a specific case
>  - Short-circuited distributed request
>  - With shards.info
>  - With no QueryComponent (e.g. only spellcheck)
> One-liner fix in SearchHandler.handleRequestBody().
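A hypothetical sketch of the general shape of such a guard (invented names and stand-in types, not the actual Solr code in SearchHandler.handleRequestBody()): when no QueryComponent ran, the per-shard document results can be null, so the shards.info handling must check before reading them.

```java
import java.util.List;
import java.util.Map;

class ShardInfoSketch {
  // Builds a shards.info entry for one shard response; field names are invented.
  static Map<String, Object> shardInfo(String shardAddress, List<String> docs) {
    // The guard: docs may be null when only e.g. spellcheck ran (no QueryComponent).
    int numFound = (docs == null) ? 0 : docs.size();
    return Map.of("shardAddress", shardAddress, "numFound", numFound);
  }

  public static void main(String[] args) {
    System.out.println(shardInfo("shard1", null)); // no NPE: numFound=0
  }
}
```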






[jira] [Comment Edited] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied

2021-01-05 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258851#comment-17258851
 ] 

Bruno Roustant edited comment on LUCENE-9570 at 1/5/21, 11:51 AM:
--

[~dweiss] Yes finally it's done. I pushed the commit directly to your branch.

I have been working on spatial3d since yesterday. 123 classes, with some giant 
ones. Ouch, it took me more time than anticipated.

Lots of (inconsistently) missing braces in a single-line if. I added them. I 
hope I didn't miss too many of them.

Many multi-line comments where the lines were truncated with a newline 
inserted. I had to manually reformat the comment.

Badly formatted commented code: either line-truncated, or the original 
commented code had no space between concatenated strings (and I tried to add 
them as often as I could).

Lots of duplicated code to reformat again and again.

 


was (Author: broustant):
[~dweiss] Yes finally it's done. I pushed the commit directly to your branch.

I have been working on spatial3d since yesterday. 123 classes, with some giant 
ones. Ouch, it took me more time than anticipated.

Lots of (inconsistently) missing braces in a single-line if. I added them. I 
hope I didn't miss too many of them.

Many multi-line comments where the lines where truncated with a newline 
inserted. I had to reformat manually the comment.

Badly formatted commented code. Either line-truncated, or with original 
commented code had no space between concatenated strings (and I tried to add 
them as often as I could)

Lots of duplicated code to reformat again and again.

 

> Review code diffs after automatic formatting and correct problems before it 
> is applied
> --
>
> Key: LUCENE-9570
> URL: https://issues.apache.org/jira/browse/LUCENE-9570
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Blocker
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Review and correct all the javadocs before they're messed up by automatic 
> formatting. Apply project-by-project, review diff, correct. Lots of diffs but 
> it should be relatively quick.
> *Reviewing diffs manually*
>  * switch to branch jira/LUCENE-9570 which the PR is based on:
> {code:java}
> git remote add dweiss g...@github.com:dweiss/lucene-solr.git
> git fetch dweiss
> git checkout jira/LUCENE-9570
> {code}
>  * Open gradle/validation/spotless.gradle and locate the project/ package you 
> wish to review. Enable it in spotless.gradle by creating a corresponding 
> switch case block (refer to existing examples), for example:
> {code:java}
>   case ":lucene:highlighter":
> target "src/**"
> targetExclude "**/resources/**", "**/overview.html"
> break
> {code}
>  * Reformat the code:
> {code:java}
> gradlew tidy && git diff -w > /tmp/diff.patch && git status
> {code}
>  * Look at what has changed (git status) and review the differences manually 
> (/tmp/diff.patch). If everything looks ok, commit it directly to 
> jira/LUCENE-9570 or make a PR against that branch.
> {code:java}
> git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
> {code}
> *Packages remaining* (put your name next to a module you're working on to 
> avoid duplication).
>  * case ":lucene:spatial3d": (Bruno Roustant)






[jira] [Commented] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied

2021-01-05 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258851#comment-17258851
 ] 

Bruno Roustant commented on LUCENE-9570:


[~dweiss] Yes finally it's done. I pushed the commit directly to your branch.

I have been working on spatial3d since yesterday. 123 classes, with some giant 
ones. Ouch, it took me more time than anticipated.

Lots of (inconsistently) missing braces in a single-line if. I added them. I 
hope I didn't miss too many of them.

Many multi-line comments where the lines were truncated with a newline 
inserted. I had to manually reformat the comment.

Badly formatted commented code: either line-truncated, or the original 
commented code had no space between concatenated strings (and I tried to add 
them as often as I could).

Lots of duplicated code to reformat again and again.

 

> Review code diffs after automatic formatting and correct problems before it 
> is applied
> --
>
> Key: LUCENE-9570
> URL: https://issues.apache.org/jira/browse/LUCENE-9570
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Blocker
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Review and correct all the javadocs before they're messed up by automatic 
> formatting. Apply project-by-project, review diff, correct. Lots of diffs but 
> it should be relatively quick.
> *Reviewing diffs manually*
>  * switch to branch jira/LUCENE-9570 which the PR is based on:
> {code:java}
> git remote add dweiss g...@github.com:dweiss/lucene-solr.git
> git fetch dweiss
> git checkout jira/LUCENE-9570
> {code}
>  * Open gradle/validation/spotless.gradle and locate the project/ package you 
> wish to review. Enable it in spotless.gradle by creating a corresponding 
> switch case block (refer to existing examples), for example:
> {code:java}
>   case ":lucene:highlighter":
> target "src/**"
> targetExclude "**/resources/**", "**/overview.html"
> break
> {code}
>  * Reformat the code:
> {code:java}
> gradlew tidy && git diff -w > /tmp/diff.patch && git status
> {code}
>  * Look at what has changed (git status) and review the differences manually 
> (/tmp/diff.patch). If everything looks ok, commit it directly to 
> jira/LUCENE-9570 or make a PR against that branch.
> {code:java}
> git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
> {code}
> *Packages remaining* (put your name next to a module you're working on to 
> avoid duplication).
>  * case ":lucene:spatial3d": (Bruno Roustant)






[jira] [Commented] (LUCENE-9646) Set BM25Similarity discountOverlaps via the constructor

2021-01-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258142#comment-17258142
 ] 

Bruno Roustant commented on LUCENE-9646:


I see that the ClassicSimilarity constructor has the comment "Sole constructor: 
parameter-free". I don't know why it was designed with this parameter-free 
constructor plus the same setDiscountOverlaps setter inherited from 
TFIDFSimilarity. Based on the usages of this setter, these similarities could 
indeed be immutable instead.

> Set BM25Similarity discountOverlaps via the constructor
> ---
>
> Key: LUCENE-9646
> URL: https://issues.apache.org/jira/browse/LUCENE-9646
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: master (9.0)
>Reporter: Patrick Marty
>Priority: Trivial
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> BM25Similarity discountOverlaps parameter is true by default.
> It can be set with 
> {{org.apache.lucene.search.similarities.BM25Similarity#setDiscountOverlaps}} 
> method.
> But this method makes BM25Similarity mutable.
>  
> discountOverlaps should be set via the constructor and 
> {{setDiscountOverlaps}} method should be removed to make BM25Similarity 
> immutable.
>  
> PR https://github.com/apache/lucene-solr/pull/2161






[jira] [Updated] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied

2021-01-04 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-9570:
---
Description: 
Review and correct all the javadocs before they're messed up by automatic 
formatting. Apply project-by-project, review diff, correct. Lots of diffs but 
it should be relatively quick.

*Reviewing diffs manually*
 * switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}

 * Open gradle/validation/spotless.gradle and locate the project/ package you 
wish to review. Enable it in spotless.gradle by creating a corresponding switch 
case block (refer to existing examples), for example:
{code:java}
  case ":lucene:highlighter":
target "src/**"
targetExclude "**/resources/**", "**/overview.html"
break
{code}

 * Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}

 * Look at what has changed (git status) and review the differences manually 
(/tmp/diff.patch). If everything looks ok, commit it directly to 
jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid 
duplication).
 * case ":lucene:luke":
 * case ":lucene:sandbox": (Erick Erickson)
 * case ":lucene:spatial3d": (Bruno Roustant)
 * case ":lucene:spatial-extras":
 * case ":lucene:suggest":
 * case ":lucene:test-framework":

  was:
Review and correct all the javadocs before they're messed up by automatic 
formatting. Apply project-by-project, review diff, correct. Lots of diffs but 
it should be relatively quick.

*Reviewing diffs manually*
 * switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}

 * Open gradle/validation/spotless.gradle and locate the project/ package you 
wish to review. Enable it in spotless.gradle by creating a corresponding switch 
case block (refer to existing examples), for example:
{code:java}
  case ":lucene:highlighter":
target "src/**"
targetExclude "**/resources/**", "**/overview.html"
break
{code}

 * Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}

 * Look at what has changed (git status) and review the differences manually 
(/tmp/diff.patch). If everything looks ok, commit it directly to 
jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid 
duplication).
 * case ":lucene:luke":
 * case ":lucene:sandbox": (Erick Erickson)
 * case ":lucene:spatial3d":
 * case ":lucene:spatial-extras":
 * case ":lucene:suggest":
 * case ":lucene:test-framework":


> Review code diffs after automatic formatting and correct problems before it 
> is applied
> --
>
> Key: LUCENE-9570
> URL: https://issues.apache.org/jira/browse/LUCENE-9570
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Blocker
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Review and correct all the javadocs before they're messed up by automatic 
> formatting. Apply project-by-project, review diff, correct. Lots of diffs but 
> it should be relatively quick.
> *Reviewing diffs manually*
>  * switch to branch jira/LUCENE-9570 which the PR is based on:
> {code:java}
> git remote add dweiss g...@github.com:dweiss/lucene-solr.git
> git fetch dweiss
> git checkout jira/LUCENE-9570
> {code}
>  * Open gradle/validation/spotless.gradle and locate the project/ package you 
> wish to review. Enable it in spotless.gradle by creating a corresponding 
> switch case block (refer to existing examples), for example:
> {code:java}
>   case ":lucene:highlighter":
> target "src/**"
> targetExclude "**/resources/**", "**/overview.html"
> break
> {code}
>  * Reformat the code:
> {code:java}
> gradlew tidy && git diff -w > /tmp/diff.patch && git status
> {code}
>  * Look at what has changed (git status) and review the differences manually 
> (/tmp/diff.patch). If everything looks ok, commit it directly to 
> jira/LUCENE-9570 or make a PR against that branch.
> {code:java}
> git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
> {code}
> *Packages remaining* (put your name next to a module you're working on to 
> avoid 

[jira] [Created] (SOLR-15061) Fix NPE in SearchHandler when shards.info

2020-12-24 Thread Bruno Roustant (Jira)
Bruno Roustant created SOLR-15061:
-

 Summary: Fix NPE in SearchHandler when shards.info
 Key: SOLR-15061
 URL: https://issues.apache.org/jira/browse/SOLR-15061
 Project: Solr
  Issue Type: Bug
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Bruno Roustant
Assignee: Bruno Roustant


This NPE happens in a specific case:
 - Short-circuited distributed request
 - With shards.info
 - With no QueryComponent (e.g. only spellcheck)

One-liner fix in SearchHandler.handleRequestBody().






[jira] [Updated] (SOLR-15060) Introduce DelegatingDirectoryFactory

2020-12-23 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-15060:
--
Description: 
FilterDirectory already exists to delegate to a Directory, but there is no 
delegating DirectoryFactory.

+Use cases:+

A DelegatingDirectoryFactory could be used in SOLR-15051 to make 
BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. 
MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be 
configured in solrconfig.xml then it allows any user to change the delegate 
DirectoryFactory without code.

A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory 
that could encrypt/decrypt any delegate DirectoryFactory. Here again it should 
be configurable in solrconfig.xml.

+Problem:+
 But currently DirectoryFactory delegation does not work with a delegate 
CachingDirectoryFactory because the get() method creates internally a Directory 
instance of a type which cannot be controlled by the caller (the 
DelegatingDirectoryFactory).

+Proposal:+
 So here we propose to change DirectoryFactory.get() method by adding a fourth 
parameter Function that allows the caller to wrap the 
internal Directory with a custom FilterDirectory when it is created. Hence we 
would have a DelegatingDirectoryFactory that could delegate the creation of 
some FilterDirectory.
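
The proposed wrapper parameter can be sketched like this. This is a minimal illustration of the idea, not Solr's actual DirectoryFactory API: the interfaces and class names below are hypothetical stand-ins, and the generic type of the Function parameter is assumed to be Directory-to-Directory.

```java
import java.util.function.Function;

// Sketch of the proposal: DirectoryFactory.get() takes an extra
// Function parameter so the caller can wrap the internally created
// Directory with a custom FilterDirectory-style wrapper.
// All names here are illustrative, not the real Solr API.
public class DelegatingFactorySketch {
    interface Directory { String name(); }

    static class RamDirectory implements Directory {
        public String name() { return "ram"; }
    }

    // A FilterDirectory-style wrapper the delegating factory might supply.
    static class TracingDirectory implements Directory {
        private final Directory in;
        TracingDirectory(Directory in) { this.in = in; }
        public String name() { return "tracing(" + in.name() + ")"; }
    }

    // The factory still creates the Directory internally, but the caller
    // controls how it is wrapped before being cached/returned.
    static Directory get(String path, Function<Directory, Directory> wrapper) {
        Directory internal = new RamDirectory(); // created internally
        return wrapper.apply(internal);
    }

    public static void main(String[] args) {
        Directory plain = get("/tmp/index", Function.identity());
        Directory wrapped = get("/tmp/index", TracingDirectory::new);
        System.out.println(plain.name() + " " + wrapped.name());
    }
}
```

Passing Function.identity() preserves today's behavior, so existing callers would be unaffected by the new parameter.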

  was:
FilterDirectory already exists to delegate to a Directory, but there is no 
delegating DirectoryFactory.

+Use cases:+

A DelegatingDirectoryFactory could be used in SOLR-15051 to make 
BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. 
MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be 
configured in solrconfig.xml then it allows any user to change the delegate 
DirectoryFactory without code.

A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory 
that could encrypt/decrypt any delegate DirectoryFactory. Here again it should 
be configurable in solrconfig.xml.

+Problem:+
But currently DirectoryFactory delegation does not work with a delegate 
CachingDirectoryFactory because the get() method creates internally a Directory 
instance of a type which cannot be controlled by the caller (the 
DelegatingDirectoryFactory).

+Proposal:+
 So here we propose to change DirectoryFactory.get() method by adding a fourth 
parameter Function that allows the caller to wrap the 
internal Directory with a custom FilterDirectory when it is created.

+Benefit:+
Hence we would have a DelegatingDirectoryFactory that could delegate the 
creation of some FilterDirectory.


> Introduce DelegatingDirectoryFactory
> 
>
> Key: SOLR-15060
> URL: https://issues.apache.org/jira/browse/SOLR-15060
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> FilterDirectory already exists to delegate to a Directory, but there is no 
> delegating DirectoryFactory.
> +Use cases:+
> A DelegatingDirectoryFactory could be used in SOLR-15051 to make 
> BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. 
> MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be 
> configured in solrconfig.xml then it allows any user to change the delegate 
> DirectoryFactory without code.
> A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory 
> that could encrypt/decrypt any delegate DirectoryFactory. Here again it 
> should be configurable in solrconfig.xml.
> +Problem:+
>  But currently DirectoryFactory delegation does not work with a delegate 
> CachingDirectoryFactory because the get() method creates internally a 
> Directory instance of a type which cannot be controlled by the caller (the 
> DelegatingDirectoryFactory).
> +Proposal:+
>  So here we propose to change DirectoryFactory.get() method by adding a 
> fourth parameter Function that allows the caller to 
> wrap the internal Directory with a custom FilterDirectory when it is created. 
> Hence we would have a DelegatingDirectoryFactory that could delegate the 
> creation of some FilterDirectory.






[jira] [Updated] (SOLR-15060) Introduce DelegatingDirectoryFactory

2020-12-23 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-15060:
--
Description: 
FilterDirectory already exists to delegate to a Directory, but there is no 
delegating DirectoryFactory.

+Use cases:+

A DelegatingDirectoryFactory could be used in SOLR-15051 to make 
BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. 
MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be 
configured in solrconfig.xml then it allows any user to change the delegate 
DirectoryFactory without code.

A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory 
that could encrypt/decrypt any delegate DirectoryFactory. Here again it should 
be configurable in solrconfig.xml.

+Problem:+
But currently DirectoryFactory delegation does not work with a delegate 
CachingDirectoryFactory because the get() method creates internally a Directory 
instance of a type which cannot be controlled by the caller (the 
DelegatingDirectoryFactory).

+Proposal:+
 So here we propose to change DirectoryFactory.get() method by adding a fourth 
parameter Function that allows the caller to wrap the 
internal Directory with a custom FilterDirectory when it is created.

+Benefit:+
Hence we would have a DelegatingDirectoryFactory that could delegate the 
creation of some FilterDirectory.

  was:
FilterDirectory already exists to delegate to a Directory, but there is no 
delegating DirectoryFactory.

+Use cases:+

A DelegatingDirectoryFactory could be used in SOLR-15051 to make 
BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. 
MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be 
configured in solrconfig.xml then it allows any user to change the delegate 
DirectoryFactory without code.

A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory 
that could encrypt/decrypt any delegate DirectoryFactory. Here again it should 
be configurable in solrconfig.xml.

+Problem:
+But currently DirectoryFactory delegation does not work with a delegate 
CachingDirectoryFactory because the get() method creates internally a Directory 
instance of a type which cannot be controlled by the caller (the 
DelegatingDirectoryFactory).

+Proposal:+
So here we propose to change DirectoryFactory.get() method by adding a fourth 
parameter Function that allows the caller to wrap the 
internal Directory with a custom FilterDirectory when it is created.

+Benefit:
+Hence we would have a DelegatingDirectoryFactory that could delegate the 
creation of some FilterDirectory.


> Introduce DelegatingDirectoryFactory
> 
>
> Key: SOLR-15060
> URL: https://issues.apache.org/jira/browse/SOLR-15060
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>
> FilterDirectory already exists to delegate to a Directory, but there is no 
> delegating DirectoryFactory.
> +Use cases:+
> A DelegatingDirectoryFactory could be used in SOLR-15051 to make 
> BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. 
> MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be 
> configured in solrconfig.xml then it allows any user to change the delegate 
> DirectoryFactory without code.
> A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory 
> that could encrypt/decrypt any delegate DirectoryFactory. Here again it 
> should be configurable in solrconfig.xml.
> +Problem:+
> But currently DirectoryFactory delegation does not work with a delegate 
> CachingDirectoryFactory because the get() method creates internally a 
> Directory instance of a type which cannot be controlled by the caller (the 
> DelegatingDirectoryFactory).
> +Proposal:+
>  So here we propose to change DirectoryFactory.get() method by adding a 
> fourth parameter Function that allows the caller to 
> wrap the internal Directory with a custom FilterDirectory when it is created.
> +Benefit:+
> Hence we would have a DelegatingDirectoryFactory that could delegate the 
> creation of some FilterDirectory.






[jira] [Created] (SOLR-15060) Introduce DelegatingDirectoryFactory

2020-12-23 Thread Bruno Roustant (Jira)
Bruno Roustant created SOLR-15060:
-

 Summary: Introduce DelegatingDirectoryFactory
 Key: SOLR-15060
 URL: https://issues.apache.org/jira/browse/SOLR-15060
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
Reporter: Bruno Roustant
Assignee: Bruno Roustant


FilterDirectory already exists to delegate to a Directory, but there is no 
delegating DirectoryFactory.

+Use cases:+

A DelegatingDirectoryFactory could be used in SOLR-15051 to make 
BlobDirectoryFactory easily delegate to any DirectoryFactory (e.g. 
MMapDirectoryFactory). If the DelegatingDirectoryFactory delegation can be 
configured in solrconfig.xml then it allows any user to change the delegate 
DirectoryFactory without code.

A DelegatingDirectoryFactory could be used by an EncryptingDirectoryFactory 
that could encrypt/decrypt any delegate DirectoryFactory. Here again it should 
be configurable in solrconfig.xml.

+Problem:+
But currently DirectoryFactory delegation does not work with a delegate 
CachingDirectoryFactory because the get() method creates internally a Directory 
instance of a type which cannot be controlled by the caller (the 
DelegatingDirectoryFactory).

+Proposal:+
So here we propose to change DirectoryFactory.get() method by adding a fourth 
parameter Function that allows the caller to wrap the 
internal Directory with a custom FilterDirectory when it is created.

+Benefit:+
Hence we would have a DelegatingDirectoryFactory that could delegate the 
creation of some FilterDirectory.






[jira] [Resolved] (SOLR-14975) Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames

2020-11-12 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved SOLR-14975.
---
Resolution: Fixed

Thanks Erick and David for the review!

> Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames 
> --
>
> Key: SOLR-14975
> URL: https://issues.apache.org/jira/browse/SOLR-14975
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Priority: Major
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> The methods CoreContainer.getAllCoreNames and getLoadedCoreNames hold a lock 
> while they grab core names to put into a TreeSet.  When there are *many* 
> cores, this delay is noticeable.  Holding this lock effectively blocks 
> queries since queries lookup a core; so it's critically important that these 
> methods are *fast*.  The tragedy here is that some callers merely want to 
> know if a particular name is in the set, or what the aggregated size is.  
> Some callers want to iterate the names but don't really care what the 
> iteration order is.
> I propose that some callers of these two methods find suitable alternatives, 
> like getCoreDescriptor to check for null.  And I propose that these methods 
> return a HashSet -- no order.  If the caller wants it sorted, it can do so 
> itself.






[jira] [Commented] (SOLR-14975) Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames

2020-11-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17227546#comment-17227546
 ] 

Bruno Roustant commented on SOLR-14975:
---

I added a PR.

I believe the descriptors are a superset of the loaded cores, and the 
permanent cores are distinct from the transient cores.

I simplified the logic to create the lists based on distinct sets, and I added 
assertions to verify the sets are indeed distinct. All tests pass.
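
The distinct-sets approach above can be sketched as follows. This is a simplified stand-in for CoreContainer's internals (the field names are hypothetical): the union is built into an unordered HashSet, as the issue proposes, and an assertion checks the distinctness invariant instead of silently deduplicating.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Sketch: build an unordered union of distinct core-name sets.
// Callers that want sorted output sort it themselves.
public class CoreNamesSketch {
    // Hypothetical stand-ins for the container's internal collections.
    static final Set<String> permanentCores = Set.of("collection1", "collection2");
    static final Set<String> transientCores = Set.of("transient1");

    static Set<String> getLoadedCoreNames() {
        Set<String> names = new HashSet<>(permanentCores);
        for (String t : transientCores) {
            boolean added = names.add(t);
            // The sets are expected to be distinct; assert rather than dedupe.
            assert added : "transient core also listed as permanent: " + t;
        }
        return names; // unordered, cheap to build under the lock
    }

    public static void main(String[] args) {
        // A caller that needs ordering sorts on its side.
        System.out.println(new TreeSet<>(getLoadedCoreNames()));
    }
}
```

Returning a HashSet keeps the time spent under the container lock proportional to the number of cores, without the extra log-factor cost of building a TreeSet there.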

> Optimize CoreContainer.getAllCoreNames and getLoadedCoreNames 
> --
>
> Key: SOLR-14975
> URL: https://issues.apache.org/jira/browse/SOLR-14975
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: David Smiley
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The methods CoreContainer.getAllCoreNames and getLoadedCoreNames hold a lock 
> while they grab core names to put into a TreeSet.  When there are *many* 
> cores, this delay is noticeable.  Holding this lock effectively blocks 
> queries since queries lookup a core; so it's critically important that these 
> methods are *fast*.  The tragedy here is that some callers merely want to 
> know if a particular name is in the set, or what the aggregated size is.  
> Some callers want to iterate the names but don't really care what the 
> iteration order is.
> I propose that some callers of these two methods find suitable alternatives, 
> like getCoreDescriptor to check for null.  And I propose that these methods 
> return a HashSet -- no order.  If the caller wants it sorted, it can do so 
> itself.






[jira] [Resolved] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()

2020-10-27 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-9455.

Fix Version/s: 8.8
   Resolution: Fixed

Thanks [~zacharymorn]

> ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
> ---
>
> Key: LUCENE-9455
> URL: https://issues.apache.org/jira/browse/LUCENE-9455
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Reporter: David Smiley
>Priority: Major
>  Labels: newdev
> Fix For: 8.8
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> ExitableTermsEnum calls "checkAndThrow" on *every* call to next().  This is 
> too expensive; it should sample.  I observed ElasticSearch uses the same 
> approach; I think Lucene would benefit from this:
> https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151
> CC [~jimczi]






[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()

2020-10-22 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219291#comment-17219291
 ] 

Bruno Roustant commented on LUCENE-9455:


I plan to merge tomorrow.

> ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
> ---
>
> Key: LUCENE-9455
> URL: https://issues.apache.org/jira/browse/LUCENE-9455
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Reporter: David Smiley
>Priority: Major
>  Labels: newdev
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> ExitableTermsEnum calls "checkAndThrow" on *every* call to next().  This is 
> too expensive; it should sample.  I observed ElasticSearch uses the same 
> approach; I think Lucene would benefit from this:
> https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151
> CC [~jimczi]






[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()

2020-10-22 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218810#comment-17218810
 ] 

Bruno Roustant commented on LUCENE-9455:


Your investigation revealed something I overlooked: 
MultiTermQueryConstantScoreWrapper's BOOLEAN_REWRITE_TERM_COUNT_THRESHOLD 
prevents creating too many TermQuery instances, so there won't be too many 
calls to the ExitableTermsEnum constructor. We actually don't need to sample 
it.

> ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
> ---
>
> Key: LUCENE-9455
> URL: https://issues.apache.org/jira/browse/LUCENE-9455
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Reporter: David Smiley
>Priority: Major
>  Labels: newdev
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> ExitableTermsEnum calls "checkAndThrow" on *every* call to next().  This is 
> too expensive; it should sample.  I observed ElasticSearch uses the same 
> approach; I think Lucene would benefit from this:
> https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151
> CC [~jimczi]






[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()

2020-10-21 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218367#comment-17218367
 ] 

Bruno Roustant commented on LUCENE-9455:


I propose to also sample the call to QueryTimeout.shouldExit() in 
ExitableTermsEnum constructor with

(System.identityHashCode(this) & TIMEOUT_CHECK_SAMPLING) == 0
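
The identity-hash-based sampling proposed above can be sketched like this. This is an illustrative model, not the actual patch: the mask value and counters are assumptions, and the check stands in for a real QueryTimeout.shouldExit() call.

```java
// Sketch of sampling a per-construction check via the object's identity
// hash: only objects whose low mask bits are zero run the (relatively
// expensive) timeout check. Mask value here is illustrative.
public class ConstructorSamplingSketch {
    private static final int TIMEOUT_CHECK_SAMPLING = 0xF; // ~1 in 16

    static int constructed = 0;
    static int checked = 0;

    static final class SampledEnum {
        SampledEnum() {
            constructed++;
            if ((System.identityHashCode(this) & TIMEOUT_CHECK_SAMPLING) == 0) {
                checked++; // stand-in for queryTimeout.shouldExit()
            }
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            new SampledEnum();
        }
        // Identity hashes are well distributed, so roughly 1/16 of
        // constructions should hit the check; far fewer than 1/4.
        System.out.println(checked < constructed / 4);
    }
}
```

The appeal of System.identityHashCode here is that it needs no shared counter, so the sampling decision is free of contention across threads.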

> ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
> ---
>
> Key: LUCENE-9455
> URL: https://issues.apache.org/jira/browse/LUCENE-9455
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Reporter: David Smiley
>Priority: Major
>  Labels: newdev
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> ExitableTermsEnum calls "checkAndThrow" on *every* call to next().  This is 
> too expensive; it should sample.  I observed ElasticSearch uses the same 
> approach; I think Lucene would benefit from this:
> https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151
> CC [~jimczi]






[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()

2020-10-21 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218292#comment-17218292
 ] 

Bruno Roustant commented on LUCENE-9455:


Thanks [~zacharymorn]. I added comments to the review.

Fyi watchers, I have a broader question in the PR, I repeat it here:

Overall I wonder if we can do better with the sampling. The goal is to avoid 
numerous repetitive calls to QueryTimeout.shouldExit(). This is essentially 
the case for multi-term queries. But for multi-term queries, a new TermsEnum 
is created for each matching term (in TermQuery.getTermsEnum(), to get doc 
ids). So we end up sampling only half of the calls to 
QueryTimeout.shouldExit(), since the other half comes from the 
ExitableTermsEnum constructor, which is not sampled.
It would be better to also sample the ExitableTermsEnum constructor, but I 
don't know yet how to do that.
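
The next()-side sampling that this issue is about can be sketched as a simple call counter. This is an illustrative model, not the committed Lucene code: the mask value and names are assumptions, and shouldExit() here stands in for the real QueryTimeout.

```java
// Sketch of sampling the timeout check in next(): only every Nth call
// (N = mask + 1) actually invokes the stand-in shouldExit(). The mask
// value and class names are illustrative.
public class NextSamplingSketch {
    private static final int SAMPLE_MASK = 0xFF; // check every 256th call

    static int shouldExitCalls = 0;

    static boolean shouldExit() { // stand-in for QueryTimeout.shouldExit()
        shouldExitCalls++;
        return false;
    }

    static final class SampledTermsEnum {
        private int calls;

        String next() {
            if ((calls++ & SAMPLE_MASK) == 0 && shouldExit()) {
                throw new RuntimeException("time limit exceeded while iterating terms");
            }
            return "term" + calls;
        }
    }

    public static void main(String[] args) {
        SampledTermsEnum e = new SampledTermsEnum();
        for (int i = 0; i < 10_000; i++) {
            e.next();
        }
        // 10,000 calls with a 256-call stride -> 40 timeout checks.
        System.out.println(shouldExitCalls);
    }
}
```

The trade-off is latency of detection: with a stride of 256, a timeout is noticed up to 255 terms late, which is usually negligible next to the cost of checking a wall clock on every term.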

> ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
> ---
>
> Key: LUCENE-9455
> URL: https://issues.apache.org/jira/browse/LUCENE-9455
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Reporter: David Smiley
>Priority: Major
>  Labels: newdev
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> ExitableTermsEnum calls "checkAndThrow" on *every* call to next().  This is 
> too expensive; it should sample.  I observed ElasticSearch uses the same 
> approach; I think Lucene would benefit from this:
> https://github.com/elastic/elasticsearch/blob/4af4eb99e18fdaadac879b1223e986227dd2ee71/server/src/main/java/org/elasticsearch/search/internal/ExitableDirectoryReader.java#L151
> CC [~jimczi]






[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()

2020-10-15 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214491#comment-17214491
 ] 

Bruno Roustant commented on LUCENE-9455:


I planned to work on this (I still plan to), but I'm actually too busy with 
other stuff. So if you want to try it out, yes, please share a PR here. I'll 
be glad to participate in the review.

> ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
> ---
>
> Key: LUCENE-9455
> URL: https://issues.apache.org/jira/browse/LUCENE-9455
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Reporter: David Smiley
>Priority: Major
>






[jira] [Resolved] (SOLR-14905) Update commons-io version to 2.8.0 due to security vulnerability

2020-10-01 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved SOLR-14905.
---
Fix Version/s: 8.7
   Resolution: Fixed

Thanks [~nazerke], this is in.

> Update commons-io version to 2.8.0 due to security vulnerability
> 
>
> Key: SOLR-14905
> URL: https://issues.apache.org/jira/browse/SOLR-14905
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: security
>Affects Versions: 8.6.2
>Reporter: Nazerke Seidan
>Priority: Minor
> Fix For: 8.7
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> The {{commons-io}} (version 2.6) package is vulnerable to Path Traversal. The 
> {{getPrefixLength}} method in {{FilenameUtils.class}} improperly verifies the 
> hostname value received from user input before processing client requests.
> The issue has been fixed from 2.7 onward:
> (https://issues.apache.org/jira/browse/IO-556, 
> https://issues.apache.org/jira/browse/IO-559) 
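The class of bug behind this vulnerability is path traversal: a crafted relative name escapes the intended base directory. A generic mitigation sketch (not commons-io code; class and method names here are invented) is to resolve the user-supplied name against the base directory, normalize, and verify containment:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Generic path-traversal guard: the normalized result of resolving the
// user-supplied name must still start with the base directory.
public class PathCheck {
    static boolean isInside(Path base, String userSupplied) {
        Path resolved = base.resolve(userSupplied).normalize();
        return resolved.startsWith(base.normalize());
    }

    public static void main(String[] args) {
        Path base = Paths.get("/var/data");
        System.out.println(isInside(base, "reports/2020.csv")); // true
        System.out.println(isInside(base, "../../etc/passwd")); // false
    }
}
```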






[jira] [Commented] (SOLR-14905) Update commons-io version to 2.8.0 due to security vulnerability

2020-09-30 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204715#comment-17204715
 ] 

Bruno Roustant commented on SOLR-14905:
---

Thanks Nazerke. I'm testing your PR on my side also.

> Update commons-io version to 2.8.0 due to security vulnerability
> 
>
> Key: SOLR-14905
> URL: https://issues.apache.org/jira/browse/SOLR-14905
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: security
>Affects Versions: 8.6.2
>Reporter: Nazerke Seidan
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (SOLR-14819) SOLR has linear log operations that could trivially be linear

2020-09-03 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved SOLR-14819.
---
Fix Version/s: 8.7
   Resolution: Fixed

Thanks Thomas for this fix.

Please share the info about this detection tool on the Lucene/Solr dev list 
([https://lucene.apache.org/solr/community.html#mailing-lists-irc]) for a wider 
audience and discussion.

Personally I find the report verbose, so I wonder how verbose it would be on a 
specific Jira issue. Clearly the difficulty is avoiding too many false 
detections.

> SOLR has linear log operations that could trivially be linear
> -
>
> Key: SOLR-14819
> URL: https://issues.apache.org/jira/browse/SOLR-14819
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Thomas DuBuisson
>Priority: Trivial
> Fix For: 8.7
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The SOLR code has a few linear-log operations that could trivially be linear. 
> That is, operations of
> ```
> for (key in hashmap) doThing(hashmap.get(key));
> ```
> vs just `for (value in hashmap) doThing(value)`
> I have a PR incoming on GitHub to fix a couple of these issues as [found by 
> Infer on 
> Muse|https://console.muse.dev/result/TomMD/lucene-solr/01EH5WXS6C1RH1NFYHP6ATXTZ9?search=JsonSchemaValidator=results].
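The pattern the report flags, and its fix, can be sketched in Java as follows (hypothetical names; note that for a HashMap the redundant get() costs an extra O(1) lookup per key, while for a sorted map each lookup would be O(log n), i.e. the linear-log case):

```java
import java.util.HashMap;
import java.util.Map;

public class MapIteration {
    // Flagged form: one redundant map lookup per key on top of the iteration.
    static long sumByKeyLookup(Map<String, Integer> m) {
        long sum = 0;
        for (String key : m.keySet()) {
            sum += m.get(key); // redundant lookup
        }
        return sum;
    }

    // Preferred form: iterate entries (or values) directly, no extra lookups.
    static long sumByEntries(Map<String, Integer> m) {
        long sum = 0;
        for (Map.Entry<String, Integer> e : m.entrySet()) {
            sum += e.getValue();
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<>();
        m.put("a", 1); m.put("b", 2); m.put("c", 3);
        System.out.println(sumByKeyLookup(m) == sumByEntries(m)); // true
    }
}
```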






[jira] [Resolved] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-09-02 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved SOLR-14782.
---
Fix Version/s: 8.7
   Resolution: Fixed

No code change. Fixed by adding an example of how to unescape for the 
QueryElevationComponent in the documentation.

Thanks Thomas for pointing this out.

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Fix For: 8.7
>
> Attachments: SOLR-14782.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> h1. Description
> if the elevate.xml contains an entry with spaces:
> {{<query text="aaa bbb"><doc id="core2docId2" />}}
> and the Solr query term is escaped:
> {{?q=aaa\+bbb}}
> the Solr search itself handles this correctly, but the elevate component 
> "QueryElevationComponent" does not unescape the query term before the lookup 
> in the elevate.xml.
> The result is that the entry is not elevated.
> An equally valid (not escaped) query like:
> {{?q=aaa%20bbb}}
> works.
> h1. Technical Notes
> see:
> org.apache.solr.handler.component.QueryElevationComponent.MapElevationProvider#getElevationForQuery






[jira] [Commented] (SOLR-14819) SOLR has linear log operations that could trivially be linear

2020-09-02 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189211#comment-17189211
 ] 

Bruno Roustant commented on SOLR-14819:
---

[~tommd] were there other issues found by Infer on Muse?

> SOLR has linear log operations that could trivially be linear
> -
>
> Key: SOLR-14819
> URL: https://issues.apache.org/jira/browse/SOLR-14819
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Thomas DuBuisson
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>






[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-09-02 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189123#comment-17189123
 ] 

Bruno Roustant commented on SOLR-14782:
---

I added a PR for the doc improvement. I think it will be sufficient (no need 
for an UnescapeCharFilterFactory).

Thomas, you should use a StandardTokenizerFactory because with 8.3 it becomes 
possible to match a subset of the query terms (not always a full match 
anymore), see SOLR-11866. If you use a KeywordTokenizerFactory, it produces 
only the full query as a single token, so no subset matching is possible.

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Attachments: SOLR-14782.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>






[jira] [Comment Edited] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-09-02 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189098#comment-17189098
 ] 

Bruno Roustant edited comment on SOLR-14782 at 9/2/20, 9:17 AM:


[~smk] you should add a CharFilter to unescape for query elevation. Instead of 
using lowercase for the queryFieldType you could use unescapelowercase with the 
following definition:
{code:java}

  



  
{code}
[~dsmiley] this regex pattern replacement CharFilter is not easy. Should we 
have a new simpler and equivalent UnescapeCharFilterFactory? In any case I 
think it would be nice to add a paragraph to explain that in the 
QueryElevationComponent doc.


was (Author: broustant):
[~smk] you should add a CharFilter to unescape for query elevation. Instead of 
using lowercase for the queryFieldType you could use unescapelowercase with the 
following definition:
{code:java}

  



  
{code}
[~dsmiley] this regex pattern replacement CharFilter is not easy. Should we 
have a new simpler and equivalent UnescapeCharFilterFactory? In any case I 
think it would be nice to add a paragraph to explain that in the 
QueryElevationComponent doc.

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Attachments: SOLR-14782.patch
>
>






[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-09-02 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189098#comment-17189098
 ] 

Bruno Roustant commented on SOLR-14782:
---

[~smk] you should add a CharFilter to unescape for query elevation. Instead of 
using lowercase for the queryFieldType you could use unescapelowercase with the 
following definition:
{code:java}

  



  
{code}
[~dsmiley] this regex pattern replacement CharFilter is not easy. Should we 
have a new simpler and equivalent UnescapeCharFilterFactory? In any case I 
think it would be nice to add a paragraph to explain that in the 
QueryElevationComponent doc.
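The {code} blocks in the comments above were emptied by the mailing-list archiver, so the actual field type definition is lost. As a purely hypothetical reconstruction of what such an "unescapelowercase" field type could look like (the factory class names exist in Solr, but the pattern and overall structure here are assumptions, not the original XML):

```xml
<!-- Hypothetical reconstruction; the original XML was lost in archiving. -->
<fieldType name="unescapelowercase" class="solr.TextField">
  <analyzer>
    <!-- Strip backslash escapes, e.g. "aaa\ bbb" becomes "aaa bbb" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\\(.)" replacement="$1"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The field type named here would then be referenced by the elevate search component's queryFieldType setting.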

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Attachments: SOLR-14782.patch
>
>






[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-09-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188353#comment-17188353
 ] 

Bruno Roustant commented on SOLR-14782:
---

Ok, so I understand the expectation of unescaping in your use-case, but that's 
not always the case. For example, someone else with a custom analyzer could 
handle it differently: a different way to escape/unescape, or special logic 
when escaped characters are encountered.

Maybe we should have a simple way to configure the escaping for simple 
use-cases. We could enhance the elevation config file (elevate.xml) to support 
an additional  tag. This unescaping would be false by 
default (for back-compat) and could be enabled easily.

What is your opinion [~dsmiley]?

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Attachments: SOLR-14782.patch
>
>






[jira] [Comment Edited] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-09-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188251#comment-17188251
 ] 

Bruno Roustant edited comment on SOLR-14782 at 9/1/20, 8:15 AM:


[~smk] what is the analyzer defined in your schema for the fieldType 
corresponding to the "elevate" searchComponent (the value defined for 
queryFieldType)?

This analyzer is used to tokenize/filter both the elevation rules and the 
search query in the QueryElevationComponent. So this analyzer could unescape 
characters.


was (Author: broustant):
[~smk] what is the analyzer defined in your schema for the fieldType 
corresponding to the "elevate" searchComponent (

value defined for the queryFieldType).

This analyzer is used to tokenize/filter both the elevation rules and the 
search query in the QueryElevationComponent. So this analyzer could unescape 
characters.

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Attachments: SOLR-14782.patch
>
>






[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-09-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188251#comment-17188251
 ] 

Bruno Roustant commented on SOLR-14782:
---

[~smk] what is the analyzer defined in your schema for the fieldType 
corresponding to the "elevate" searchComponent (the value defined for 
queryFieldType)?

This analyzer is used to tokenize/filter both the elevation rules and the 
search query in the QueryElevationComponent. So this analyzer could unescape 
characters.

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Attachments: SOLR-14782.patch
>
>






[jira] [Comment Edited] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-08-31 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187563#comment-17187563
 ] 

Bruno Roustant edited comment on SOLR-14782 at 8/31/20, 8:45 AM:
-

[~smk] could you try again to attach the patch? The current one seems empty.

You can also create a Git PR.


was (Author: broustant):
[~smk] could you retry to attach the patch? The current one seems empty.

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Attachments: SOLR-14782.patch
>
>






[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-08-31 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187563#comment-17187563
 ] 

Bruno Roustant commented on SOLR-14782:
---

[~smk] could you try again to attach the patch? The current one seems empty.

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
> Attachments: SOLR-14782.patch
>
>






[jira] [Commented] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-08-31 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187479#comment-17187479
 ] 

Bruno Roustant commented on SOLR-14782:
---

Looking into this.

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
>






[jira] [Assigned] (SOLR-14782) QueryElevationComponent does not handle escaped query terms

2020-08-31 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant reassigned SOLR-14782:
-

Assignee: Bruno Roustant

> QueryElevationComponent does not handle escaped query terms
> ---
>
> Key: SOLR-14782
> URL: https://issues.apache.org/jira/browse/SOLR-14782
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: query parsers
>Affects Versions: 8.2
>Reporter: Thomas Schmiereck
>Assignee: Bruno Roustant
>Priority: Major
>  Labels: elevation
>






[jira] [Commented] (LUCENE-9455) ExitableTermsEnum (in ExitableDirectoryReader) should sample next()

2020-08-12 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176316#comment-17176316
 ] 

Bruno Roustant commented on LUCENE-9455:


+1

> ExitableTermsEnum (in ExitableDirectoryReader) should sample next()
> ---
>
> Key: LUCENE-9455
> URL: https://issues.apache.org/jira/browse/LUCENE-9455
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/other
>Reporter: David Smiley
>Priority: Major
>






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-08-11 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175583#comment-17175583
 ] 

Bruno Roustant commented on LUCENE-9379:


[~Raji] maybe a better approach would be one tenant per collection, but you 
might have many tenants, so performance with many collections is poor? If that 
is the case, then I think the root problem is the performance of many 
collections. Without the composite id router, you could use OS-level 
encryption per collection.

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> +Important+: This Lucene Directory wrapper approach is to be considered only 
> if OS-level encryption is not possible. OS-level encryption better fits 
> Lucene's usage of the OS cache, and thus is more performant.
> But there are some use-cases where OS-level encryption is not possible. This 
> Jira issue was created to address those.
> 
>  
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.
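A minimal sketch of the kind of cipher such a Directory wrapper might rely on, assuming AES in CTR mode (a stream mode whose keystream can be computed at arbitrary offsets, which suits Lucene's random-access reads). This is not the proposed implementation: the key and IV below are zero-filled placeholders, whereas real code must obtain the key from the caller (never stored on disk) and use a unique IV per file:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CtrRoundTrip {
    // Encrypt or decrypt with AES/CTR; in CTR mode both directions are the
    // same XOR against the keystream, so decryption is cheap and seekable.
    static byte[] crypt(int mode, byte[] key, byte[] iv, byte[] data) throws Exception {
        Cipher c = Cipher.getInstance("AES/CTR/NoPadding");
        c.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return c.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16]; // placeholder key (all zeros) - do not do this
        byte[] iv = new byte[16];  // placeholder IV - must be unique per file
        byte[] plain = "segment bytes".getBytes(StandardCharsets.UTF_8);
        byte[] enc = crypt(Cipher.ENCRYPT_MODE, key, iv, plain);
        byte[] dec = crypt(Cipher.DECRYPT_MODE, key, iv, enc);
        System.out.println(Arrays.equals(plain, dec)); // true: round-trip works
        System.out.println(Arrays.equals(plain, enc)); // false: data is encrypted at rest
    }
}
```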






[jira] [Updated] (LUCENE-9379) Directory based approach for index encryption

2020-07-13 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-9379:
---
Description: 
+Important+: This Lucene Directory wrapper approach is to be considered only if 
OS level encryption is not possible. OS level encryption better fits Lucene's 
usage of the OS cache, and thus is more performant.
But there are some use cases where OS level encryption is not possible. This 
Jira issue was created to address those.



 

The goal is to provide optional encryption of the index, with a scope limited 
to an encryptable Lucene Directory wrapper.

Encryption is at rest on disk, not in memory.

This simple approach should fit any Codec as it would be orthogonal, without 
modifying APIs as much as possible.

Use a standard encryption method. Limit perf/memory impact as much as possible.

Determine how callers provide encryption keys. They must not be stored on disk.

  was:
The goal is to provide optional encryption of the index, with a scope limited 
to an encryptable Lucene Directory wrapper.

Encryption is at rest on disk, not in memory.

This simple approach should fit any Codec as it would be orthogonal, without 
modifying APIs as much as possible.

Use a standard encryption method. Limit perf/memory impact as much as possible.

Determine how callers provide encryption keys. They must not be stored on disk.


> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> +Important+: This Lucene Directory wrapper approach is to be considered only 
> if an OS level encryption is not possible. OS level encryption better fits 
> Lucene usage of OS cache, and thus is more performant.
> But there are some use-case where OS level encryption is not possible. This 
> Jira issue was created to address those.
> 
>  
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-07-13 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156566#comment-17156566
 ] 

Bruno Roustant commented on LUCENE-9379:


I'm going to pause my work on this for some time, until comments are added here 
sharing use cases where OS level encryption is not possible.
If you can use OS level encryption, do so; it will be faster. If not, please 
share your use case here.

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Commented] (LUCENE-9356) Add tests for corruptions caused by byte flips

2020-07-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151951#comment-17151951
 ] 

Bruno Roustant commented on LUCENE-9356:


[8.6 release manager] Is this issue resolved [~jpountz]? I'm checking to 
prepare 8.6 RC tomorrow.

> Add tests for corruptions caused by byte flips
> --
>
> Key: LUCENE-9356
> URL: https://issues.apache.org/jira/browse/LUCENE-9356
> Project: Lucene - Core
>  Issue Type: Test
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.6
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> We already have tests that file truncation and modification of the index 
> headers are caught correctly. I'd like to add another test that flipping a 
> byte in a way that modifies the checksum of the file is always caught 
> gracefully by Lucene.






[jira] [Resolved] (LUCENE-9191) Fix linefiledocs compression or replace in tests

2020-07-06 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-9191.

Resolution: Fixed

[8.6 release manager] This issue appears to be resolved, so I'm changing its 
status for the 8.6 RC.

> Fix linefiledocs compression or replace in tests
> 
>
> Key: LUCENE-9191
> URL: https://issues.apache.org/jira/browse/LUCENE-9191
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Assignee: Michael McCandless
>Priority: Major
> Fix For: 8.6
>
> Attachments: LUCENE-9191.patch, LUCENE-9191.patch, LUCENE-9191.patch
>
>
> LineFileDocs(random) is very slow, even to open. It does a very slow "random 
> skip" through a gzip compressed file.
> For the analyzers tests, in LUCENE-9186 I simply removed its usage, since 
> TestUtil.randomAnalysisString is superior, and fast. But we should address 
> other tests using it, since LineFileDocs(random) is slow!
> I think it is also the case that every lucene test has probably tested every 
> LineFileDocs line many times now, whereas randomAnalysisString will invent 
> new ones.
> Alternatively, we could "fix" LineFileDocs(random), e.g. special compression 
> options (in blocks)... deflate supports such stuff. But it would make it even 
> hairier than it is now.






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-07-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151920#comment-17151920
 ] 

Bruno Roustant commented on LUCENE-9379:


I tested with the FST on-heap: we gain +15% to +20% perf on all queries.

I tested my light version of javax.crypto.Cipher. It is indeed much faster for 
construction and cloning, but not for the core encryption. The reason is that 
two internal classes in com.sun.crypto have an @HotSpotIntrinsicCandidate 
annotation that makes their encryption extremely fast.

I tested a hybrid version that takes the best of the two. It brings a 
cumulative +10% perf improvement.

So, in conclusion for the perf benchmark:
 * OS level encryption is best and fastest.
 * If it's really not possible, expect an average -20% perf impact on most 
queries, -60% on multiterm queries.
 * If you need more, you can put the FST on-heap and gain +15% perf.
 * If you need still more, you can use the Cipher hack to gain +10% perf.

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-07-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151918#comment-17151918
 ] 

Bruno Roustant commented on LUCENE-9379:


Task                        QPS Lucene86 (StdDev)   QPS EncryptionTim (StdDev)       Pct diff
Respell                        41.55  (2.7%)           10.76  (0.9%)         -74.1% ( -75% - -72%)
Fuzzy2                         44.81  (9.0%)           12.00  (1.1%)         -73.2% ( -76% - -69%)
Fuzzy1                         41.03  (7.3%)           16.24  (1.9%)         -60.4% ( -64% - -55%)
Wildcard                       28.02  (4.0%)           14.94  (2.0%)         -46.7% ( -50% - -42%)
OrHighNotLow                  747.43  (4.2%)          485.90  (3.5%)         -35.0% ( -40% - -28%)
OrNotHighMed                  524.60  (4.2%)          344.06  (2.9%)         -34.4% ( -39% - -28%)
OrHighNotHigh                 576.32  (5.0%)          382.60  (4.0%)         -33.6% ( -40% - -25%)
OrHighNotMed                  553.85  (4.1%)          371.73  (3.4%)         -32.9% ( -38% - -26%)
MedTerm                      1116.53  (3.6%)          766.39  (2.6%)         -31.4% ( -36% - -26%)
LowTerm                      1376.31  (4.2%)          947.48  (3.0%)         -31.2% ( -36% - -25%)
OrNotHighLow                  492.68  (4.7%)          342.05  (4.7%)         -30.6% ( -38% - -22%)
AndHighLow                    482.97  (3.8%)          342.18  (3.4%)         -29.2% ( -34% - -22%)
OrHighLow                     410.23  (3.7%)          294.38  (3.8%)         -28.2% ( -34% - -21%)
HighTerm                      971.63  (5.3%)          701.77  (3.2%)         -27.8% ( -34% - -20%)
OrNotHighHigh                 493.99  (5.1%)          358.95  (3.9%)         -27.3% ( -34% - -19%)
LowPhrase                     286.03  (2.9%)          246.04  (2.8%)         -14.0% ( -19% -  -8%)
HighPhrase                    290.25  (3.3%)          252.54  (3.4%)         -13.0% ( -18% -  -6%)
Prefix3                        51.36  (4.8%)           45.20  (4.1%)         -12.0% ( -19% -  -3%)
AndHighMed                    113.34  (4.0%)          105.77  (4.0%)          -6.7% ( -14% -   1%)
MedSloppyPhrase                79.83  (3.5%)           74.78  (3.6%)          -6.3% ( -13% -   0%)
HighTermDayOfYearSort          63.32 (13.3%)           59.34 (14.6%)          -6.3% ( -30% -  24%)
HighTermTitleBDVSort           86.16 (10.3%)           81.63 (10.0%)          -5.3% ( -23% -  16%)
LowSpanNear                    58.07  (3.1%)           55.13  (3.2%)          -5.1% ( -10% -   1%)
AndHighHigh                    44.58  (4.1%)           42.92  (4.2%)          -3.7% ( -11% -   4%)
OrHighMed                      56.53  (4.4%)           54.65  (4.1%)          -3.3% ( -11% -   5%)
BrowseDateTaxoFacets            1.54  (4.6%)            1.50  (5.2%)          -2.5% ( -11% -   7%)
HighTermMonthSort              18.51 (10.5%)           18.06 (10.1%)          -2.4% ( -20% -  20%)
BrowseDayOfYearTaxoFacets       1.53  (4.7%)            1.49  (5.3%)          -2.3% ( -11% -   8%)
BrowseMonthTaxoFacets           1.77  (3.5%)            1.74  (4.2%)          -2.1% (  -9% -   5%)
HighSpanNear                   12.75  (3.6%)           12.50  (4.1%)          -2.0% (  -9% -   5%)
MedPhrase                     107.89  (3.2%)          106.01  (3.9%)          -1.7% (  -8% -   5%)
HighSloppyPhrase               12.86  (4.0%)           12.71  (4.7%)          -1.2% (  -9% -   7%)
MedSpanNear                    11.76  (3.1%)           11.62  (3.4%)          -1.1% (  -7% -   5%)
HighIntervalsOrdered           13.61  (3.2%)           13.46  (3.3%)          -1.1% (  -7% -   5%)
OrHighHigh                     11.12  (3.7%)           11.12  (4.1%)          -0.1% (  -7% -   8%)
BrowseMonthSSDVFacets           4.28  (3.9%)            4.29  (3.9%)           0.2% (  -7% -   8%)
BrowseDayOfYearSSDVFacets       3.82  (3.7%)            3.84  (3.4%)           0.3% (  -6% -   7%)
IntNRQ                         25.54  (3.1%)           26.34  (3.4%)           3.1% (  -3% -   9%)
PKLookup                      174.98  (3.0%)          183.78  (4.5%)           5.0% (  -2% -  12%)
LowSloppyPhrase                 6.29  (3.5%)            6.89  (4.5%)           9.6% (   1% -  18%)

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-07-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151915#comment-17151915
 ] 

Bruno Roustant commented on LUCENE-9379:


I ran the benchmarks to measure the perf impact of this IndexInput-level 
encryption on the PostingsFormat (luceneutil on wikimediumall).

When encrypting only the terms file, FST file and metadata file (.tim .tip 
.tmd), but not doc ids or postings:
 * Most queries run between -0% and -35%.
 * Wildcard: -47%.
 * Fuzzy/Respell: between -60% and -74%.

It is possible to encrypt all files, but then the perf drops considerably: 
-60% for most queries, -90% for fuzzy queries.

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-07-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17151914#comment-17151914
 ] 

Bruno Roustant commented on LUCENE-9379:


[~rcmuir] makes an important callout in the PR: a better approach is to 
leverage OS encryption at the filesystem level, because it fits the OS 
filesystem cache. That way the cached pages are decrypted in the cache.

So whenever it is possible, we should use OS level encryption. OS filesystem 
encryption allows encrypting differently per directory/file, and some 
implementations allow managing multiple keys.

But OS level encryption is not always possible. The example I can think of is 
running on compute engines in a public cloud. In that case we don't have access 
to OS level encryption (there is one, but we cannot manage the keys).

So this Jira issue proposes a solution for the case where we cannot use OS 
level encryption and we need to manage multiple keys. This should be stated 
clearly in the doc/javadoc. It is sub-optimal because it has to decrypt each 
time it accesses a cached IO page, so expect a larger performance impact.

 

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-07-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149694#comment-17149694
 ] 

Bruno Roustant commented on LUCENE-9379:


Watchers, I need your help.

I need to know how you would use the encryption and, more precisely, how you 
would provide the keys.
Is my approach of using either an EncryptingDirectory (in the PR, see 
SimpleEncryptingDirectory) or a custom Codec (in the PR, see EncryptingCodec) 
appropriate for your use case?

Note that both SimpleEncryptingDirectory and EncryptingCodec are only in test 
packages, as I expect users to write some custom code to use encryption. If 
you have an idea for standard code that could be added to make encryption 
easy, please share it here.

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-07-01 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17149691#comment-17149691
 ] 

Bruno Roustant commented on LUCENE-9379:


I updated the PR. Now it is functional and complete, with javadoc.

There should be no perf issue anymore, because I replaced javax.crypto.Cipher 
with much lighter code that is strictly equivalent: encryption/decryption is 
the same (verified randomly by 3 different tests).

For reviewers: there are 33 changed files in the PR but only 10 source classes; 
the others are for tests. Look for the classes in the store package (e.g. 
EncryptingDirectory, EncryptingIndexOutput, EncryptingIndexInput) and the new 
util.crypto package (e.g. AesCtrEncrypter).

Now all tests pass when enabling encryption with a test codec or a test 
directory.

Next step:
 * Run the luceneutil benchmark to evaluate the perf impact.
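The "strictly equivalent" claim above can be illustrated with a small standalone sketch (this is not the PR's AesCtrEncrypter; the class name here is made up for illustration): build the CTR keystream manually from raw AES/ECB block encryptions of an incrementing big-endian counter, and check that the result matches the JDK's AES/CTR output byte for byte.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CtrEquivalenceDemo {
  public static void main(String[] args) throws Exception {
    SecureRandom rnd = new SecureRandom();
    byte[] key = new byte[16], iv = new byte[16], plain = new byte[100];
    rnd.nextBytes(key); rnd.nextBytes(iv); rnd.nextBytes(plain);
    SecretKeySpec k = new SecretKeySpec(key, "AES");

    // Reference: the JDK's built-in AES/CTR.
    Cipher ctr = Cipher.getInstance("AES/CTR/NoPadding");
    ctr.init(Cipher.ENCRYPT_MODE, k, new IvParameterSpec(iv));
    byte[] expected = ctr.doFinal(plain);

    // "Light" CTR: the keystream is just AES block encryptions of the
    // IV treated as a big-endian counter, XORed with the plaintext.
    Cipher ecb = Cipher.getInstance("AES/ECB/NoPadding");
    ecb.init(Cipher.ENCRYPT_MODE, k);
    byte[] counter = iv.clone();
    byte[] actual = new byte[plain.length];
    for (int off = 0; off < plain.length; off += 16) {
      byte[] keystream = ecb.doFinal(counter); // doFinal resets ecb to its init state
      for (int j = 0; j < 16 && off + j < plain.length; j++) {
        actual[off + j] = (byte) (plain[off + j] ^ keystream[j]);
      }
      // Increment the 128-bit big-endian counter, with carry.
      for (int i = 15; i >= 0 && ++counter[i] == 0; i--) {}
    }

    if (!Arrays.equals(expected, actual)) throw new AssertionError("not equivalent");
    System.out.println("equivalent");
  }
}
```

This relies on the fact that SunJCE's CTR mode uses the IV as the initial counter value and increments it as a big-endian integer per block; the random inputs make each run an independent equivalence test.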

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Updated] (SOLR-14537) Improve performance of ExportWriter

2020-06-30 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-14537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-14537:
--
Fix Version/s: (was: 8.6)

> Improve performance of ExportWriter
> ---
>
> Key: SOLR-14537
> URL: https://issues.apache.org/jira/browse/SOLR-14537
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Export Writer
>Reporter: Andrzej Bialecki
>Assignee: Andrzej Bialecki
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Retrieving, sorting and writing out documents in {{ExportWriter}} are three 
> aspects of the /export handler that can be further optimized.
> SOLR-14470 introduced some level of caching in {{StringValue}}. Further 
> options for caching and speedups should be explored.
> Currently the sort/retrieve and write operations are done sequentially, but 
> they could be parallelized, considering that they block on different channels 
> - the first is index reading & CPU bound, the other is bound by the receiving 
> end because it uses blocking IO. The sorting and retrieving of values could 
> be done in parallel with the operation of writing out the current batch of 
> results.
> One possible approach here would be to use "double buffering" where one 
> buffered batch that is ready (already sorted and retrieved) is being written 
> out, while the other batch is being prepared in a background thread, and when 
> both are done the buffers are swapped. This wouldn't complicate the current 
> code too much but it should instantly give up to 2x higher throughput.






[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-06-24 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143886#comment-17143886
 ] 

Bruno Roustant commented on LUCENE-9379:


First PR, functional but incomplete. The idea of using a pool of Ciphers does 
not work in Lucene.

To run the tests, there are two options:

test -Dtests.codec=Encrypting
This executes the tests with the EncryptingCodec in test-framework. Currently 
it encrypts a delegate PostingsFormat. This option shows how to provide the 
encryption key depending on the SegmentInfo.

test 
-Dtests.directory=org.apache.lucene.codecs.encrypting.SimpleEncryptingDirectory
This executes the tests with the SimpleEncryptingDirectory in test-framework. 
This option is the simplest; it shows how to provide the encryption key as a 
constant (could be a property) or depending only on the name of the file to 
encrypt (no SegmentInfo).

 

There is a performance issue because of too many new Ciphers created when 
slicing an IndexInput.
javax.crypto.Cipher is heavyweight to create and is stateful. I tried a 
CipherPool, but there are many cases where we need lots of slices of the 
IndexInput, so we have to create lots of new stateful Ciphers. The pool turns 
out to be a no-go: there are too many Ciphers in it.

TODO:
 * Find a lighter alternative to Cipher, if one exists.
 * Fix a couple of tests still failing because of unclosed IndexOutput.
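The statefulness mentioned above is why one Cipher instance cannot be shared across independent IndexInput slices. A small standalone sketch (illustrative only, not PR code) shows that a CTR Cipher carries its counter position across update() calls: two chunked calls continue one keystream, so interleaving unrelated slices on the same instance would corrupt both streams.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class StatefulCipherDemo {
  public static void main(String[] args) throws Exception {
    SecureRandom rnd = new SecureRandom();
    byte[] key = new byte[16], iv = new byte[16], data = new byte[48];
    rnd.nextBytes(key); rnd.nextBytes(iv); rnd.nextBytes(data);
    SecretKeySpec k = new SecretKeySpec(key, "AES");

    // One-shot encryption of the whole buffer.
    Cipher oneShot = Cipher.getInstance("AES/CTR/NoPadding");
    oneShot.init(Cipher.ENCRYPT_MODE, k, new IvParameterSpec(iv));
    byte[] whole = oneShot.doFinal(data);

    // The same instance continues the keystream across calls: encrypting the
    // buffer in two chunks yields exactly the same bytes as the one-shot pass.
    Cipher chunked = Cipher.getInstance("AES/CTR/NoPadding");
    chunked.init(Cipher.ENCRYPT_MODE, k, new IvParameterSpec(iv));
    byte[] part1 = chunked.update(Arrays.copyOfRange(data, 0, 20));
    byte[] part2 = chunked.doFinal(Arrays.copyOfRange(data, 20, 48));
    byte[] joined = new byte[whole.length];
    System.arraycopy(part1, 0, joined, 0, part1.length);
    System.arraycopy(part2, 0, joined, part1.length, part2.length);

    if (!Arrays.equals(whole, joined)) throw new AssertionError("streams differ");
    System.out.println("stateful-stream-ok");
  }
}
```

Since each slice needs its own keystream position, each effectively needs its own initialized Cipher, which is what makes per-slice Cipher creation (and pooling) so costly.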

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Comment Edited] (LUCENE-9379) Directory based approach for index encryption

2020-06-23 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136783#comment-17136783
 ] 

Bruno Roustant edited comment on LUCENE-9379 at 6/23/20, 12:51 PM:
---

So I plan to implement an EncryptingDirectory extending FilterDirectory.

 

+Encryption method:+

AES CTR (counter)
 * This mode is approved by NIST. 
([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29])
 * AES encryption has the same size as the original clear text (no padding). So 
we can use the same file pointers.
 * CTR mode allows random access to encrypted blocks (128 bits blocks).
 * IV (initialisation vector) must be random, and is stored at the beginning of 
the encrypted file because it can be public. No need to repeat the IV for each 
block (less disk impact compared to CBC mode).
 * It is appropriate to encrypt streams.
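
The random-access property listed above can be illustrated with a small self-contained Java sketch (not code from the PR; the class and helper names are made up for illustration). It encrypts a buffer with the JDK's AES/CTR, then decrypts starting at an arbitrary 16-byte block boundary by adding the block index to the big-endian counter held in the IV, without touching the preceding blocks:

```java
import java.security.SecureRandom;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class CtrRandomAccessDemo {
  static final int BLOCK = 16;

  // Add blockIndex to the 128-bit big-endian counter stored in the IV.
  static byte[] ivAtBlock(byte[] iv, long blockIndex) {
    byte[] out = iv.clone();
    for (int i = out.length - 1; i >= 0 && blockIndex != 0; i--) {
      long sum = (out[i] & 0xFF) + (blockIndex & 0xFF);
      out[i] = (byte) sum;
      blockIndex = (blockIndex >>> 8) + (sum >>> 8); // propagate the carry
    }
    return out;
  }

  public static void main(String[] args) throws Exception {
    SecureRandom rnd = new SecureRandom();
    byte[] key = new byte[16], iv = new byte[16], plain = new byte[10 * BLOCK];
    rnd.nextBytes(key); rnd.nextBytes(iv); rnd.nextBytes(plain);
    SecretKeySpec k = new SecretKeySpec(key, "AES");

    Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
    enc.init(Cipher.ENCRYPT_MODE, k, new IvParameterSpec(iv));
    byte[] cipherText = enc.doFinal(plain); // same length as plain: no padding

    // Random access: decrypt blocks 4..9 only, by seeding the counter at block 4.
    long startBlock = 4;
    Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
    dec.init(Cipher.DECRYPT_MODE, k, new IvParameterSpec(ivAtBlock(iv, startBlock)));
    int start = (int) (startBlock * BLOCK);
    byte[] tail = dec.doFinal(Arrays.copyOfRange(cipherText, start, cipherText.length));

    if (!Arrays.equals(tail, Arrays.copyOfRange(plain, start, plain.length))) {
      throw new AssertionError("CTR random access failed");
    }
    System.out.println("ok");
  }
}
```

Note the sketch also demonstrates the "same file pointers" point: the ciphertext has exactly the plaintext's length, so an encrypted IndexInput can seek to the same offsets as a clear one.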

 

+API:+ 

I don’t anticipate any API change.

 

+How to provide encryption keys:+

EncryptingDirectory would require a delegate Directory, an encryption key 
supplier, and a Cipher pool (for performance).

For the callers to pass the encryption keys, I see two ways:

1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates the 
EncryptingDirectory. This factory is able to determine the encryption key per 
file based on the path. It is the responsibility of this factory to access the 
keys (e.g. stored in a safe DB, received with an admin handler, read from 
properties, etc). The Cipher pool is held by the DirectoryFactory.

2- More generally, the EncryptingDirectory can be created to wrap a Directory 
when opening a segment (e.g. in PostingsFormat/DocValuesFormat 
fieldsConsumer()/fieldsProducer(), in StoredFieldsFormat 
fieldsReader()/fieldsWriter(), etc). In this case the 
PostingsFormat/DocValuesFormat/StoredFieldsFormat extension determines the 
encryption key based on the SegmentInfo. A custom Codec can be created to 
handle encrypting formats. The Cipher pool is held either in the Codec or in 
the Format.

 

+Code:+

I will draw inspiration from Apache commons-crypto's CtrCryptoOutputStream, 
though without using it directly, because it is an OutputStream while we need 
an IndexOutput. We can probably also simplify, since we have a specific use 
case compared to that library's broader usage.


was (Author: broustant):
So I plan to implement an EncryptingDirectory extending FilterDirectory.

 

+Encryption method:+

AES CTR (counter)
 * This mode is approved by NIST. 
([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29])
 * AES encryption has the same size as the original clear text (though the last 
block is padded to 128 bits). So we can use the same file pointers.
 * CTR mode allows random access to encrypted blocks (128 bits blocks).
 * IV (initialisation vector) must be random, and is stored at the beginning of 
the encrypted file because it can be public. No need to repeat the IV for each 
block (less disk impact compared to CBC mode).
 * It is appropriate to encrypt streams.

 

+API:+ 

I don’t anticipate any API change.

 

+How to provide encryption keys:+

EncryptingDirectory would require a delegate Directory, an encryption key 
supplier, and a Cipher pool (for performance).

For the callers to pass the encryption keys, I see two ways:

1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates 
EncryptingDirectory. This factory is able to determine the encryption key per 
file based on the path. It is the responsibility of this factory to access the 
keys (e.g. stored in safe DB, received with an admin handler, read from 
properties, etc). The Cipher pool is hold by the DirectoryFactory.

2- More generally the EncryptingDirectory can be created to wrap a Directory 
when opening a segment (e.g. in PostingsFormat/DocValuesFormat 
fieldsConsumer()/fieldsProducer(), in StoredFieldFormat 
fieldsReader()/fieldsWriter(), etc). In this case the 
PostingsFormat/DocValuesFormat/StoredFieldFormat extension determines the 
encryption key based on the SegmentInfo. A custom Codec can be created to 
handle encrypting formats. The Cipher pool is hold either in the Codec or in 
the Format.

 

+Code:+

I will inspire from Apache commons-crypto CtrCryptoOutputStream, although not 
directly using it because it is an OutputStream while we need an IndexOutput. 
And we can probably simplify since we have a specific use-case compared to this 
lib wide usage.

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable 

[jira] [Commented] (LUCENE-9286) FST arc.copyOf clones BitTables and this can lead to excessive memory use

2020-06-19 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140505#comment-17140505
 ] 

Bruno Roustant commented on LUCENE-9286:


Maybe we're missing a benchmark on FSTEnum traversal speed? Though I don't know 
where we could put it.

> FST arc.copyOf clones BitTables and this can lead to excessive memory use
> -
>
> Key: LUCENE-9286
> URL: https://issues.apache.org/jira/browse/LUCENE-9286
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.5
>Reporter: Dawid Weiss
>Assignee: Bruno Roustant
>Priority: Major
> Fix For: 8.6
>
> Attachments: screen-[1].png
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I see a dramatic increase in the amount of memory required for construction 
> of (arguably large) automata. It currently OOMs with 8GB of memory consumed 
> for bit tables. I am pretty sure this didn't require so much memory before 
> (the automaton is ~50MB after construction).
> Something bad happened in between. Thoughts, [~broustant], [~sokolov]?






[jira] [Comment Edited] (LUCENE-9379) Directory based approach for index encryption

2020-06-16 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136783#comment-17136783
 ] 

Bruno Roustant edited comment on LUCENE-9379 at 6/16/20, 4:25 PM:
--

So I plan to implement an EncryptingDirectory extending FilterDirectory.

 

+Encryption method:+

AES CTR (counter)
 * This mode is approved by NIST. 
([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29])
 * AES-encrypted data has the same size as the original cleartext (though the 
last block is padded to 128 bits), so we can use the same file pointers.
 * CTR mode allows random access to encrypted blocks (128-bit blocks).
 * The IV (initialisation vector) must be random, and is stored at the beginning 
of the encrypted file because it can be public. There is no need to repeat the 
IV for each block (less disk impact compared to CBC mode).
 * It is appropriate for encrypting streams.
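The random-access property of CTR mode listed above can be sketched with the 
JDK's own Cipher. This is not code from the proposed patch; it is a minimal 
illustration, assuming (as the SunJCE provider does) that the 16-byte IV is 
treated as a big-endian counter incremented once per 128-bit block. The 
`ivAtPosition` helper name is hypothetical:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.math.BigInteger;
import java.util.Arrays;

public class CtrSeekDemo {
  static final int AES_BLOCK_SIZE = 16;

  // Build an IV whose counter is advanced to the block containing 'position'.
  // Treats the IV as a big-endian 128-bit integer, like the SunJCE CTR mode.
  static byte[] ivAtPosition(byte[] initialIv, long position) {
    long blockIndex = position / AES_BLOCK_SIZE;
    BigInteger counter = new BigInteger(1, initialIv).add(BigInteger.valueOf(blockIndex));
    byte[] raw = counter.toByteArray();
    byte[] iv = new byte[AES_BLOCK_SIZE];
    // Right-align the counter bytes; drop any overflow beyond 128 bits.
    int srcPos = Math.max(0, raw.length - AES_BLOCK_SIZE);
    int destPos = Math.max(0, AES_BLOCK_SIZE - raw.length);
    System.arraycopy(raw, srcPos, iv, destPos, AES_BLOCK_SIZE - destPos);
    return iv;
  }

  public static void main(String[] args) throws Exception {
    byte[] key = new byte[16]; // demo key; a real key would come from the key supplier
    byte[] iv = new byte[16];  // demo IV; must be random in practice
    SecretKeySpec keySpec = new SecretKeySpec(key, "AES");

    byte[] plain = new byte[100];
    for (int i = 0; i < plain.length; i++) plain[i] = (byte) i;

    // Encrypt the whole buffer once.
    Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
    enc.init(Cipher.ENCRYPT_MODE, keySpec, new IvParameterSpec(iv));
    byte[] cipherText = enc.doFinal(plain);

    // Random access: decrypt only from offset 48 (block 3), without
    // re-reading the first three blocks.
    long offset = 48;
    Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
    dec.init(Cipher.DECRYPT_MODE, keySpec, new IvParameterSpec(ivAtPosition(iv, offset)));
    byte[] tail = dec.doFinal(Arrays.copyOfRange(cipherText, (int) offset, cipherText.length));

    System.out.println(Arrays.equals(tail, Arrays.copyOfRange(plain, (int) offset, plain.length)));
    // prints: true
  }
}
```

For offsets that are not block-aligned, the same trick applies: decrypt from 
the enclosing block boundary and skip the leading bytes.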

 

+API:+ 

I don’t anticipate any API change.

 

+How to provide encryption keys:+

EncryptingDirectory would require a delegate Directory, an encryption key 
supplier, and a Cipher pool (for performance).

For the callers to pass the encryption keys, I see two ways:

1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates 
EncryptingDirectory. This factory is able to determine the encryption key per 
file based on the path. It is the responsibility of this factory to access the 
keys (e.g. stored in a safe DB, received via an admin handler, read from 
properties, etc.). The Cipher pool is held by the DirectoryFactory.

2- More generally, the EncryptingDirectory can be created to wrap a Directory 
when opening a segment (e.g. in PostingsFormat/DocValuesFormat 
fieldsConsumer()/fieldsProducer(), in StoredFieldFormat 
fieldsReader()/fieldsWriter(), etc). In this case the 
PostingsFormat/DocValuesFormat/StoredFieldFormat extension determines the 
encryption key based on the SegmentInfo. A custom Codec can be created to 
handle encrypting formats. The Cipher pool is held either in the Codec or in 
the Format.
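The FilterDirectory shape described above can be sketched as follows. This is a 
hypothetical outline, not the actual patch: EncryptingIndexInput and 
EncryptingIndexOutput are assumed helper classes that would apply AES/CTR 
around the delegate streams, and the per-file key supplier is an illustrative 
signature:

```java
// Sketch only: assumes Lucene's org.apache.lucene.store classes on the
// classpath, and hypothetical EncryptingIndexInput/EncryptingIndexOutput
// helpers that wrap the delegate streams with AES/CTR.
public class EncryptingDirectory extends FilterDirectory {
  private final Function<String, byte[]> keySupplier; // file name -> key, or null to skip

  public EncryptingDirectory(Directory delegate, Function<String, byte[]> keySupplier) {
    super(delegate);
    this.keySupplier = keySupplier;
  }

  @Override
  public IndexOutput createOutput(String name, IOContext context) throws IOException {
    IndexOutput output = in.createOutput(name, context);
    byte[] key = keySupplier.apply(name);
    // Files without a key (e.g. segments files) are written in the clear.
    return key == null ? output : new EncryptingIndexOutput(output, key);
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    IndexInput input = in.openInput(name, context);
    byte[] key = keySupplier.apply(name);
    return key == null ? input : new EncryptingIndexInput(input, key);
  }
}
```

In option 1 the keySupplier would be built by the Solr DirectoryFactory from 
the file path; in option 2 it would be derived from the SegmentInfo by the 
format extension.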

 

+Code:+

I will take inspiration from Apache commons-crypto's CtrCryptoOutputStream, 
although without using it directly, because it is an OutputStream while we need 
an IndexOutput. We can probably also simplify, since we have a specific use 
case compared to that library's broader scope.


was (Author: broustant):
So I plan to implement an EncryptingDirectory extending FilterDirectory.

 

+Encryption method:+

AES CTR (counter)
 * This mode is approved by NIST. 
([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29])
 * AES-encrypted data has the same size as the original cleartext (though the 
last block is padded to 128 bits), so we can use the same file pointers.
 * CTR mode allows random access to encrypted blocks (128-bit blocks).
 * The IV (initialisation vector) must be random, and is stored at the beginning 
of the encrypted file because it can be public.
 * It is appropriate for encrypting streams.

 

+API:+ 

I don’t anticipate any API change.

 

+How to provide encryption keys:+

EncryptingDirectory would require a delegate Directory, an encryption key 
supplier, and a Cipher pool (for performance).

For the callers to pass the encryption keys, I see two ways:

1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates 
EncryptingDirectory. This factory is able to determine the encryption key per 
file based on the path. It is the responsibility of this factory to access the 
keys (e.g. stored in a safe DB, received via an admin handler, read from 
properties, etc.). The Cipher pool is held by the DirectoryFactory.

2- More generally, the EncryptingDirectory can be created to wrap a Directory 
when opening a segment (e.g. in PostingsFormat/DocValuesFormat 
fieldsConsumer()/fieldsProducer(), in StoredFieldFormat 
fieldsReader()/fieldsWriter(), etc). In this case the 
PostingsFormat/DocValuesFormat/StoredFieldFormat extension determines the 
encryption key based on the SegmentInfo. A custom Codec can be created to 
handle encrypting formats. The Cipher pool is held either in the Codec or in 
the Format.

 

+Code:+

I will take inspiration from Apache commons-crypto's CtrCryptoOutputStream, 
although without using it directly, because it is an OutputStream while we need 
an IndexOutput. We can probably also simplify, since we have a specific use 
case compared to that library's broader scope.

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.

[jira] [Commented] (LUCENE-9379) Directory based approach for index encryption

2020-06-16 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136783#comment-17136783
 ] 

Bruno Roustant commented on LUCENE-9379:


So I plan to implement an EncryptingDirectory extending FilterDirectory.

 

+Encryption method:+

AES CTR (counter)
 * This mode is approved by NIST. 
([https://en.wikipedia.org/wiki/Block_cipher_mode_of_operation#Counter_.28CTR.29])
 * AES-encrypted data has the same size as the original cleartext (though the 
last block is padded to 128 bits), so we can use the same file pointers.
 * CTR mode allows random access to encrypted blocks (128-bit blocks).
 * The IV (initialisation vector) must be random, and is stored at the beginning 
of the encrypted file because it can be public.
 * It is appropriate for encrypting streams.

 

+API:+ 

I don’t anticipate any API change.

 

+How to provide encryption keys:+

EncryptingDirectory would require a delegate Directory, an encryption key 
supplier, and a Cipher pool (for performance).

For the callers to pass the encryption keys, I see two ways:

1- In Solr, declare a DirectoryFactory in solrconfig.xml that creates 
EncryptingDirectory. This factory is able to determine the encryption key per 
file based on the path. It is the responsibility of this factory to access the 
keys (e.g. stored in a safe DB, received via an admin handler, read from 
properties, etc.). The Cipher pool is held by the DirectoryFactory.

2- More generally, the EncryptingDirectory can be created to wrap a Directory 
when opening a segment (e.g. in PostingsFormat/DocValuesFormat 
fieldsConsumer()/fieldsProducer(), in StoredFieldFormat 
fieldsReader()/fieldsWriter(), etc). In this case the 
PostingsFormat/DocValuesFormat/StoredFieldFormat extension determines the 
encryption key based on the SegmentInfo. A custom Codec can be created to 
handle encrypting formats. The Cipher pool is held either in the Codec or in 
the Format.

 

+Code:+

I will take inspiration from Apache commons-crypto's CtrCryptoOutputStream, 
although without using it directly, because it is an OutputStream while we need 
an IndexOutput. We can probably also simplify, since we have a specific use 
case compared to that library's broader scope.

> Directory based approach for index encryption
> -
>
> Key: LUCENE-9379
> URL: https://issues.apache.org/jira/browse/LUCENE-9379
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>
> The goal is to provide optional encryption of the index, with a scope limited 
> to an encryptable Lucene Directory wrapper.
> Encryption is at rest on disk, not in memory.
> This simple approach should fit any Codec as it would be orthogonal, without 
> modifying APIs as much as possible.
> Use a standard encryption method. Limit perf/memory impact as much as 
> possible.
> Determine how callers provide encryption keys. They must not be stored on 
> disk.






[jira] [Resolved] (LUCENE-9397) UniformSplit supports encodable fields metadata

2020-06-11 Thread Bruno Roustant (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant resolved LUCENE-9397.

Fix Version/s: 8.6
   Resolution: Fixed

Thanks [~dsmiley] for the review.

> UniformSplit supports encodable fields metadata
> ---
>
> Key: LUCENE-9397
> URL: https://issues.apache.org/jira/browse/LUCENE-9397
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
> Fix For: 8.6
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> UniformSplit already supports custom encoding for term blocks. This is an 
> extension to also support encodable fields metadata.






[jira] [Commented] (LUCENE-9397) UniformSplit supports encodable fields metadata

2020-06-11 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133028#comment-17133028
 ] 

Bruno Roustant commented on LUCENE-9397:


Currently we use the encoder interface to encrypt term blocks, the FST, and the 
fields metadata. We don't attach more data.
However, I'm going to work on LUCENE-9379 for a directory-based approach to 
encryption that would not be tied to a postings format. Eventually we would 
like to move to that solution.

> UniformSplit supports encodable fields metadata
> ---
>
> Key: LUCENE-9397
> URL: https://issues.apache.org/jira/browse/LUCENE-9397
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> UniformSplit already supports custom encoding for term blocks. This is an 
> extension to also support encodable fields metadata.





