[jira] [Comment Edited] (LUCENE-9902) Update faceting API to use modern Java features

2021-04-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314176#comment-17314176
 ] 

Greg Miller edited comment on LUCENE-9902 at 4/3/21, 3:10 AM:
--

+1. It looks like the increment methods in IntTaxonomyFacets are already 
protected, so making getValue protected as well seems pretty reasonable (and 
more consistent).

Since this change would be fully backwards-compatible, should we consider 
making it to 8.9?


was (Author: gsmiller):
+1. It looks like the increment methods in IntTaxonomyFacets are already 
protected, so making getValue protected as well seems pretty reasonable (and 
more consistent).

> Update faceting API to use modern Java features
> ---
>
> Key: LUCENE-9902
> URL: https://issues.apache.org/jira/browse/LUCENE-9902
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Gautam Worah
>Priority: Minor
>
> I was using the {{public int getOrdinal(String dim, String[] path)}} API for 
> a single {{path}} String and found myself creating an array with a single 
> element. We can start using variable length args for this method.
> I also propose this change:
>  I wanted to know the specific count of an ordinal using the 
> {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It 
> would be good if we could change it to {{protected}} so that users can know 
> the value of an ordinal without looking up the {{FacetLabel}} and then 
> checking its value.






[jira] [Commented] (LUCENE-9902) Update faceting API to use modern Java features

2021-04-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314176#comment-17314176
 ] 

Greg Miller commented on LUCENE-9902:
-

+1. It looks like the increment methods in IntTaxonomyFacets are already 
protected, so making getValue protected as well seems pretty reasonable (and 
more consistent).

> Update faceting API to use modern Java features
> ---
>
> Key: LUCENE-9902
> URL: https://issues.apache.org/jira/browse/LUCENE-9902
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Gautam Worah
>Priority: Minor
>
> I was using the {{public int getOrdinal(String dim, String[] path)}} API for 
> a single {{path}} String and found myself creating an array with a single 
> element. We can start using variable length args for this method.
> I also propose this change:
>  I wanted to know the specific count of an ordinal using the 
> {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It 
> would be good if we could change it to {{protected}} so that users can know 
> the value of an ordinal without looking up the {{FacetLabel}} and then 
> checking its value.






[jira] [Created] (LUCENE-9902) Update faceting API to use modern Java features

2021-04-02 Thread Gautam Worah (Jira)
Gautam Worah created LUCENE-9902:


 Summary: Update faceting API to use modern Java features
 Key: LUCENE-9902
 URL: https://issues.apache.org/jira/browse/LUCENE-9902
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/facet
Reporter: Gautam Worah


I was using the {{public int getOrdinal(String dim, String[] path)}} API for a 
single {{path}} String and found myself creating an array with a single 
element. We can start using variable length args for this method.

I also propose this change:
 I wanted to know the specific count of an ordinal using the {{getValue}} 
API from {{IntTaxonomyFacets}} but the method is private. It would be good if 
we could change it to {{protected}} so that users can know the value of an 
ordinal without looking up the {{FacetLabel}} and then checking its value.
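
A minimal sketch of what the proposed varargs overload would look like from a caller's perspective (the helper class and the {{taxoReader}} variable are illustrative; only the existing {{public int getOrdinal(String dim, String[] path)}} signature comes from the current API):
{code:java}
import java.io.IOException;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;

class GetOrdinalSketch {
  // today: a single path component still requires allocating a one-element array
  static int currentApi(TaxonomyReader taxoReader) throws IOException {
    return taxoReader.getOrdinal("Author", new String[] {"Bob"});
  }

  // with a varargs signature -- public int getOrdinal(String dim, String... path) --
  // the same call would simply become (this does not compile until the overload exists):
  //   taxoReader.getOrdinal("Author", "Bob");
}
{code}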






[jira] [Commented] (LUCENE-9648) Add a numbers query in sandbox that takes advantage of index sorting.

2021-04-02 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314123#comment-17314123
 ] 

Julie Tibshirani commented on LUCENE-9648:
--

Thanks [~gf2121] for the interesting ticket and sorry for the slow reply!

In general it seems hard to effectively bound the range of query values. The 
benchmark you posted randomly generates a set of query values from the range 
[0, 1_000_000_000]. As an experiment I drew 1000 random values and looked at 
their range: in a typical example we get min: 128_082, max: 999_833_281. So the 
range covers roughly 99.9% of the possible values. Even with only 2 random query 
values I think the range would cover 1/3 of the possible values in expectation.
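
For reference, a quick way to reproduce the range-coverage observation above (the bound and seed are arbitrary assumptions, not taken from the benchmark itself):
{code:java}
import java.util.Random;
import java.util.stream.LongStream;

class RangeCoverageExperiment {
  public static void main(String[] args) {
    long bound = 1_000_000_000L;
    Random random = new Random(0);
    long[] values = LongStream.generate(() -> (long) (random.nextDouble() * bound))
        .limit(1000)
        .toArray();
    long min = LongStream.of(values).min().getAsLong();
    long max = LongStream.of(values).max().getAsLong();
    // for 1000 uniform draws, the min..max range typically spans ~99.9% of [0, bound)
    System.out.printf("min=%d max=%d coverage=%.3f%%%n", min, max, 100.0 * (max - min) / bound);
  }
}
{code}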

So it seems the speed-up in the benchmark must be from something else -- maybe 
we just check against the query values more efficiently? It'd be interesting to 
dig into this to understand what makes it faster.

I'm also curious if you have a specific use case in mind. My assumption was 
that the query values don't tend to cluster together, but maybe they do in your 
case?

> Add a numbers query in sandbox that takes advantage of index sorting.
> -
>
> Key: LUCENE-9648
> URL: https://issues.apache.org/jira/browse/LUCENE-9648
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Feng Guo
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> LUCENE-6539 introduced DocValuesNumbersQuery to benefit some special cases 
> (many terms/numbers and fewish matching hits). Inspired by LUCENE-7714, maybe 
> we can speed it up when the index is sorted, since we can do a binary search to 
> bound the docvalues before iterating them.
> Here are some simple benchmarks:
> *Based on 50,000,000 doc, each long in query can hit at least one doc:*
> |LongsCount|PointInSetQuery(s/op)|IndexSortDocValuesNumbersQuery(s/op)|DocValuesNumbersQuery(s/op)|
> |1000|0.144 ± 0.001|0.428 ± 0.008|1.159 ± 0.031|
> |5000|0.404 ± 0.003|0.424 ± 0.008|1.343 ± 0.022|
> |10000|0.661 ± 0.012|0.436 ± 0.008|1.372 ± 0.011|
> |50000|1.319 ± 0.011|0.450 ± 0.005|1.173 ± 0.017|
> |100000|1.328 ± 0.007|0.459 ± 0.003|1.160 ± 0.098|
> *Based on 100,000,000 doc, each long in query can hit at least one doc:*
> |LongsCount|PointInSetQuery(s/op)|IndexSortDocValuesNumbersQuery(s/op)|DocValuesNumbersQuery(s/op)|
> |1000|0.189 ± 0.003|0.860 ± 0.028|2.408 ± 0.087|
> |5000|0.715 ± 0.010|0.879 ± 0.010|2.642 ± 0.105|
> |10000|1.075 ± 0.034|0.859 ± 0.010|2.629 ± 0.077|
> |50000|2.652 ± 0.114|0.865 ± 0.009|2.737 ± 0.099|
> |100000|2.905 ± 0.027|0.881 ± 0.005|2.820 ± 0.474|
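
To make the binary-search idea from the description concrete, here is a rough sketch of the bounding step over an already-sorted column of values (a plain array stands in for the doc-values iterator; nothing here is the actual sandbox query code):
{code:java}
class IndexSortBoundSketch {
  /**
   * Returns the index of the first position in sortedValues whose value is >= target,
   * i.e. the lower bound from which iteration would start when the index is sorted
   * on this field, instead of scanning from doc 0.
   */
  static int lowerBound(long[] sortedValues, long target) {
    int lo = 0, hi = sortedValues.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedValues[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }
}
{code}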






[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314116#comment-17314116
 ] 

Greg Miller commented on LUCENE-9850:
-

I think I've actually made good progress on optimizing this. The last bit to 
crack was reading the exception bytes in all at once instead of byte-by-byte. 
The benchmark results look pretty promising to me. I'd be curious what others 
think of these results. If others agree they look promising, I'll tidy up the 
[code|https://github.com/apache/lucene/compare/main...gsmiller:LUCENE-9850/pfordocids2],
 write some tests and put a PR out there. And just as a reminder, this change 
reduces the doc ID data in the Wikipedia benchmark index by *~12%*, resulting 
in an *~3.3% overall index size reduction*.

Benchmark results are as follows. It does look like there's a significant 
regression in some tasks, but much smaller than before. But now there are also 
some significant QPS improvements for some other tasks. Seems like decent 
results overall, but others will have more opinions on this I'm sure.
{code:java}
TaskQPS baseline  StdDevQPS pfordocids  StdDev    Pct diff  p-value
    HighTermDayOfYearSort   76.66 (13.5%)   74.15 (12.0%)   -3.3% ( -25% -   25%) 0.418
        HighTermMonthSort  105.44 (10.3%)  102.66  (8.5%)   -2.6% ( -19% -   18%) 0.379
              LowSpanNear   38.95  (2.3%)   38.05  (1.7%)   -2.3% (  -6% -1%) 0.000
               AndHighMed   85.41  (2.6%)   83.58  (2.5%)   -2.1% (  -7% -3%) 0.008
               OrHighHigh   16.29  (1.9%)   16.00  (6.1%)   -1.8% (  -9% -6%) 0.209
               TermDTSort  186.08 (11.7%)  182.87 (10.0%)   -1.7% ( -20% -   22%) 0.616
              MedSpanNear   12.45  (2.2%)   12.24  (1.6%)   -1.7% (  -5% -2%) 0.005
     HighIntervalsOrdered    6.11  (2.8%)    6.00  (2.8%)   -1.7% (  -7% -4%) 0.061
     HighTermTitleBDVSort   61.20 (10.8%)   60.19 (10.2%)   -1.7% ( -20% -   21%) 0.617
          MedSloppyPhrase   29.64  (4.6%)   29.16  (4.5%)   -1.6% ( -10% -7%) 0.262
              AndHighHigh   58.98  (1.9%)   58.05  (2.2%)   -1.6% (  -5% -2%) 0.015
                OrHighMed   62.28  (2.6%)   61.33  (6.0%)   -1.5% (  -9% -7%) 0.293
          LowSloppyPhrase   11.15  (3.2%)   10.98  (3.2%)   -1.5% (  -7% -5%) 0.134
             HighSpanNear    7.33  (2.7%)    7.25  (2.3%)   -1.2% (  -6% -3%) 0.143
                  Prefix3  239.67 (15.4%)  237.28 (18.6%)   -1.0% ( -30% -   38%) 0.853
                LowPhrase  131.88  (2.3%)  130.75  (2.6%)   -0.9% (  -5% -4%) 0.264
                OrHighLow  264.48  (5.4%)  262.67  (5.7%)   -0.7% ( -11% -   10%) 0.695
BrowseDayOfYearSSDVFacets   11.57  (4.5%)   11.49  (5.0%)   -0.7% (  -9% -9%) 0.661
    BrowseMonthSSDVFacets   12.38  (6.6%)   12.31  (7.9%)   -0.6% ( -14% -   14%) 0.799
                   Fuzzy2   52.44 (12.0%)   52.24 (12.3%)   -0.4% ( -22% -   27%) 0.920
                 PKLookup  146.58  (3.8%)  146.05  (4.1%)   -0.4% (  -7% -7%) 0.774
         HighSloppyPhrase    1.78  (5.1%)    1.77  (4.2%)   -0.3% (  -9% -9%) 0.839
BrowseDayOfYearTaxoFacets    4.15  (3.4%)    4.15  (2.9%)   -0.0% (  -6% -6%) 0.962
     BrowseDateTaxoFacets    4.15  (3.4%)    4.14  (2.8%)   -0.0% (  -6% -6%) 0.964
    BrowseMonthTaxoFacets    5.02  (2.7%)    5.02  (2.4%)    0.1% (  -4% -5%) 0.920
                MedPhrase   87.31  (3.1%)   87.43  (3.1%)    0.1% (  -5% -6%) 0.893
                  Respell   44.49  (2.2%)   44.74  (2.8%)    0.6% (  -4% -5%) 0.484
                 Wildcard   54.13  (3.3%)   54.49  (3.8%)    0.7% (  -6% -7%) 0.557
             OrHighNotMed  521.75  (5.5%)  526.04  (6.2%)    0.8% ( -10% -   13%) 0.659
                   Fuzzy1   54.77  (9.9%)   55.25  (7.8%)    0.9% ( -15% -   20%) 0.753
                   IntNRQ   94.71  (0.5%)   95.68  (1.3%)    1.0% (   0% -2%) 0.001
             OrNotHighLow  618.18  (5.1%)  624.93  (6.1%)    1.1% (  -9% -   12%) 0.537
             OrNotHighMed  473.89  (3.2%)  479.65  (5.4%)    1.2% (  -7% -   10%) 0.388
                  MedTerm 1115.74  (4.5%) 1130.77  (6.8%)    1.3% (  -9% -   13%) 0.463
            OrHighNotHigh  428.11  (4.0%)  436.25  (6.9%)    1.9% (  -8% -   13%) 0.286
               HighPhrase   86.34  (2.6%)   87.99  (3.8%)    1.9% (  -4% -

[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-9850:

Attachment: bulk_read_2.png

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: apply_exceptions.png, bulk_read_1.png, bulk_read_2.png, 
> for.png, pfor.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.
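
As a toy illustration of the trade-off (made-up numbers, not from any real block): FOR must size every value in a block for the largest delta, while PFOR encodes the block at a smaller bit width and stores the few outliers separately as exceptions.
{code:java}
class ForVsPforToy {
  public static void main(String[] args) {
    int[] deltas = {1, 3, 2, 5, 1, 7, 2, 100, 4, 2, 6, 3}; // one outlier (100) in the block
    int forBits = 32 - Integer.numberOfLeadingZeros(100);  // FOR: 7 bits for every value
    int pforBits = 32 - Integer.numberOfLeadingZeros(7);   // PFOR: 3 bits for the non-outliers
    System.out.println("FOR  ~ " + forBits * deltas.length + " bits for the block");
    System.out.println("PFOR ~ " + pforBits * deltas.length
        + " bits + per-exception bytes (position and high-order bits of the outlier)");
  }
}
{code}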






[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-9850:

Attachment: bulk_read_1.png

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: apply_exceptions.png, bulk_read_1.png, for.png, pfor.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.






[jira] [Comment Edited] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314007#comment-17314007
 ] 

Greg Miller edited comment on LUCENE-9850 at 4/2/21, 6:55 PM:
--

I'm getting in the weeds a bit on this probably, but since it appears JDK 
Mission Control now provides flame charts, we can now see where the PFOR 
approach is spending extra time (on top of FOR). 

Here's FOR as a baseline:

!for.png|width=850,height=134!

And here's my latest version with PFOR:

!pfor.png|width=855,height=133!

It looks like applying exceptions (PForUtil.applyExceptionsIn32Space) is 
accounting for a good chunk of where decoding is spending its time. This is 
pure extra overhead. 

And digging into that, it looks like ~2/3 of that time is being spent actually 
reading the bytes (position and high-order bits of the exceptions).

!apply_exceptions.png|width=852,height=94!

So I don't know. It feels like the algorithm piece of this is close to as 
optimized as it's going to get right now. Not sure how much can be done about 
reading those bytes. No way around the need to read them. I know there's also a 
bulk byte-reading method on DataInput, though, so I wonder if reading all of the 
bytes at once would help. I'm going to experiment there a little bit.
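
For reference, the difference I have in mind looks roughly like this (method and variable names here are made up for illustration; only {{DataInput.readByte}} and {{DataInput.readBytes}} are real API):
{code:java}
import java.io.IOException;
import org.apache.lucene.store.DataInput;

class ExceptionBytesSketch {
  // one virtual call per byte: a position byte plus a high-order-bits byte per exception
  static byte[] readOneByOne(DataInput in, int numExceptions) throws IOException {
    byte[] buffer = new byte[numExceptions * 2];
    for (int i = 0; i < buffer.length; i++) {
      buffer[i] = in.readByte();
    }
    return buffer;
  }

  // single bulk read of the same bytes
  static byte[] readInBulk(DataInput in, int numExceptions) throws IOException {
    byte[] buffer = new byte[numExceptions * 2];
    in.readBytes(buffer, 0, buffer.length);
    return buffer;
  }
}
{code}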


was (Author: gsmiller):
I'm getting in the weeds a bit on this probably, but since it appears JDK 
Mission Control now provides flame charts, we can now see where the PFOR 
approach is spending extra time (on top of FOR). 

Here's FOR as a baseline:

!for.png|width=850,height=134!

And here's my latest version with PFOR:

!pfor.png|width=855,height=133!

It looks like applying exceptions (PForUtil.applyExceptionsIn32Space) is 
accounting for a good chunk of where decoding is spending its time. This is 
pure extra overhead. 

And digging into that, it looks like ~2/3 of that time is being spent actually 
reading the bytes (position and high-order bits of the exceptions).

!apply_exceptions.png|width=852,height=94!

So I don't know. It feels like the algorithm piece of this is close to as 
optimized as it's going to get right now. Not sure how much can be done about 
reading those bytes. No way around the need to read them.

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: apply_exceptions.png, for.png, pfor.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.






[jira] [Updated] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation

2021-04-02 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-9823:
-
Labels: newdev  (was: )

> SynonymQuery rewrite can change field boost calculation
> ---
>
> Key: LUCENE-9823
> URL: https://issues.apache.org/jira/browse/LUCENE-9823
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
>  Labels: newdev
>
> SynonymQuery accepts a boost per term, which acts as a multiplier on the term 
> frequency in the document. When rewriting a SynonymQuery with a single term, 
> we create a BoostQuery wrapping a TermQuery. This changes the meaning of the 
> boost: it now multiplies the final TermQuery score instead of multiplying the 
> term frequency before it's passed to the score calculation.
> This is a small point, but maybe it's worth avoiding rewriting a single-term 
> SynonymQuery unless the boost is 1.0.
> The same consideration affects CombinedFieldQuery in sandbox.
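
A rough sketch of the guard suggested above (illustrative only, not the actual SynonymQuery rewrite code):
{code:java}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SynonymQuery;
import org.apache.lucene.search.TermQuery;

class SingleTermRewriteSketch {
  static Query rewriteSingleTerm(SynonymQuery original, Term term, float boost) {
    if (boost == 1.0f) {
      return new TermQuery(term); // safe: no boost semantics to preserve
    }
    // keep the SynonymQuery so the boost keeps scaling the term frequency,
    // rather than the final score as a BoostQuery wrapper would
    return original;
  }
}
{code}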






[jira] [Comment Edited] (LUCENE-9657) Unified Highlighter throws too_complex_to_determinize_exception with >288 filter terms

2021-04-02 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314008#comment-17314008
 ] 

Julie Tibshirani edited comment on LUCENE-9657 at 4/2/21, 6:08 PM:
---

Closing this out for now, feel free to follow up in Elasticsearch!


was (Author: julietibs):
Closing this out for now, feel free to follow-up in Elasticsearch!

> Unified Highlighter throws too_complex_to_determinize_exception with >288 
> filter terms
> --
>
> Key: LUCENE-9657
> URL: https://issues.apache.org/jira/browse/LUCENE-9657
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.6.2
>Reporter: Isaac Doub
>Priority: Major
>
> There seems to be a problem with the Unified Highlighter in Lucene 8.6.2 that 
> is affecting ElasticSearch 7.9.1. If a search is performed with >288 filter 
> terms using the unified highlighter it throws a 
> too_complex_to_determinize_exception error, but if you switch to the plain 
> highlighter it works fine. Alternatively, if you filter on a "copy_to" field 
> instead of the indexed field, it also works.
>  
> This throws the error
> {code:java}
>  {
>     "highlight": {
>         "type": "unified",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  
>  
> But this works fine
> {code:java}
>  {
>     "highlight": {
>         "type": "plain",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  
>  
> Or if I adjust the search to use the copy_to field it works as well (note 
> "id" is now "_id")
> {code:java}
>  {
>     "highlight": {
>         "type": "unified",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "_id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  






[jira] [Resolved] (LUCENE-9657) Unified Highlighter throws too_complex_to_determinize_exception with >288 filter terms

2021-04-02 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-9657.
--
Resolution: Resolved

> Unified Highlighter throws too_complex_to_determinize_exception with >288 
> filter terms
> --
>
> Key: LUCENE-9657
> URL: https://issues.apache.org/jira/browse/LUCENE-9657
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.6.2
>Reporter: Isaac Doub
>Priority: Major
>
> There seems to be a problem with the Unified Highlighter in Lucene 8.6.2 that 
> is affecting ElasticSearch 7.9.1. If a search is performed with >288 filter 
> terms using the unified highlighter it throws a 
> too_complex_to_determinize_exception error, but if you switch to the plain 
> highlighter it works fine. Alternatively, if you filter on a "copy_to" field 
> instead of the indexed field, it also works.
>  
> This throws the error
> {code:java}
>  {
>     "highlight": {
>         "type": "unified",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  
>  
> But this works fine
> {code:java}
>  {
>     "highlight": {
>         "type": "plain",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  
>  
> Or if I adjust the search to use the copy_to field it works as well (note 
> "id" is now "_id")
> {code:java}
>  {
>     "highlight": {
>         "type": "unified",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "_id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  






[jira] [Commented] (LUCENE-9657) Unified Highlighter throws too_complex_to_determinize_exception with >288 filter terms

2021-04-02 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314008#comment-17314008
 ] 

Julie Tibshirani commented on LUCENE-9657:
--

Closing this out for now, feel free to follow-up in Elasticsearch!

> Unified Highlighter throws too_complex_to_determinize_exception with >288 
> filter terms
> --
>
> Key: LUCENE-9657
> URL: https://issues.apache.org/jira/browse/LUCENE-9657
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.6.2
>Reporter: Isaac Doub
>Priority: Major
>
> There seems to be a problem with the Unified Highlighter in Lucene 8.6.2 that 
> is affecting ElasticSearch 7.9.1. If a search is performed with >288 filter 
> terms using the unified highlighter it throws a 
> too_complex_to_determinize_exception error, but if you switch to the plain 
> highlighter it works fine. Alternatively, if you filter on a "copy_to" field 
> instead of the indexed field, it also works.
>  
> This throws the error
> {code:java}
>  {
>     "highlight": {
>         "type": "unified",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  
>  
> But this works fine
> {code:java}
>  {
>     "highlight": {
>         "type": "plain",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  
>  
> Or if I adjust the search to use the copy_to field it works as well (note 
> "id" is now "_id")
> {code:java}
>  {
>     "highlight": {
>         "type": "unified",
>         "fields": {
>             "title": {
>                 "require_field_match": false
>             }
>         }
>     },
>     "query": {
>         "bool": {
>             "must": [{
>                 "query_string": {
>                     "query": "*"
>                 }
>             }],
>             "filter": [{
>                 "bool": {
>                     "must": [{
>                         "terms": {
>                             "_id": [ ">288 terms here" ]
>                         }
>                     }]
>                 }
>             }]
>         }
>     }
> }{code}
>  






[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314007#comment-17314007
 ] 

Greg Miller commented on LUCENE-9850:
-

I'm getting in the weeds a bit on this probably, but since it appears JDK 
Mission Control now provides flame charts, we can now see where the PFOR 
approach is spending extra time (on top of FOR). 

Here's FOR as a baseline:

!for.png|width=850,height=134!

And here's my latest version with PFOR:

!pfor.png|width=855,height=133!

It looks like applying exceptions (PForUtil.applyExceptionsIn32Space) is 
accounting for a good chunk of where decoding is spending its time. This is 
pure extra overhead. 

And digging into that, it looks like ~2/3 of that time is being spent actually 
reading the bytes (position and high-order bits of the exceptions).

!apply_exceptions.png|width=852,height=94!

So I don't know. It feels like the algorithm piece of this is close to as 
optimized as it's going to get right now. Not sure how much can be done about 
reading those bytes. No way around the need to read them.

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: apply_exceptions.png, for.png, pfor.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.






[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-9850:

Attachment: apply_exceptions.png

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: apply_exceptions.png, for.png, pfor.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.






[GitHub] [lucene] dweiss commented on a change in pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed

2021-04-02 Thread GitBox


dweiss commented on a change in pull request #61:
URL: https://github.com/apache/lucene/pull/61#discussion_r606353297



##
File path: gradle/generation/icu.gradle
##
@@ -49,6 +49,13 @@ configure(project(":lucene:analysis:icu")) {
 // May be undefined yet, so use a provider.
 dependsOn { sourceSets.tools.runtimeClasspath }
 
+// gennorm generates file order-dependent output, so make it constant here.

Review comment:
   Thanks Robert. Enjoy your holidays!







[jira] [Commented] (LUCENE-9857) Skip cache building if IndexOrDocValuesQuery choose the dvQuery

2021-04-02 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314006#comment-17314006
 ] 

Julie Tibshirani commented on LUCENE-9857:
--

Hello [~gf2121]! I think that when we cache an IndexOrDocValuesQuery, we always 
make sure to use the 'index' version. This is because we pass a lead cost of 
{{Long.MAX_VALUE}} when building the scorer: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L769.

Are you seeing surprising behavior in tests or benchmarks that led you to file 
the issue? It could be that we're still missing some detail.

> Skip cache building if IndexOrDocValuesQuery choose the dvQuery
> ---
>
> Key: LUCENE-9857
> URL: https://issues.apache.org/jira/browse/LUCENE-9857
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: main (9.0)
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IndexOrDocValuesQuery can automatically use dvQueries when the cost > 8 * 
> leadcost, and the LRUQueryCache skips cache building when cost > 250 (by 
> default) * leadcost. There is a gap between 8 and 250, which means if the 
> factor is just between 8 and 250 (e.g. cost = 10 * leadcost), the 
> IndexOrDocValuesQuery will choose the dvQuery but LRUQueryCache will still 
> build a cache for it.
> IndexOrDocValuesQuery aims to speed up queries when the leadcost is small, 
> but building cache by dvScorers can make it meaningless because it needs to 
> scan all the docvalues. This can be rather slow for big segments, so maybe we 
> should skip the cache building for IndexOrDocValuesQuery when it chooses 
> dvQueries.
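
A tiny sketch of the gap described above (the constants mirror the description; the method names are hypothetical, not the actual Lucene code):
{code:java}
class ThresholdGapSketch {
  static boolean dvQueryChosen(long cost, long leadCost) {
    return cost > 8 * leadCost;      // IndexOrDocValuesQuery falls back to the doc-values scorer
  }

  static boolean cacheStillBuilt(long cost, long leadCost) {
    return cost <= 250 * leadCost;   // LRUQueryCache's default skip-caching threshold
  }

  public static void main(String[] args) {
    long leadCost = 1_000;
    long cost = 10 * leadCost;       // the example from the description
    // prints true: the dvQuery is chosen, yet the cache would still be built
    System.out.println(dvQueryChosen(cost, leadCost) && cacheStillBuilt(cost, leadCost));
  }
}
{code}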






[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-9850:

Attachment: pfor.png

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: for.png, pfor.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.






[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-9850:

Attachment: for.png

> Explore PFOR for Doc ID delta encoding (instead of FOR)
> ---
>
> Key: LUCENE-9850
> URL: https://issues.apache.org/jira/browse/LUCENE-9850
> Project: Lucene - Core
>  Issue Type: Task
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
> Attachments: for.png
>
>
> It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. 
> Right now PFOR is used for positions, frequencies and payloads, but FOR is 
> used for doc ID deltas. From a recent 
> [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E]
>  on the dev mailing list, it sounds like this decision was made based on the 
> optimization possible when expanding the deltas.
> I'd be interested in measuring the index size reduction possible with 
> switching to PFOR compared to the performance reduction we might see by no 
> longer being able to apply the deltas in as optimal a way.






[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313993#comment-17313993
 ] 

Tomoko Uchida commented on LUCENE-9855:
---

Thank you for clarifying it!
Personally I would always prefer the "one vector format and multiple ANN 
strategies" concept, and I'd love to see us end up with a unified vector codec. 
In the meantime, it could be a fairly heavy load for one codec writer and/or 
reader to carry implementations for multiple search strategies. I think it 
would be great if we could factor out the search-algorithm-specific metadata 
building and en/decoding logic into separate classes (the good old strategy 
design pattern, or something like it, could work; perhaps you have already 
considered it though?). But that would be off-topic here and should be handled 
in separate issues.
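
Very roughly, the factoring I have in mind would look something like this (all names hypothetical, nothing from the actual codec classes):
{code:java}
// Each ANN algorithm owns its own metadata encoding/decoding...
interface AnnStrategyCodec {
  void writeMetadata();  // e.g. HNSW graph, or LSH hash tables
  void readMetadata();
}

class HnswStrategyCodec implements AnnStrategyCodec {
  @Override public void writeMetadata() { /* graph-specific encoding */ }
  @Override public void readMetadata() { /* graph-specific decoding */ }
}

class LshStrategyCodec implements AnnStrategyCodec {
  @Override public void writeMetadata() { /* hash-table encoding */ }
  @Override public void readMetadata() { /* hash-table decoding */ }
}
// ...and a single vectors format delegates to the strategy recorded in the file
// header, instead of branching inside the writer/reader itself.
{code}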

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Blocker
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat






[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-02 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313963#comment-17313963
 ] 

Michael Sokolov commented on LUCENE-9855:
-

> If we'd like to have another ANN algorithm, say LSH, we would switch the 
>metadata building and encoding logic in the codec writer/reader 
>implementation, is my understanding correct?

Right, we'd have conditional logic in `Lucene90VectorReader` and 
`Lucene90VectorWriter` that would do different things depending on the 
`SearchStrategy`. The reader would have to be able to read a header from the 
metadata including the strategy, and then read the remainder of the metadata 
and other data differently, depending on the strategy.

Or we could make a new format for LSH and for any other ANN strategy that has a 
different index file format.

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Blocker
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat






[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313962#comment-17313962
 ] 

Tomoko Uchida commented on LUCENE-9855:
---

Answering my own question... maybe I missed this if statement: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorWriter.java#L124

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Blocker
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat






[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313955#comment-17313955
 ] 

Tomoko Uchida commented on LUCENE-9855:
---

{quote}I think it will be helpful to consider how we would handle a 
different ANN implementation. Say LSH. In that case, we would no longer store 
this graph file (what is currently in .vex files). We would need to add files 
to store LSH's hash tables, and the metadata would change. Is it the same 
format? A variant of this format? The current conception is that we would make 
variations of this single vector format to handle multiple ANN algorithms.
{quote}

I think I still don't fully understand your thoughts on handling multiple 
ANN algorithms with a single format, so would you please help me figure out 
the strategy for switching algorithms? If we'd like to have another ANN 
algorithm, say LSH, we would switch the metadata building and encoding logic in 
the codec writer/reader implementation; is my understanding correct?

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Blocker
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat






[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-02 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313940#comment-17313940
 ] 

Michael Sokolov commented on LUCENE-9855:
-

I think it will be helpful to consider how we would handle a different ANN 
implementation. Say LSH. In that case, we would no longer store this graph file 
(what is currently in .vex files). We would need to add files to store LSH's 
hash tables, and the metadata would change. Is it the same format? A variant of 
this format? The current conception is that we would make variations of this 
single vector format to handle multiple ANN algorithms. We currently only have 
one, so it doesn't look that way, but anyway with that background a generic 
name like VectorsFormat (or NumericVectorsFormat to distinguish from DocVectors 
etc) makes sense.

On the other hand if you think we would create a new Format to represent this 
different kind of data that we are storing to disk, which will have its own 
de/serialization code (even if some of it would be the same), then we should 
pick a name that incorporates the algorithm, and by the way also get rid of the 
whole concept of {{SearchStrategy}}.

I think this is the fundamental question here: one format, multiple ANN 
strategies, or one format per ANN strategy? I thought it had been sorted out in 
our earlier discussions, but not everybody may have been following that closely.

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Blocker
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat






[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat

2021-04-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313937#comment-17313937
 ] 

Tomoko Uchida commented on LUCENE-9855:
---

I was almost ready to throw this away given the difficulty of naming, but 
[~julietibs] is probably right: we should try to give a proper name to our 
"vector"s (yes, I think it is certainly nothing other than a "vector", although 
I understand the term is already overloaded many times). NumericVectorsFormat 
or VectorValuesFormat, or else DenseVectorsFormat, either one looks fine to me 
if "VectorsFormat" is too vague for us.

On the other hand, when we look closely at the current codec Writer/Reader 
implementation, it seems to be tightly coupled with HNSW graph building, 
encoding, and searching. From my viewpoint we should manage to decouple the 
HNSW-specific code from the Codec in the near future, to resolve the 
inconsistency between its name and its implementation...

> Reconsider codec name VectorFormat
> --
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Blocker
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use plural form for consistency, and 2) use more 
> specific name for ann search (second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat






[GitHub] [lucene] rmuir commented on a change in pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed

2021-04-02 Thread GitBox


rmuir commented on a change in pull request #61:
URL: https://github.com/apache/lucene/pull/61#discussion_r606244542



##
File path: gradle/generation/icu.gradle
##
@@ -49,6 +49,13 @@ configure(project(":lucene:analysis:icu")) {
 // May be undefined yet, so use a provider.
 dependsOn { sourceSets.tools.runtimeClasspath }
 
+// gennorm generates file order-dependent output, so make it constant here.

Review comment:
@dweiss it is my understanding order doesn't matter. so `a > b` followed 
by `b > c` is the same as `b > c` followed by `a > b`. the normalization will 
happen recursively so that 'a' maps to 'c' in either case. this recursive 
process happens at gennorm2 'compile time'. but in version 2 of the file 
format, they added a mechanism to recover the original non-recursive mappings, 
so this may cause a data file change even though semantics are unchanged? just 
my guess... more details: 
http://site.icu-project.org/design/normalization/custom







[jira] [Comment Edited] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313877#comment-17313877
 ] 

Greg Miller edited comment on LUCENE-9850 at 4/2/21, 1:41 PM:
--

I haven't really been following along with what's going on in JDK17, but being 
able to more explicitly generate vectorized instructions will be nice!

I optimized my branch a bit further. At this point I think it has all the 
optimizations of ForDeltaUtil, and the only extra work is applying the 
exceptions. I even pulled in the optimization to apply the prefix to two values 
at a time (packed in a long). That code is over 
[here|https://github.com/apache/lucene/compare/main...gsmiller:pfordocid-opto3].
 (It's not polished at all... a bit hacky)

If you look at the code, one thing that may not be obvious is how 
encodePrefixSum differs from encode. In ForDeltaUtil, the 0 bpv case implies 
that all values were "1", while PForUtil encodes the actual "same" value, 
allowing it to take on values other than "1". While it's pretty trivial, this 
meant checking that value in decoding and special-casing 1 vs. non-1 values. I 
took on the ForDeltaUtil logic here and assume it's always 1. It's minor and I 
suspect we could remove this optimization, but I'm trying to get the decode 
logic as close to ForDeltaUtil as possible.

The benchmark results are looking a lot better now, but maybe still some 
regressions. I've seen a little variability in these results, so I'm not sure 
how often they might present false-regression results on individual tasks? 
Here's what I've got at this point:
{code:java}
TaskQPS baseline  StdDevQPS pfordocids  StdDev    Pct diff  p-value
              LowSpanNear   98.61  (2.2%)   95.57  (1.7%)   -3.1% (  -6% -0%) 0.000
            OrNotHighHigh  545.01  (3.8%)  531.11  (5.3%)   -2.6% ( -11% -6%) 0.078
                 Wildcard   40.83  (4.1%)   40.05  (3.9%)   -1.9% (  -9% -6%) 0.132
                OrHighMed  102.39  (2.5%)  100.50  (2.5%)   -1.8% (  -6% -3%) 0.021
              AndHighHigh   50.93  (3.3%)   50.03  (3.1%)   -1.8% (  -7% -4%) 0.079
               TermDTSort   98.42 (11.6%)   96.72 (14.4%)   -1.7% ( -24% -   27%) 0.676
               AndHighMed   68.10  (2.9%)   66.94  (2.9%)   -1.7% (  -7% -4%) 0.063
                 HighTerm 1169.43  (4.4%) 1151.70  (5.1%)   -1.5% ( -10% -8%) 0.314
    BrowseMonthSSDVFacets   12.50  (5.6%)   12.31  (7.7%)   -1.5% ( -14% -   12%) 0.480
     HighTermTitleBDVSort  157.42 (14.7%)  155.08 (15.6%)   -1.5% ( -27% -   33%) 0.757
             OrHighNotLow  545.83  (5.7%)  537.85  (7.0%)   -1.5% ( -13% -   12%) 0.472
              MedSpanNear   28.75  (2.4%)   28.34  (1.9%)   -1.4% (  -5% -2%) 0.038
            OrHighNotHigh  533.41  (4.6%)  526.33  (5.3%)   -1.3% ( -10% -8%) 0.394
                   Fuzzy1   59.47  (6.0%)   58.72  (6.8%)   -1.3% ( -13% -   12%) 0.533
             HighSpanNear   21.27  (2.6%)   21.03  (2.2%)   -1.1% (  -5% -3%) 0.153
        HighTermMonthSort  128.50 (12.2%)  127.11 (11.0%)   -1.1% ( -21% -   25%) 0.769
             OrNotHighLow  640.89  (4.0%)  634.43  (3.5%)   -1.0% (  -8% -6%) 0.395
               OrHighHigh   21.11  (2.0%)   20.91  (1.8%)   -1.0% (  -4% -2%) 0.113
                MedPhrase  103.90  (3.0%)  103.05  (2.9%)   -0.8% (  -6% -5%) 0.381
               HighPhrase  172.59  (2.5%)  171.22  (2.5%)   -0.8% (  -5% -4%) 0.320
             OrHighNotMed  535.67  (4.8%)  531.54  (4.7%)   -0.8% (  -9% -9%) 0.607
                  LowTerm 1094.41  (2.9%) 1087.97  (3.1%)   -0.6% (  -6% -5%) 0.535
          MedSloppyPhrase   12.91  (2.4%)   12.85  (2.5%)   -0.5% (  -5% -4%) 0.542
                   IntNRQ  101.21  (0.5%)  100.81  (0.7%)   -0.4% (  -1% -0%) 0.040
                 PKLookup  144.62  (3.0%)  144.11  (3.1%)   -0.4% (  -6% -5%) 0.715
         HighSloppyPhrase    3.75  (2.9%)    3.74  (3.0%)   -0.3% (  -6% -5%) 0.726
     HighIntervalsOrdered   16.00  (2.1%)   15.95  (1.8%)   -0.3% (  -4% -3%) 0.597
    HighTermDayOfYearSort  109.37 (11.0%)  109.03 (15.2%)   -0.3% ( -23% -   29%) 0.941
          LowSloppyPhrase   41.05  (1.9%)   40.93  (2.1%)   -0.3% (  -4% -3%) 0.635
                  MedTerm 1137.13  (4.1%) 1134.84  (4.2%)   -0.2% (  -8% -8%) 0.877
BrowseDayOfYearTaxoFacets    4.24  (3.4%)    4.23

[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)

2021-04-02 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313877#comment-17313877
 ] 

Greg Miller commented on LUCENE-9850:
-

I haven't really been following along with what's going on in JDK17, but being 
able to more explicitly generate vectorized instructions will be nice!

I optimized my branch a bit further. At this point I think it has all the 
optimizations of ForDeltaUtil, and the only extra work is applying the 
exceptions. I even pulled in the optimization to apply the prefix to two values 
at a time (packed in a long). That code is over 
[here|https://github.com/apache/lucene/compare/main...gsmiller:pfordocid-opto3].
 (It's not polished at all... a bit hacky)

The benchmark results are looking a lot better now, but there may still be some 
regressions. I've seen a little variability in these results, so I'm not sure 
how often they might show false regressions on individual tasks. Here's what 
I've got at this point:
{code:java}
Task   QPS baseline   StdDev   QPS pfordocids   StdDev   Pct diff   p-value
LowSpanNear   98.61 (2.2%)   95.57 (1.7%)   -3.1% ( -6% - 0%)   0.000
OrNotHighHigh   545.01 (3.8%)   531.11 (5.3%)   -2.6% ( -11% - 6%)   0.078
Wildcard   40.83 (4.1%)   40.05 (3.9%)   -1.9% ( -9% - 6%)   0.132
OrHighMed   102.39 (2.5%)   100.50 (2.5%)   -1.8% ( -6% - 3%)   0.021
AndHighHigh   50.93 (3.3%)   50.03 (3.1%)   -1.8% ( -7% - 4%)   0.079
TermDTSort   98.42 (11.6%)   96.72 (14.4%)   -1.7% ( -24% - 27%)   0.676
AndHighMed   68.10 (2.9%)   66.94 (2.9%)   -1.7% ( -7% - 4%)   0.063
HighTerm   1169.43 (4.4%)   1151.70 (5.1%)   -1.5% ( -10% - 8%)   0.314
BrowseMonthSSDVFacets   12.50 (5.6%)   12.31 (7.7%)   -1.5% ( -14% - 12%)   0.480
HighTermTitleBDVSort   157.42 (14.7%)   155.08 (15.6%)   -1.5% ( -27% - 33%)   0.757
OrHighNotLow   545.83 (5.7%)   537.85 (7.0%)   -1.5% ( -13% - 12%)   0.472
MedSpanNear   28.75 (2.4%)   28.34 (1.9%)   -1.4% ( -5% - 2%)   0.038
OrHighNotHigh   533.41 (4.6%)   526.33 (5.3%)   -1.3% ( -10% - 8%)   0.394
Fuzzy1   59.47 (6.0%)   58.72 (6.8%)   -1.3% ( -13% - 12%)   0.533
HighSpanNear   21.27 (2.6%)   21.03 (2.2%)   -1.1% ( -5% - 3%)   0.153
HighTermMonthSort   128.50 (12.2%)   127.11 (11.0%)   -1.1% ( -21% - 25%)   0.769
OrNotHighLow   640.89 (4.0%)   634.43 (3.5%)   -1.0% ( -8% - 6%)   0.395
OrHighHigh   21.11 (2.0%)   20.91 (1.8%)   -1.0% ( -4% - 2%)   0.113
MedPhrase   103.90 (3.0%)   103.05 (2.9%)   -0.8% ( -6% - 5%)   0.381
HighPhrase   172.59 (2.5%)   171.22 (2.5%)   -0.8% ( -5% - 4%)   0.320
OrHighNotMed   535.67 (4.8%)   531.54 (4.7%)   -0.8% ( -9% - 9%)   0.607
LowTerm   1094.41 (2.9%)   1087.97 (3.1%)   -0.6% ( -6% - 5%)   0.535
MedSloppyPhrase   12.91 (2.4%)   12.85 (2.5%)   -0.5% ( -5% - 4%)   0.542
IntNRQ   101.21 (0.5%)   100.81 (0.7%)   -0.4% ( -1% - 0%)   0.040
PKLookup   144.62 (3.0%)   144.11 (3.1%)   -0.4% ( -6% - 5%)   0.715
HighSloppyPhrase   3.75 (2.9%)   3.74 (3.0%)   -0.3% ( -6% - 5%)   0.726
HighIntervalsOrdered   16.00 (2.1%)   15.95 (1.8%)   -0.3% ( -4% - 3%)   0.597
HighTermDayOfYearSort   109.37 (11.0%)   109.03 (15.2%)   -0.3% ( -23% - 29%)   0.941
LowSloppyPhrase   41.05 (1.9%)   40.93 (2.1%)   -0.3% ( -4% - 3%)   0.635
MedTerm   1137.13 (4.1%)   1134.84 (4.2%)   -0.2% ( -8% - 8%)   0.877
BrowseDayOfYearTaxoFacets   4.24 (3.4%)   4.23 (3.2%)   -0.2% ( -6% - 6%)   0.885
Prefix3   263.31 (9.1%)   263.08 (9.2%)   -0.1% ( -16% - 20%)   0.976
BrowseDateTaxoFacets   4.23 (3.4%)   4.23 (3.2%)   -0.1% ( -6% - 6%)   0.941
Fuzzy2   44.14 (13.1%)   44.17 (14.2%)   0.1% ( -24% - 31%)   0.990
BrowseMonthTaxoFacets   5.00 (2.6%)   5.01 (2.2%)   0.1% ( -4% - 4%)   0.878
OrNotHighMed   508.22 (3.6%)   508.82 (3.9%)   0.1% ( -7% - 7%)   0.921
Respell   38.75 (2.3%)   38.82 (2.2%)

[jira] [Assigned] (LUCENE-9901) UnicodeData.java has no regeneration task

2021-04-02 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss reassigned LUCENE-9901:
---

Assignee: Dawid Weiss

> UnicodeData.java has no regeneration task
> -
>
> Key: LUCENE-9901
> URL: https://issues.apache.org/jira/browse/LUCENE-9901
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: modules/analysis
>Reporter: Uwe Schindler
>Assignee: Dawid Weiss
>Priority: Major
>
> When moving the build system to gradle, we lost the following groovy script, 
> which is used to regenerate the UnicodeData.java file to be in line with the 
> actually used ICU4J version. The groovy script is still in the repository, 
> but it's no longer used by the build system:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy
> To execute it, we need to convert it to a Gradle task that depends on the 
> same version of ICU4J that we use as the analysis/icu dependency (not sure 
> how to do this, maybe it's easy using palantir).
> The file should also be hashed and put into the regenerated file hashes:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java
> Old Ant task is here:
> https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9901) UnicodeData.java has no regeneration task

2021-04-02 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313871#comment-17313871
 ] 

Dawid Weiss commented on LUCENE-9901:
-

Thanks Uwe. I wonder if it's the only missing bit - there may be more!

> UnicodeData.java has no regeneration task
> -
>
> Key: LUCENE-9901
> URL: https://issues.apache.org/jira/browse/LUCENE-9901
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: modules/analysis
>Reporter: Uwe Schindler
>Assignee: Dawid Weiss
>Priority: Major
>
> When moving the build system to gradle, we lost the following groovy script, 
> which is used to regenerate the UnicodeData.java file to be in line with the 
> actually used ICU4J version. The groovy script is still in the repository, 
> but it's no longer used by the build system:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy
> To execute it, we need to convert it to a Gradle task that depends on the 
> same version of ICU4J that we use as the analysis/icu dependency (not sure 
> how to do this, maybe it's easy using palantir).
> The file should also be hashed and put into the regenerated file hashes:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java
> Old Ant task is here:
> https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-9901) UnicodeData.java has no regeneration task

2021-04-02 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler updated LUCENE-9901:
--
Description: 
When moving the build system to gradle, we lost the following groovy script, which 
is used to regenerate the UnicodeData.java file to be in line with the actually 
used ICU4J version. The groovy script is still in the repository, but it's no 
longer used by the build system:
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy

To execute it, we need to convert it to a Gradle task that depends on the same 
version of ICU4J that we use as the analysis/icu dependency (not sure how to do 
this, maybe it's easy using palantir).

The file should also be hashed and put into the regenerated file hashes:
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java

Old Ant task is here:
https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94

  was:
When porting to gradle, we lost the following groovy script, which is used to 
regenerate the UnicodeData.java file to be in line with the actually used ICU4J 
version.

The groovy script is still in the repository, but it's no longer used by the 
build system:
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy

To execute it, we need to convert it to a Gradle task that depends on the same 
version of ICU4J that we use as analysis/icu dependency.

The file should also be hashed and put into the regenerated file hashes:
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java

Old Ant task is here:
https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94


> UnicodeData.java has no regeneration task
> -
>
> Key: LUCENE-9901
> URL: https://issues.apache.org/jira/browse/LUCENE-9901
> Project: Lucene - Core
>  Issue Type: Sub-task
>  Components: modules/analysis
>Reporter: Uwe Schindler
>Priority: Major
>
> When moving the build system to gradle, we lost the following groovy script, 
> which is used to regenerate the UnicodeData.java file to be in line with the 
> actually used ICU4J version. The groovy script is still in the repository, 
> but it's no longer used by the build system:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy
> To execute it, we need to convert it to a Gradle task that depends on the 
> same version of ICU4J that we use as the analysis/icu dependency (not sure 
> how to do this, maybe it's easy using palantir).
> The file should also be hashed and put into the regenerated file hashes:
> https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java
> Old Ant task is here:
> https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9901) UnicodeData.java has no regeneration task

2021-04-02 Thread Uwe Schindler (Jira)
Uwe Schindler created LUCENE-9901:
-

 Summary: UnicodeData.java has no regeneration task
 Key: LUCENE-9901
 URL: https://issues.apache.org/jira/browse/LUCENE-9901
 Project: Lucene - Core
  Issue Type: Sub-task
  Components: modules/analysis
Reporter: Uwe Schindler


When porting to gradle, we lost the following groovy script, which is used to 
regenerate the UnicodeData.java file to be in line with the actually used ICU4J 
version.

The groovy script is still in the repository, but it's no longer used by the 
build system:
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy

To execute it, we need to convert it to a Gradle task that depends on the same 
version of ICU4J that we use as analysis/icu dependency.

The file should also be hashed and put into the regenerated file hashes:
https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java

Old Ant task is here:
https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94
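
For illustration, this is roughly the kind of information the regeneration pulls from ICU4J (sketch only; the groovy script above remains the source of truth, and the class/property names here are just examples):
{code:java}
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.VersionInfo;

// Illustrative only: print the ICU/Unicode versions and one property-derived
// character set, the sort of data UnicodeData.java is regenerated from.
public class PrintUnicodeProps {
  public static void main(String[] args) {
    System.out.println("ICU version: " + VersionInfo.ICU_VERSION);
    System.out.println("Unicode version: " + UCharacter.getUnicodeVersion());
    UnicodeSet whitespace = new UnicodeSet("[:Whitespace:]").freeze();
    System.out.println("Whitespace code points: " + whitespace.size());
  }
}
{code}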



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9623) Add module descriptor (module-info.java) to lucene jars

2021-04-02 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313763#comment-17313763
 ] 

Tomoko Uchida commented on LUCENE-9623:
---

bq. What packages are exported and what are kept private - and how do we make 
the decision or get consensus about it (upon such a large codebase)?

I still haven't come up with a good idea for that, but we might be able to get 
some statistics/numbers by searching OSS repositories that use Lucene's public 
APIs. For example, GitHub provides APIs that allow programmatically searching 
repositories hosted on it. Of course OSS code may be only a small fraction of 
the whole, but concrete real-world use cases could give us some clues about 
which classes/packages should be kept public. Just a random thought.
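
For example, something like this (rough sketch only; it assumes a personal access token in GITHUB_TOKEN since GitHub's code search requires authentication, and the class name/query qualifiers are just examples):
{code:java}
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Illustrative only: ask GitHub's code search how many public Java files
// reference a given Lucene class, as one crude signal of real-world usage.
public class LuceneUsageSearch {
  public static void main(String[] args) throws Exception {
    String q = URLEncoder.encode(
        "\"org.apache.lucene.search.IndexSearcher\" language:java", StandardCharsets.UTF_8);
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.github.com/search/code?q=" + q))
        .header("Accept", "application/vnd.github.v3+json")
        .header("Authorization", "token " + System.getenv("GITHUB_TOKEN"))
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // the "total_count" field is the number of matches
  }
}
{code}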

> Add module descriptor (module-info.java) to lucene jars
> ---
>
> Key: LUCENE-9623
> URL: https://issues.apache.org/jira/browse/LUCENE-9623
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: general/build
>Affects Versions: main (9.0)
>Reporter: Tomoko Uchida
>Priority: Major
> Attachments: generate-all-module-info.sh
>
>
> For a start, module descriptors can be automatically generated by the jdeps 
> utility.
> There are two choices.
> 1. generate "open" modules which allows reflective accesses with 
> --generate-open-module option
> 2. generate non-open modules with --generate-module-info option
> Which is better - not fully sure, but maybe 2 (non-open modules)?
> Also, we need to choose proper module names - is just using the gradle 
> project path for them OK?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9900) Regenerate/ run ICU only if inputs changed

2021-04-02 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9900.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> Regenerate/ run ICU only if inputs changed
> --
>
> Key: LUCENE-9900
> URL: https://issues.apache.org/jira/browse/LUCENE-9900
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9900) Regenerate/ run ICU only if inputs changed

2021-04-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313756#comment-17313756
 ] 

ASF subversion and git services commented on LUCENE-9900:
-

Commit 010e3a1ba960ea0dcd7e5092073d2a55a21b0ac9 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=010e3a1 ]

LUCENE-9900: Regenerate/ run ICU only if inputs changed (#61)



> Regenerate/ run ICU only if inputs changed
> --
>
> Key: LUCENE-9900
> URL: https://issues.apache.org/jira/browse/LUCENE-9900
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss merged pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed

2021-04-02 Thread GitBox


dweiss merged pull request #61:
URL: https://github.com/apache/lucene/pull/61


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] noblepaul merged pull request #2479: SOLR-15288: Hardening NODEDOWN event event in PRS mode

2021-04-02 Thread GitBox


noblepaul merged pull request #2479:
URL: https://github.com/apache/lucene-solr/pull/2479


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a change in pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed

2021-04-02 Thread GitBox


dweiss commented on a change in pull request #61:
URL: https://github.com/apache/lucene/pull/61#discussion_r606134465



##
File path: gradle/generation/icu.gradle
##
@@ -49,6 +49,13 @@ configure(project(":lucene:analysis:icu")) {
 // May be undefined yet, so use a provider.
 dependsOn { sourceSets.tools.runtimeClasspath }
 
+// gennorm generates file order-dependent output, so make it constant here.

Review comment:
   @rmuir gennorm seems to be order-sensitive - I changed the order to 
sorted file names - that's why it's regenerated as part of this patch. Is this 
order somehow important? I can revert to what it was before but this way it's 
simpler.
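
Roughly the idea (illustration only, with a made-up input path; the real change lives in gradle/generation/icu.gradle): feed gennorm its input files in a deterministic, sorted order.
{code:java}
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration: list the *.txt inputs in sorted name order so a
// tool with order-dependent output (like gennorm) always sees them the same way.
public class SortedGennormInputs {
  public static void main(String[] args) {
    File dir = new File(args.length > 0 ? args[0] : ".");
    File[] inputs = dir.listFiles((d, name) -> name.endsWith(".txt"));
    if (inputs == null) {
      return; // directory missing or not readable
    }
    Arrays.sort(inputs, Comparator.comparing(File::getName)); // constant, filesystem-independent order
    for (File f : inputs) {
      System.out.println(f.getPath());
    }
  }
}
{code}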




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss opened a new pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed

2021-04-02 Thread GitBox


dweiss opened a new pull request #61:
URL: https://github.com/apache/lucene/pull/61


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9900) Regenerate/ run ICU only if inputs changed

2021-04-02 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-9900:
---

 Summary: Regenerate/ run ICU only if inputs changed
 Key: LUCENE-9900
 URL: https://issues.apache.org/jira/browse/LUCENE-9900
 Project: Lucene - Core
  Issue Type: Sub-task
Reporter: Dawid Weiss
Assignee: Dawid Weiss






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-9872) Make the most painful tasks in regenerate fully incremental

2021-04-02 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9872.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> Make the most painful tasks in regenerate fully incremental
> ---
>
> Key: LUCENE-9872
> URL: https://issues.apache.org/jira/browse/LUCENE-9872
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: main (9.0)
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is particularly important for that one jflex task that is currently a 
> mood-killer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9872) Make the most painful tasks in regenerate fully incremental

2021-04-02 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313690#comment-17313690
 ] 

ASF subversion and git services commented on LUCENE-9872:
-

Commit e3ae57a3c1ac59a1b56a36d965000f45abbc43e5 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e3ae57a ]

LUCENE-9872: Make the most painful tasks in regenerate fully incremental (#60)



> Make the most painful tasks in regenerate fully incremental
> ---
>
> Key: LUCENE-9872
> URL: https://issues.apache.org/jira/browse/LUCENE-9872
> Project: Lucene - Core
>  Issue Type: Sub-task
>Affects Versions: main (9.0)
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is particularly important for that one jflex task that is currently a 
> mood-killer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss merged pull request #60: LUCENE-9872: Make the most painful tasks in regenerate fully incremental

2021-04-02 Thread GitBox


dweiss merged pull request #60:
URL: https://github.com/apache/lucene/pull/60


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org