[jira] [Comment Edited] (LUCENE-9902) Update faceting API to use modern Java features
[ https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314176#comment-17314176 ] Greg Miller edited comment on LUCENE-9902 at 4/3/21, 3:10 AM: -- +1. It looks like the increment methods in IntTaxonomyFacets are already protected, so making getValue protected as well seems pretty reasonable (and more consistent). Since this change would be fully backwards-compatible, should we consider backporting it to 8.9? was (Author: gsmiller): +1. It looks like the increment methods in IntTaxonomyFacets are already protected, so making getValue protected as well seems pretty reasonable (and more consistent). > Update faceting API to use modern Java features > --- > > Key: LUCENE-9902 > URL: https://issues.apache.org/jira/browse/LUCENE-9902 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gautam Worah >Priority: Minor > > I was using the {{public int getOrdinal(String dim, String[] path)}} API for > a single {{path}} String and found myself creating an array with a single > element. We can start using variable length args for this method. > I also propose this change: > I wanted to know the specific count of an ordinal using the > {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It > would be good if we could change it to {{protected}} so that users can know > the value of an ordinal without looking up the {{FacetLabel}} and then > checking its value. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9902) Update faceting API to use modern Java features
[ https://issues.apache.org/jira/browse/LUCENE-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314176#comment-17314176 ] Greg Miller commented on LUCENE-9902: - +1. It looks like the increment methods in IntTaxonomyFacets are already protected, so making getValue protected as well seems pretty reasonable (and more consistent). > Update faceting API to use modern Java features > --- > > Key: LUCENE-9902 > URL: https://issues.apache.org/jira/browse/LUCENE-9902 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Gautam Worah >Priority: Minor > > I was using the {{public int getOrdinal(String dim, String[] path)}} API for > a single {{path}} String and found myself creating an array with a single > element. We can start using variable length args for this method. > I also propose this change: > I wanted to know the specific count of an ordinal using the > {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It > would be good if we could change it to {{protected}} so that users can know > the value of an ordinal without looking up the {{FacetLabel}} and then > checking its value.
[jira] [Created] (LUCENE-9902) Update faceting API to use modern Java features
Gautam Worah created LUCENE-9902: Summary: Update faceting API to use modern Java features Key: LUCENE-9902 URL: https://issues.apache.org/jira/browse/LUCENE-9902 Project: Lucene - Core Issue Type: Improvement Components: modules/facet Reporter: Gautam Worah I was using the {{public int getOrdinal(String dim, String[] path)}} API for a single {{path}} String and found myself creating an array with a single element. We can start using variable length args for this method. I also propose this change: I wanted to know the specific count of an ordinal using the {{getValue}} API from {{IntTaxonomyFacets}} but the method is private. It would be good if we could change it to {{protected}} so that users can know the value of an ordinal without looking up the {{FacetLabel}} and then checking its value.
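The varargs change proposed above can be sketched in isolation. This is a hypothetical stand-in class, not the actual Lucene taxonomy source: it just shows that switching `String[]` to `String...` keeps existing array-passing callers source-compatible while letting single-path callers skip the one-element array.

```java
// Hypothetical sketch of the proposed signature change (not Lucene source).
// A varargs parameter compiles to a String[], so callers that already pass
// an explicit array keep working, while single-path callers get simpler code.
class OrdinalLookup {
    // Stand-in for the real ordinal computation; returns a dummy value.
    int getOrdinal(String dim, String... path) {
        return (dim + "/" + String.join("/", path)).hashCode();
    }
}
```

With this signature, `getOrdinal("Author", "Bob")` and `getOrdinal("Author", new String[] {"Bob"})` resolve to the same call.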
[jira] [Commented] (LUCENE-9648) Add a numbers query in sandbox that takes advantage of index sorting.
[ https://issues.apache.org/jira/browse/LUCENE-9648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314123#comment-17314123 ] Julie Tibshirani commented on LUCENE-9648: -- Thanks [~gf2121] for the interesting ticket and sorry for the slow reply! In general it seems hard to effectively bound the range of query values. The benchmark you posted randomly generates a set of query values from the range [0, 1_000_000_000]. As an experiment I drew 1000 random values and looked at their range: in a typical example we get min: 128_082, max: 999_833_281. So the range covers ~99.9% of the possible values. Even with only 2 random query values I think the range would cover 1/3 of the possible values in expectation. So it seems the speed-up in the benchmark must be from something else -- maybe we just check against the query values more efficiently? It'd be interesting to dig into this to understand what makes it faster. I'm also curious if you have a specific use case in mind. My assumption was that the query values don't tend to cluster together, but maybe they do in your case? > Add a numbers query in sandbox that takes advantage of index sorting. > - > > Key: LUCENE-9648 > URL: https://issues.apache.org/jira/browse/LUCENE-9648 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/sandbox >Reporter: Feng Guo >Priority: Minor > Time Spent: 40m > Remaining Estimate: 0h > > LUCENE-6539 introduced DocValuesNumbersQuery to benefit some special cases > (many terms/numbers and fewish matching hits). Inspired by LUCENE-7714, maybe > we can speed it up when the index is sorted, since we can do a binary search to > bound the docvalues before iterating them. 
> Here are some simple benchmarks:
> *Based on 50,000,000 doc, each long in query can hit at least one doc:*
> |LongsCount|PointInSetQuery (s/op)|IndexSortDocValuesNumbersQuery (s/op)|DocValuesNumbersQuery (s/op)|
> |1000|0.144 ± 0.001|0.428 ± 0.008|1.159 ± 0.031|
> |5000|0.404 ± 0.003|0.424 ± 0.008|1.343 ± 0.022|
> |1|0.661 ± 0.012|0.436 ± 0.008|1.372 ± 0.011|
> |5|1.319 ± 0.011|0.450 ± 0.005|1.173 ± 0.017|
> |10|1.328 ± 0.007|0.459 ± 0.003|1.160 ± 0.098|
> *Based on 100,000,000 doc, each long in query can hit at least one doc:*
> |LongsCount|PointInSetQuery (s/op)|IndexSortDocValuesNumbersQuery (s/op)|DocValuesNumbersQuery (s/op)|
> |1000|0.189 ± 0.003|0.860 ± 0.028|2.408 ± 0.087|
> |5000|0.715 ± 0.010|0.879 ± 0.010|2.642 ± 0.105|
> |1|1.075 ± 0.034|0.859 ± 0.010|2.629 ± 0.077|
> |5|2.652 ± 0.114|0.865 ± 0.009|2.737 ± 0.099|
> |10|2.905 ± 0.027|0.881 ± 0.005|2.820 ± 0.474|
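The bounding idea discussed on this ticket (binary-searching an index-sorted field before iterating doc values) can be sketched with a plain array standing in for the sorted per-doc values. This is an illustration of the technique only, not the actual IndexSortDocValuesNumbersQuery code; the names are made up.

```java
// Sketch: if docs are index-sorted by a numeric field, the docs matching a
// value range form a contiguous slice, which binary search can bound in
// O(log n) before any per-doc iteration happens. Not the actual Lucene code.
class SortedBound {
    // First index whose value is >= target (classic lower-bound search).
    static int lowerBound(long[] sortedValues, long target) {
        int lo = 0, hi = sortedValues.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedValues[mid] < target) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }

    // Half-open [from, to) doc range whose values fall in [min, max]
    // (assumes max < Long.MAX_VALUE so max + 1 does not overflow).
    static int[] bound(long[] sortedValues, long min, long max) {
        return new int[] {lowerBound(sortedValues, min), lowerBound(sortedValues, max + 1)};
    }
}
```

As the comment above observes, with query values drawn uniformly from [0, 1_000_000_000] this bounded slice covers nearly the whole doc space, so the bounding alone would not explain the benchmark speed-up.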
[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314116#comment-17314116 ] Greg Miller commented on LUCENE-9850: - I think I've actually made good progress on optimizing this. The last bit to crack was reading the exception bytes in all at once instead of byte-by-byte. The benchmark results look pretty promising to me. I'd be curious what others think of these results. If others agree they look promising, I'll tidy up the [code|https://github.com/apache/lucene/compare/main...gsmiller:LUCENE-9850/pfordocids2], write some tests, and put a PR out there. And just as a reminder, this change reduces the doc ID data in the Wikipedia benchmark index by *~12%*, resulting in a *~3.3% overall index size reduction*. Benchmark results are as follows. It does look like there's still a significant regression in some tasks, but much smaller than before. And now there are also some significant QPS improvements for some other tasks. Seems like decent results overall, but others will have more opinions on this I'm sure. 
{code:java}
Task                         QPS baseline (StdDev)   QPS pfordocids (StdDev)   Pct diff              p-value
HighTermDayOfYearSort         76.66 (13.5%)           74.15 (12.0%)            -3.3% ( -25% -  25%)  0.418
HighTermMonthSort            105.44 (10.3%)          102.66 ( 8.5%)            -2.6% ( -19% -  18%)  0.379
LowSpanNear                   38.95 ( 2.3%)           38.05 ( 1.7%)            -2.3% (  -6% -   1%)  0.000
AndHighMed                    85.41 ( 2.6%)           83.58 ( 2.5%)            -2.1% (  -7% -   3%)  0.008
OrHighHigh                    16.29 ( 1.9%)           16.00 ( 6.1%)            -1.8% (  -9% -   6%)  0.209
TermDTSort                   186.08 (11.7%)          182.87 (10.0%)            -1.7% ( -20% -  22%)  0.616
MedSpanNear                   12.45 ( 2.2%)           12.24 ( 1.6%)            -1.7% (  -5% -   2%)  0.005
HighIntervalsOrdered           6.11 ( 2.8%)            6.00 ( 2.8%)            -1.7% (  -7% -   4%)  0.061
HighTermTitleBDVSort          61.20 (10.8%)           60.19 (10.2%)            -1.7% ( -20% -  21%)  0.617
MedSloppyPhrase               29.64 ( 4.6%)           29.16 ( 4.5%)            -1.6% ( -10% -   7%)  0.262
AndHighHigh                   58.98 ( 1.9%)           58.05 ( 2.2%)            -1.6% (  -5% -   2%)  0.015
OrHighMed                     62.28 ( 2.6%)           61.33 ( 6.0%)            -1.5% (  -9% -   7%)  0.293
LowSloppyPhrase               11.15 ( 3.2%)           10.98 ( 3.2%)            -1.5% (  -7% -   5%)  0.134
HighSpanNear                   7.33 ( 2.7%)            7.25 ( 2.3%)            -1.2% (  -6% -   3%)  0.143
Prefix3                      239.67 (15.4%)          237.28 (18.6%)            -1.0% ( -30% -  38%)  0.853
LowPhrase                    131.88 ( 2.3%)          130.75 ( 2.6%)            -0.9% (  -5% -   4%)  0.264
OrHighLow                    264.48 ( 5.4%)          262.67 ( 5.7%)            -0.7% ( -11% -  10%)  0.695
BrowseDayOfYearSSDVFacets     11.57 ( 4.5%)           11.49 ( 5.0%)            -0.7% (  -9% -   9%)  0.661
BrowseMonthSSDVFacets         12.38 ( 6.6%)           12.31 ( 7.9%)            -0.6% ( -14% -  14%)  0.799
Fuzzy2                        52.44 (12.0%)           52.24 (12.3%)            -0.4% ( -22% -  27%)  0.920
PKLookup                     146.58 ( 3.8%)          146.05 ( 4.1%)            -0.4% (  -7% -   7%)  0.774
HighSloppyPhrase               1.78 ( 5.1%)            1.77 ( 4.2%)            -0.3% (  -9% -   9%)  0.839
BrowseDayOfYearTaxoFacets      4.15 ( 3.4%)            4.15 ( 2.9%)            -0.0% (  -6% -   6%)  0.962
BrowseDateTaxoFacets           4.15 ( 3.4%)            4.14 ( 2.8%)            -0.0% (  -6% -   6%)  0.964
BrowseMonthTaxoFacets          5.02 ( 2.7%)            5.02 ( 2.4%)             0.1% (  -4% -   5%)  0.920
MedPhrase                     87.31 ( 3.1%)           87.43 ( 3.1%)             0.1% (  -5% -   6%)  0.893
Respell                       44.49 ( 2.2%)           44.74 ( 2.8%)             0.6% (  -4% -   5%)  0.484
Wildcard                      54.13 ( 3.3%)           54.49 ( 3.8%)             0.7% (  -6% -   7%)  0.557
OrHighNotMed                 521.75 ( 5.5%)          526.04 ( 6.2%)             0.8% ( -10% -  13%)  0.659
Fuzzy1                        54.77 ( 9.9%)           55.25 ( 7.8%)             0.9% ( -15% -  20%)  0.753
IntNRQ                        94.71 ( 0.5%)           95.68 ( 1.3%)             1.0% (   0% -   2%)  0.001
OrNotHighLow                 618.18 ( 5.1%)          624.93 ( 6.1%)             1.1% (  -9% -  12%)  0.537
OrNotHighMed                 473.89 ( 3.2%)          479.65 ( 5.4%)             1.2% (  -7% -  10%)  0.388
MedTerm                     1115.74 ( 4.5%)         1130.77 ( 6.8%)             1.3% (  -9% -  13%)  0.463
OrHighNotHigh                428.11 ( 4.0%)          436.25 ( 6.9%)             1.9% (  -8% -  13%)  0.286
HighPhrase                    86.34 ( 2.6%)           87.99 ( 3.8%)             1.9% (  -4% -
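The exception-patching step being profiled and optimized in these comments can be sketched in isolation. This is a simplified illustration of the PFOR idea (pack most deltas at a small bit width, then patch the few outliers back in), not the actual PForUtil code; the layout of the exception bytes here is made up.

```java
// Simplified PFOR-style patching (illustrative, not Lucene's PForUtil).
// decoded holds values already unpacked at bitsPerValue bits each;
// exceptionBytes holds (position, highBits) pairs for the outliers whose
// high-order bits didn't fit at that width. Having the exception bytes in
// one array models the "read them all at once" optimization, versus one
// readByte() call per byte.
class PforPatch {
    static void applyExceptions(long[] decoded, int bitsPerValue, byte[] exceptionBytes) {
        for (int i = 0; i < exceptionBytes.length; i += 2) {
            int pos = exceptionBytes[i] & 0xFF;        // which slot to patch
            long high = exceptionBytes[i + 1] & 0xFF;  // the missing high bits
            decoded[pos] |= high << bitsPerValue;      // restore the full value
        }
    }
}
```

The design trade-off discussed on this ticket is exactly this loop: it is pure extra work relative to FOR, paid in exchange for the smaller bit width on the non-exception values.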
[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-9850: Attachment: bulk_read_2.png > Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: apply_exceptions.png, bulk_read_1.png, bulk_read_2.png, > for.png, pfor.png > > > It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. > Right now PFOR is used for positions, frequencies and payloads, but FOR is > used for doc ID deltas. From a recent > [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E] > on the dev mailing list, it sounds like this decision was made based on the > optimization possible when expanding the deltas. > I'd be interested in measuring the index size reduction possible with > switching to PFOR compared to the performance reduction we might see by no > longer being able to apply the deltas in as optimal a way.
[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-9850: Attachment: bulk_read_1.png > Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: apply_exceptions.png, bulk_read_1.png, for.png, pfor.png > > > It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. > Right now PFOR is used for positions, frequencies and payloads, but FOR is > used for doc ID deltas. From a recent > [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E] > on the dev mailing list, it sounds like this decision was made based on the > optimization possible when expanding the deltas. > I'd be interested in measuring the index size reduction possible with > switching to PFOR compared to the performance reduction we might see by no > longer being able to apply the deltas in as optimal a way.
[jira] [Comment Edited] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314007#comment-17314007 ] Greg Miller edited comment on LUCENE-9850 at 4/2/21, 6:55 PM: -- I'm getting in the weeds a bit on this probably, but since it appears JDK Mission Control now provides flame charts, we can now see where the PFOR approach is spending extra time (on top of FOR). Here's FOR as a baseline: !for.png|width=850,height=134! And here's my latest version with PFOR: !pfor.png|width=855,height=133! It looks like applying exceptions (PForUtil.applyExceptionsIn32Space) is accounting for a good chunk of where decoding is spending its time. This is pure extra overhead. And digging into that, it looks like ~2/3 of that time is being spent actually reading the bytes (position and high-order bits of the exceptions). !apply_exceptions.png|width=852,height=94! So I don't know. It feels like the algorithm piece of this is close to as optimized as it's going to get right now. Not sure how much can be done about reading those bytes. No way around the need to read them. I know there's also a bulk byte reading method though on DataInput, so I wonder if reading all of the bytes at once would help. I'm going to experiment there a little bit. was (Author: gsmiller): I'm getting in the weeds a bit on this probably, but since it appears JDK Mission Control now provides flame charts, we can now see where the PFOR approach is spending extra time (on top of FOR). Here's FOR as a baseline: !for.png|width=850,height=134! And here's my latest version with PFOR: !pfor.png|width=855,height=133! It looks like applying exceptions (PForUtil.applyExceptionsIn32Space) is accounting for a good chunk of where decoding is spending its time. This is pure extra overhead. And digging into that, it looks like ~2/3 of that time is being spent actually reading the bytes (position and high-order bits of the exceptions). !apply_exceptions.png|width=852,height=94! 
So I don't know. It feels like the algorithm piece of this is close to as optimized as it's going to get right now. Not sure how much can be done about reading those bytes. No way around the need to read them. > Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: apply_exceptions.png, for.png, pfor.png > > > It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. > Right now PFOR is used for positions, frequencies and payloads, but FOR is > used for doc ID deltas. From a recent > [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E] > on the dev mailing list, it sounds like this decision was made based on the > optimization possible when expanding the deltas. > I'd be interested in measuring the index size reduction possible with > switching to PFOR compared to the performance reduction we might see by no > longer being able to apply the deltas in as optimal a way.
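The "reading all of the bytes at once" idea floated in the edited comment can be sketched with java.io streams; this snippet uses only the JDK, though Lucene's DataInput exposes an analogous bulk-read method alongside readByte().

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

// Contrast between per-byte reads and a single bulk read. Functionally
// identical; the bulk version replaces n method calls (and n rounds of
// per-call bookkeeping) with one array fill, which is the kind of saving
// being explored for the PFOR exception bytes.
class BulkRead {
    static byte[] readOneByOne(DataInputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        for (int i = 0; i < n; i++) {
            buf[i] = in.readByte(); // one call per exception byte
        }
        return buf;
    }

    static byte[] readBulk(DataInputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        in.readFully(buf); // one call for the whole exception payload
        return buf;
    }
}
```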
[jira] [Updated] (LUCENE-9823) SynonymQuery rewrite can change field boost calculation
[ https://issues.apache.org/jira/browse/LUCENE-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-9823: - Labels: newdev (was: ) > SynonymQuery rewrite can change field boost calculation > --- > > Key: LUCENE-9823 > URL: https://issues.apache.org/jira/browse/LUCENE-9823 > Project: Lucene - Core > Issue Type: Bug >Reporter: Julie Tibshirani >Priority: Minor > Labels: newdev > > SynonymQuery accepts a boost per term, which acts as a multiplier on the term > frequency in the document. When rewriting a SynonymQuery with a single term, > we create a BoostQuery wrapping a TermQuery. This changes the meaning of the > boost: it now multiplies the final TermQuery score instead of multiplying the > term frequency before it's passed to the score calculation. > This is a small point, but maybe it's worth avoiding rewriting a single-term > SynonymQuery unless the boost is 1.0. > The same consideration affects CombinedFieldQuery in sandbox.
[jira] [Comment Edited] (LUCENE-9657) Unified Highlighter throws too_complex_to_determinize_exception with >288 filter terms
[ https://issues.apache.org/jira/browse/LUCENE-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314008#comment-17314008 ] Julie Tibshirani edited comment on LUCENE-9657 at 4/2/21, 6:08 PM: --- Closing this out for now, feel free to follow up in Elasticsearch! was (Author: julietibs): Closing this out for now, feel free to follow-up in Elasticsearch! > Unified Highlighter throws too_complex_to_determinize_exception with >288 > filter terms > -- > > Key: LUCENE-9657 > URL: https://issues.apache.org/jira/browse/LUCENE-9657 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.6.2 >Reporter: Isaac Doub >Priority: Major > > There seems to be a problem with the Unified Highlighter in Lucene 8.6.2 that > is affecting ElasticSearch 7.9.1. If a search is performed with >288 filter > terms using the unified highlighter it throws a > too_complex_to_determinize_exception error, but if you switch to the plain > highlighter it works fine. Alternatively, if you filter on a "copy_to" field > instead of the indexed field, it also works. 
> > This throws the error > {code:java} > { > "highlight": { > "type": "unified", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > "bool": { > "must": [{ > "terms": { > "id": [ ">288 terms here" ] > } > }] > } > }] > } > } > }{code} > > > But this works fine > {code:java} > { > "highlight": { > "type": "plain", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > "bool": { > "must": [{ > "terms": { > "id": [ ">288 terms here" ] > } > }] > } > }] > } > } > }{code} > > > Or if I adjust the search to use the copy_to field it works as well (note > "id" is now "_id") > {code:java} > { > "highlight": { > "type": "unified", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > "bool": { > "must": [{ > "terms": { > "_id": [ ">288 terms here" ] > } > }] > } > }] > } > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9657) Unified Highlighter throws too_complex_to_determinize_exception with >288 filter terms
[ https://issues.apache.org/jira/browse/LUCENE-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-9657. -- Resolution: Resolved > Unified Highlighter throws too_complex_to_determinize_exception with >288 > filter terms > -- > > Key: LUCENE-9657 > URL: https://issues.apache.org/jira/browse/LUCENE-9657 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.6.2 >Reporter: Isaac Doub >Priority: Major > > There seems to be a problem with the Unified Highlighter in Lucene 8.6.2 that > is affecting ElasticSearch 7.9.1. If a search is performed with >288 filter > terms using the unified highlighter it throws a > too_complex_to_determinize_exception error, but if you switch to the plain > highlighter it works fine. Alternatively, if you filter on a "copy_to" field > instead of the indexed field, it also works. > > This throws the error > {code:java} > { > "highlight": { > "type": "unified", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > "bool": { > "must": [{ > "terms": { > "id": [ ">288 terms here" ] > } > }] > } > }] > } > } > }{code} > > > But this works fine > {code:java} > { > "highlight": { > "type": "plain", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > "bool": { > "must": [{ > "terms": { > "id": [ ">288 terms here" ] > } > }] > } > }] > } > } > }{code} > > > Or if I adjust the search to use the copy_to field it works as well (note > "id" is now "_id") > {code:java} > { > "highlight": { > "type": "unified", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > "bool": { > "must": [{ > "terms": { > "_id": [ ">288 terms here" ] > } > }] > } > }] > 
} > } > }{code} >
[jira] [Commented] (LUCENE-9657) Unified Highlighter throws too_complex_to_determinize_exception with >288 filter terms
[ https://issues.apache.org/jira/browse/LUCENE-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314008#comment-17314008 ] Julie Tibshirani commented on LUCENE-9657: -- Closing this out for now, feel free to follow-up in Elasticsearch! > Unified Highlighter throws too_complex_to_determinize_exception with >288 > filter terms > -- > > Key: LUCENE-9657 > URL: https://issues.apache.org/jira/browse/LUCENE-9657 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.6.2 >Reporter: Isaac Doub >Priority: Major > > There seems to be a problem with the Unified Highlighter in Lucene 8.6.2 that > is affecting ElasticSearch 7.9.1. If a search is performed with >288 filter > terms using the unified highlighter it throws a > too_complex_to_determinize_exception error, but if you switch to the plain > highlighter it works fine. Alternatively, if you filter on a "copy_to" field > instead of the indexed field, it also works. > > This throws the error > {code:java} > { > "highlight": { > "type": "unified", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > "bool": { > "must": [{ > "terms": { > "id": [ ">288 terms here" ] > } > }] > } > }] > } > } > }{code} > > > But this works fine > {code:java} > { > "highlight": { > "type": "plain", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > "bool": { > "must": [{ > "terms": { > "id": [ ">288 terms here" ] > } > }] > } > }] > } > } > }{code} > > > Or if I adjust the search to use the copy_to field it works as well (note > "id" is now "_id") > {code:java} > { > "highlight": { > "type": "unified", > "fields": { > "title": { > "require_field_match": false > } > } > }, > "query": { > "bool": { > "must": [{ > "query_string": { > "query": "*" > } > }], > "filter": [{ > 
"bool": { > "must": [{ > "terms": { > "_id": [ ">288 terms here" ] > } > }] > } > }] > } > } > }{code} >
[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314007#comment-17314007 ] Greg Miller commented on LUCENE-9850: - I'm getting in the weeds a bit on this probably, but since it appears JDK Mission Control now provides flame charts, we can now see where the PFOR approach is spending extra time (on top of FOR). Here's FOR as a baseline: !for.png|width=850,height=134! And here's my latest version with PFOR: !pfor.png|width=855,height=133! It looks like applying exceptions (PForUtil.applyExceptionsIn32Space) is accounting for a good chunk of where decoding is spending its time. This is pure extra overhead. And digging into that, it looks like ~2/3 of that time is being spent actually reading the bytes (position and high-order bits of the exceptions). !apply_exceptions.png|width=852,height=94! So I don't know. It feels like the algorithm piece of this is close to as optimized as it's going to get right now. Not sure how much can be done about reading those bytes. No way around the need to read them. > Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: apply_exceptions.png, for.png, pfor.png > > > It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. > Right now PFOR is used for positions, frequencies and payloads, but FOR is > used for doc ID deltas. From a recent > [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E] > on the dev mailing list, it sounds like this decision was made based on the > optimization possible when expanding the deltas. 
> I'd be interested in measuring the index size reduction possible with > switching to PFOR compared to the performance reduction we might see by no > longer being able to apply the deltas in as optimal a way.
[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-9850: Attachment: apply_exceptions.png > Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: apply_exceptions.png, for.png, pfor.png > > > It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. > Right now PFOR is used for positions, frequencies and payloads, but FOR is > used for doc ID deltas. From a recent > [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E] > on the dev mailing list, it sounds like this decision was made based on the > optimization possible when expanding the deltas. > I'd be interested in measuring the index size reduction possible with > switching to PFOR compared to the performance reduction we might see by no > longer being able to apply the deltas in as optimal a way.
[GitHub] [lucene] dweiss commented on a change in pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed
dweiss commented on a change in pull request #61: URL: https://github.com/apache/lucene/pull/61#discussion_r606353297 ## File path: gradle/generation/icu.gradle ## @@ -49,6 +49,13 @@ configure(project(":lucene:analysis:icu")) { // May be undefined yet, so use a provider. dependsOn { sourceSets.tools.runtimeClasspath } +// gennorm generates file order-dependent output, so make it constant here. Review comment: Thanks Robert. Enjoy your holidays! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (LUCENE-9857) Skip cache building if IndexOrDocValuesQuery choose the dvQuery
[ https://issues.apache.org/jira/browse/LUCENE-9857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314006#comment-17314006 ] Julie Tibshirani commented on LUCENE-9857: -- Hello [~gf2121]! I think that when we cache an IndexOrDocValuesQuery, we always make sure to use the 'index' version. This is because we pass a lead cost of {{Long.MAX_VALUE}} when building the scorer: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L769. Are you seeing a surprising behavior in tests/benchmarks that led you to file the issue? It could be that we're still missing some detail. > Skip cache building if IndexOrDocValuesQuery choose the dvQuery > --- > > Key: LUCENE-9857 > URL: https://issues.apache.org/jira/browse/LUCENE-9857 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: main (9.0) >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > IndexOrDocValuesQuery can automatically use dvQueries when the cost > 8 * > leadcost, and the LRUQueryCache skips cache building when cost > 250 (by > default) * leadcost. There is a gap between 8 and 250, which means if the > factor is just between 8 and 250 (e.g. cost = 10 * leadcost), the > IndexOrDocValuesQuery will choose the dvQueries but LRUQueryCache still builds > cache for it. > IndexOrDocValuesQuery aims to speed up queries when the leadcost is small, > but building cache by dvScorers can make it meaningless because it needs to > scan all the docvalues. This can be rather slow for big segments, so maybe we > should skip the cache building for IndexOrDocValuesQuery when it chooses > dvQueries.
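The interaction described here can be reduced to the cost comparison involved. This is a schematic only, not the actual IndexOrDocValuesQuery/LRUQueryCache logic: the factor 8 comes from the issue description, and the method name is made up.

```java
// Schematic of the scorer choice: the doc-values scorer wins only when the
// index scorer's cost is much larger than the lead cost. Since the cache
// builds with leadCost = Long.MAX_VALUE, this check is always false there,
// i.e. the cache fills from the 'index' side. Illustrative, not Lucene source.
class ScorerChoice {
    static boolean useDocValues(long indexCost, long leadCost) {
        // Divide instead of multiplying leadCost to avoid overflow at Long.MAX_VALUE.
        return indexCost / 8 > leadCost;
    }
}
```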
[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-9850: Attachment: pfor.png > Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: for.png, pfor.png > > > It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. > Right now PFOR is used for positions, frequencies and payloads, but FOR is > used for doc ID deltas. From a recent > [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E] > on the dev mailing list, it sounds like this decision was made based on the > optimization possible when expanding the deltas. > I'd be interested in measuring the index size reduction possible with > switching to PFOR compared to the performance reduction we might see by no > longer being able to apply the deltas in as optimal a way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-9850: Attachment: for.png > Explore PFOR for Doc ID delta encoding (instead of FOR) > --- > > Key: LUCENE-9850 > URL: https://issues.apache.org/jira/browse/LUCENE-9850 > Project: Lucene - Core > Issue Type: Task > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Greg Miller >Priority: Minor > Attachments: for.png > > > It'd be interesting to explore using PFOR instead of FOR for doc ID encoding. > Right now PFOR is used for positions, frequencies and payloads, but FOR is > used for doc ID deltas. From a recent > [conversation|http://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOp7d_GxNosB5r%3DQMPA-v0SteHWjXUmG3gwQot4gkubWw%40mail.gmail.com%3E] > on the dev mailing list, it sounds like this decision was made based on the > optimization possible when expanding the deltas. > I'd be interested in measuring the index size reduction possible with > switching to PFOR compared to the performance reduction we might see by no > longer being able to apply the deltas in as optimal a way. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313993#comment-17313993 ] Tomoko Uchida commented on LUCENE-9855: --- Thank you for clarifying it! Personally I would always prefer the "one vector format and multiple ann strategies" concept and would love to see a unified vector codec. At the same time, it could be a heavy load for one codec writer and/or reader to carry different implementations for multiple search strategies. I think it would be great if we could factor out the search-algorithm-specific metadata building/en(de)coding logic into separate classes (the good old strategy design pattern, or something like it, could work; perhaps you have already considered it though?). But that would be off-topic here and should be handled in separate issues. > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Priority: Blocker > > There is some discussion about the codec name for ann search. > https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313963#comment-17313963 ] Michael Sokolov commented on LUCENE-9855: - > If we'd like to have another ANN algorithm, say LSH, we would switch the >metadata building and encoding logic in the codec writer/reader >implementation, is my understanding correct? Right, we'd have conditional logic in `Lucene90VectorReader` and `Lucene90VectorWriter` that would do different things depending on the `SearchStrategy`. The reader would have to be able to read a header from the metadata including the strategy, and then read the remainder of the metadata and other data differently, depending on the strategy. Or we could make a new format for LSH and for any other ANN strategy that has a different index file format. > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Priority: Blocker > > There is some discussion about the codec name for ann search. > https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313962#comment-17313962 ] Tomoko Uchida commented on LUCENE-9855: --- Answering my question... maybe I missed this if statement. https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorWriter.java#L124 > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Priority: Blocker > > There is some discussion about the codec name for ann search. > https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313955#comment-17313955 ] Tomoko Uchida commented on LUCENE-9855: --- {quote}bq. I think it will be helpful to consider how we would handle a different ANN implementation. Say LSH. In that case, we would no longer store this graph file (what is currently in .vex files). We would need to add files to store LSH's hash tables, and the metadata would change. Is it the same format? A variant of this format? The current conception is that we would make variations of this single vector format to handle multiple ANN algorithms. {quote} I think I still don't fully understand your thoughts about handling multiple ANN algorithms with a single format, so would you please help me figure out the strategy for switching algorithms. If we'd like to have another ANN algorithm, say LSH, we would switch the metadata building and encoding logic in the codec writer/reader implementation, is my understanding correct? > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Priority: Blocker > > There is some discussion about the codec name for ann search. > https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313940#comment-17313940 ] Michael Sokolov commented on LUCENE-9855: - I think it will be helpful to consider how we would handle a different ANN implementation. Say LSH. In that case, we would no longer store this graph file (what is currently in .vex files). We would need to add files to store LSH's hash tables, and the metadata would change. Is it the same format? A variant of this format? The current conception is that we would make variations of this single vector format to handle multiple ANN algorithms. We currently only have one, so it doesn't look that way, but anyway with that background a generic name like VectorsFormat (or NumericVectorsFormat to distinguish from DocVectors etc) makes sense. On the other hand if you think we would create a new Format to represent this different kind of data that we are storing to disk, which will have its own de/serialization code (even if some of it would be the same), then we should pick a name that incorporates the algorithm, and by the way also get rid of the whole concept of {{SearchStrategy}}. I think this is the fundamental question here: one format, multiple ANN strategies, or one format per ANN strategy? I thought it had been sorted out in our earlier discussions, but not everybody may have been following that closely. > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Priority: Blocker > > There is some discussion about the codec name for ann search. 
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
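The "one format, multiple ANN strategies" option Michael describes — a reader that decodes a strategy id from the metadata header and then reads the rest of the files differently — can be sketched as below. All class, method, and file names here are illustrative assumptions; this is not `Lucene90VectorReader`'s actual structure, just a minimal model of the conditional dispatch being discussed.

```java
import java.util.Map;

// Toy model of one vector format dispatching over multiple ANN strategies.
// Names are assumptions, not Lucene's real classes or file extensions.
interface AnnReader {
  String describe();
}

final class HnswReaderSketch implements AnnReader {
  private final byte[] graphData; // stands in for the current graph (.vex) data
  HnswReaderSketch(byte[] graphData) { this.graphData = graphData; }
  public String describe() { return "HNSW graph, " + graphData.length + " bytes"; }
}

final class LshReaderSketch implements AnnReader {
  private final byte[] hashTables; // LSH would store hash tables instead
  LshReaderSketch(byte[] hashTables) { this.hashTables = hashTables; }
  public String describe() { return "LSH tables, " + hashTables.length + " bytes"; }
}

final class VectorReaderDispatch {
  enum SearchStrategy { HNSW, LSH }

  // The conditional logic: after reading the strategy from the metadata
  // header, decode the remaining data in a strategy-specific way.
  static AnnReader open(SearchStrategy strategy, Map<String, byte[]> files) {
    switch (strategy) {
      case HNSW: return new HnswReaderSketch(files.get("vex"));
      case LSH:  return new LshReaderSketch(files.get("lsh"));
      default:   throw new IllegalStateException("unknown strategy " + strategy);
    }
  }
}
```

The alternative design on the table — one format per ANN strategy — would instead make each branch above its own Format class with its own de/serialization code, removing the need for a `SearchStrategy` switch entirely.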
[jira] [Commented] (LUCENE-9855) Reconsider codec name VectorFormat
[ https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313937#comment-17313937 ] Tomoko Uchida commented on LUCENE-9855: --- I had almost given up on this in the face of the difficulty of naming, but [~julietibs] would be right: we should pursue giving a proper name to our "vector"s (yes, I think it is certainly nothing other than a "vector", although I understand the term is already overloaded many times). NumericVectorsFormat, VectorValuesFormat, or otherwise DenseVectorsFormat - any of them looks fine to me if "VectorsFormat" is too vague for us. On the other hand, when we look closely at the current codec Writer/Reader implementation, it seems to be tightly coupled with HNSW graph building, encoding, and searching. From my viewpoint we should manage to decouple the HNSW-specific code from the Codec in the near future, to resolve the inconsistency between its name and implementation... > Reconsider codec name VectorFormat > -- > > Key: LUCENE-9855 > URL: https://issues.apache.org/jira/browse/LUCENE-9855 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Priority: Blocker > > There is some discussion about the codec name for ann search. > https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E > Main points here are 1) use plural form for consistency, and 2) use more > specific name for ann search (second point could be optional). > A few alternatives were proposed: > - VectorsFormat > - VectorValuesFormat > - NeighborsFormat > - DenseVectorsFormat -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a change in pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed
rmuir commented on a change in pull request #61: URL: https://github.com/apache/lucene/pull/61#discussion_r606244542 ## File path: gradle/generation/icu.gradle ## @@ -49,6 +49,13 @@ configure(project(":lucene:analysis:icu")) { // May be undefined yet, so use a provider. dependsOn { sourceSets.tools.runtimeClasspath } +// gennorm generates file order-dependent output, so make it constant here. Review comment: @dweiss it is my understanding order doesn't matter. so `a > b` followed by `b > c` is the same as `b > c` followed by `a > b`. the normalization will happen recursively so that 'a' maps to 'c' in either case. this recursive process happens at gennorm2 'compile time'. but in version 2 of the file format, they added a mechanism to recover the original non-recursive mappings, so this may cause a data file change even though semantics are unchanged? just my guess... more details: http://site.icu-project.org/design/normalization/custom -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
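The order-independence argument above — mappings are resolved recursively at gennorm2 "compile time", so `a > b` plus `b > c` collapses to `a > c` either way — can be modeled in a few lines. This is a toy illustration assuming acyclic one-character-to-one-character rules, not ICU's actual algorithm (which handles full normalization data, not just simple chains):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of recursive mapping resolution: each rule chain is followed
// to its end, so {a>b, b>c} and {b>c, a>b} both resolve 'a' to 'c'.
// Assumes the rule set is acyclic.
final class RecursiveMappings {
  static Map<Character, Character> resolve(Map<Character, Character> rules) {
    Map<Character, Character> resolved = new HashMap<>();
    for (Map.Entry<Character, Character> rule : rules.entrySet()) {
      Character target = rule.getValue();
      // Follow the chain "recursively" until no further rule applies.
      while (rules.containsKey(target)) {
        target = rules.get(target);
      }
      resolved.put(rule.getKey(), target);
    }
    return resolved;
  }
}
```

Since the resolved result is the same regardless of rule order, any byte-level difference in the generated data file would come from the non-recursive-mapping recovery mechanism mentioned above, not from a semantic change.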
[jira] [Comment Edited] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313877#comment-17313877 ] Greg Miller edited comment on LUCENE-9850 at 4/2/21, 1:41 PM: -- I haven't really been following along with what's going on in JDK17, but being able to more explicitly generate vectorized instructions will be nice! I optimized my branch a bit further. At this point I think it has all the optimizations of ForDeltaUtil, and the only extra work is applying the exceptions. I even pulled in the optimization to apply the prefix to two values at a time (packed in a long). That code is over [here|https://github.com/apache/lucene/compare/main...gsmiller:pfordocid-opto3]. (It's not polished at all... a bit hacky) If you look at the code, one thing that may not be obvious is how encodePrefixSum differs from encode. In ForDeltaUtil, the 0 bpv case implies that all values were "1", while PForUtil encodes the actual "same" value, allowing it to take on values other than "1". While it's pretty trivial, this meant checking that value in decoding and special-casing 1 vs. non-1 values. I adopted the ForDeltaUtil logic here and assume it's always 1. It's minor and I suspect we could remove this optimization, but I'm trying to get the decode logic as close to ForDeltaUtil as possible. The benchmark results are looking a lot better now, but maybe still some regressions. I've seen a little variability in these results, so I'm not sure how often they might present false-regression results on individual tasks? 
Here's what I've got at this point: {code:java} TaskQPS baseline StdDevQPS pfordocids StdDev Pct diff p-value LowSpanNear 98.61 (2.2%) 95.57 (1.7%) -3.1% ( -6% -0%) 0.000 OrNotHighHigh 545.01 (3.8%) 531.11 (5.3%) -2.6% ( -11% -6%) 0.078 Wildcard 40.83 (4.1%) 40.05 (3.9%) -1.9% ( -9% -6%) 0.132 OrHighMed 102.39 (2.5%) 100.50 (2.5%) -1.8% ( -6% -3%) 0.021 AndHighHigh 50.93 (3.3%) 50.03 (3.1%) -1.8% ( -7% -4%) 0.079 TermDTSort 98.42 (11.6%) 96.72 (14.4%) -1.7% ( -24% - 27%) 0.676 AndHighMed 68.10 (2.9%) 66.94 (2.9%) -1.7% ( -7% -4%) 0.063 HighTerm 1169.43 (4.4%) 1151.70 (5.1%) -1.5% ( -10% -8%) 0.314 BrowseMonthSSDVFacets 12.50 (5.6%) 12.31 (7.7%) -1.5% ( -14% - 12%) 0.480 HighTermTitleBDVSort 157.42 (14.7%) 155.08 (15.6%) -1.5% ( -27% - 33%) 0.757 OrHighNotLow 545.83 (5.7%) 537.85 (7.0%) -1.5% ( -13% - 12%) 0.472 MedSpanNear 28.75 (2.4%) 28.34 (1.9%) -1.4% ( -5% -2%) 0.038 OrHighNotHigh 533.41 (4.6%) 526.33 (5.3%) -1.3% ( -10% -8%) 0.394 Fuzzy1 59.47 (6.0%) 58.72 (6.8%) -1.3% ( -13% - 12%) 0.533 HighSpanNear 21.27 (2.6%) 21.03 (2.2%) -1.1% ( -5% -3%) 0.153 HighTermMonthSort 128.50 (12.2%) 127.11 (11.0%) -1.1% ( -21% - 25%) 0.769 OrNotHighLow 640.89 (4.0%) 634.43 (3.5%) -1.0% ( -8% -6%) 0.395 OrHighHigh 21.11 (2.0%) 20.91 (1.8%) -1.0% ( -4% -2%) 0.113 MedPhrase 103.90 (3.0%) 103.05 (2.9%) -0.8% ( -6% -5%) 0.381 HighPhrase 172.59 (2.5%) 171.22 (2.5%) -0.8% ( -5% -4%) 0.320 OrHighNotMed 535.67 (4.8%) 531.54 (4.7%) -0.8% ( -9% -9%) 0.607 LowTerm 1094.41 (2.9%) 1087.97 (3.1%) -0.6% ( -6% -5%) 0.535 MedSloppyPhrase 12.91 (2.4%) 12.85 (2.5%) -0.5% ( -5% -4%) 0.542 IntNRQ 101.21 (0.5%) 100.81 (0.7%) -0.4% ( -1% -0%) 0.040 PKLookup 144.62 (3.0%) 144.11 (3.1%) -0.4% ( -6% -5%) 0.715 HighSloppyPhrase3.75 (2.9%)3.74 (3.0%) -0.3% ( -6% -5%) 0.726 HighIntervalsOrdered 16.00 (2.1%) 15.95 (1.8%) -0.3% ( -4% -3%) 0.597 HighTermDayOfYearSort 109.37 (11.0%) 109.03 (15.2%) -0.3% ( -23% - 29%) 0.941 LowSloppyPhrase 41.05 (1.9%) 40.93 (2.1%) -0.3% ( -4% -3%) 0.635 MedTerm 1137.13 (4.1%) 
1134.84 (4.2%) -0.2% ( -8% -8%) 0.877 BrowseDayOfYearTaxoFacets4.24 (3.4%)4.23
[jira] [Commented] (LUCENE-9850) Explore PFOR for Doc ID delta encoding (instead of FOR)
[ https://issues.apache.org/jira/browse/LUCENE-9850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313877#comment-17313877 ] Greg Miller commented on LUCENE-9850: - I haven't really been following along with what's going on in JDK17, but being able to more explicitly generate vectorized instructions will be nice! I optimized my branch a bit further. At this point I think it has all the optimizations of ForDeltaUtil, and the only extra work is applying the exceptions. I even pulled in the optimization to apply the prefix to two values at a time (packed in a long). That code is over [here|https://github.com/apache/lucene/compare/main...gsmiller:pfordocid-opto3]. (It's not polished at all... a bit hacky) The benchmark results are looking at lot better now, but maybe still some regressions. I've seen a little variability in these results, so I'm not sure how often they might present false-regression results on individual tasks? Here's what I've got at this point: {code:java} TaskQPS baseline StdDevQPS pfordocids StdDev Pct diff p-value LowSpanNear 98.61 (2.2%) 95.57 (1.7%) -3.1% ( -6% -0%) 0.000 OrNotHighHigh 545.01 (3.8%) 531.11 (5.3%) -2.6% ( -11% -6%) 0.078 Wildcard 40.83 (4.1%) 40.05 (3.9%) -1.9% ( -9% -6%) 0.132 OrHighMed 102.39 (2.5%) 100.50 (2.5%) -1.8% ( -6% -3%) 0.021 AndHighHigh 50.93 (3.3%) 50.03 (3.1%) -1.8% ( -7% -4%) 0.079 TermDTSort 98.42 (11.6%) 96.72 (14.4%) -1.7% ( -24% - 27%) 0.676 AndHighMed 68.10 (2.9%) 66.94 (2.9%) -1.7% ( -7% -4%) 0.063 HighTerm 1169.43 (4.4%) 1151.70 (5.1%) -1.5% ( -10% -8%) 0.314 BrowseMonthSSDVFacets 12.50 (5.6%) 12.31 (7.7%) -1.5% ( -14% - 12%) 0.480 HighTermTitleBDVSort 157.42 (14.7%) 155.08 (15.6%) -1.5% ( -27% - 33%) 0.757 OrHighNotLow 545.83 (5.7%) 537.85 (7.0%) -1.5% ( -13% - 12%) 0.472 MedSpanNear 28.75 (2.4%) 28.34 (1.9%) -1.4% ( -5% -2%) 0.038 OrHighNotHigh 533.41 (4.6%) 526.33 (5.3%) -1.3% ( -10% -8%) 0.394 Fuzzy1 59.47 (6.0%) 58.72 (6.8%) -1.3% ( -13% - 12%) 0.533 HighSpanNear 21.27 (2.6%) 
21.03 (2.2%) -1.1% ( -5% -3%) 0.153 HighTermMonthSort 128.50 (12.2%) 127.11 (11.0%) -1.1% ( -21% - 25%) 0.769 OrNotHighLow 640.89 (4.0%) 634.43 (3.5%) -1.0% ( -8% -6%) 0.395 OrHighHigh 21.11 (2.0%) 20.91 (1.8%) -1.0% ( -4% -2%) 0.113 MedPhrase 103.90 (3.0%) 103.05 (2.9%) -0.8% ( -6% -5%) 0.381 HighPhrase 172.59 (2.5%) 171.22 (2.5%) -0.8% ( -5% -4%) 0.320 OrHighNotMed 535.67 (4.8%) 531.54 (4.7%) -0.8% ( -9% -9%) 0.607 LowTerm 1094.41 (2.9%) 1087.97 (3.1%) -0.6% ( -6% -5%) 0.535 MedSloppyPhrase 12.91 (2.4%) 12.85 (2.5%) -0.5% ( -5% -4%) 0.542 IntNRQ 101.21 (0.5%) 100.81 (0.7%) -0.4% ( -1% -0%) 0.040 PKLookup 144.62 (3.0%) 144.11 (3.1%) -0.4% ( -6% -5%) 0.715 HighSloppyPhrase3.75 (2.9%)3.74 (3.0%) -0.3% ( -6% -5%) 0.726 HighIntervalsOrdered 16.00 (2.1%) 15.95 (1.8%) -0.3% ( -4% -3%) 0.597 HighTermDayOfYearSort 109.37 (11.0%) 109.03 (15.2%) -0.3% ( -23% - 29%) 0.941 LowSloppyPhrase 41.05 (1.9%) 40.93 (2.1%) -0.3% ( -4% -3%) 0.635 MedTerm 1137.13 (4.1%) 1134.84 (4.2%) -0.2% ( -8% -8%) 0.877 BrowseDayOfYearTaxoFacets4.24 (3.4%)4.23 (3.2%) -0.2% ( -6% -6%) 0.885 Prefix3 263.31 (9.1%) 263.08 (9.2%) -0.1% ( -16% - 20%) 0.976 BrowseDateTaxoFacets4.23 (3.4%)4.23 (3.2%) -0.1% ( -6% -6%) 0.941 Fuzzy2 44.14 (13.1%) 44.17 (14.2%) 0.1% ( -24% - 31%) 0.990 BrowseMonthTaxoFacets5.00 (2.6%)5.01 (2.2%) 0.1% ( -4% -4%) 0.878 OrNotHighMed 508.22 (3.6%) 508.82 (3.9%) 0.1% ( -7% -7%) 0.921 Respell 38.75 (2.3%) 38.82 (2.2%)
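For readers following along, the exception-patching that distinguishes PFOR from plain FOR — "the only extra work is applying the exceptions" — can be sketched as below. This is an illustrative toy, not Lucene's `PForUtil` (which bit-packs the low bits and applies the SIMD-friendly decode optimizations discussed above); names and the storage layout are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of patched frame-of-reference (PFOR): pick a bit width
// that covers most values, and record outliers ("exceptions") separately
// so their high bits can be patched back in at decode time.
final class PforSketch {
  // exceptions entries are {index, highBits} pairs.
  record Encoded(int bitsPerValue, long[] lowBits, List<int[]> exceptions) {}

  static Encoded encode(long[] values, int bitsPerValue) {
    long mask = (1L << bitsPerValue) - 1;
    long[] low = new long[values.length];
    List<int[]> exceptions = new ArrayList<>();
    for (int i = 0; i < values.length; i++) {
      low[i] = values[i] & mask; // would be bit-packed in a real encoder
      long high = values[i] >>> bitsPerValue;
      if (high != 0) { // an outlier that doesn't fit in bitsPerValue bits
        exceptions.add(new int[] {i, (int) high});
      }
    }
    return new Encoded(bitsPerValue, low, exceptions);
  }

  static long[] decode(Encoded e) {
    long[] out = e.lowBits().clone();
    // The "extra work" relative to plain FOR: patch exceptions back in.
    for (int[] exc : e.exceptions()) {
      out[exc[0]] |= ((long) exc[1]) << e.bitsPerValue();
    }
    return out;
  }
}
```

With few exceptions per block, most values decode with the cheap FOR path and only the patch loop is added — which is why the regressions above are small.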
[jira] [Assigned] (LUCENE-9901) UnicodeData.java has no regeneration task
[ https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss reassigned LUCENE-9901: --- Assignee: Dawid Weiss > UnicodeData.java has no regeneration task > - > > Key: LUCENE-9901 > URL: https://issues.apache.org/jira/browse/LUCENE-9901 > Project: Lucene - Core > Issue Type: Sub-task > Components: modules/analysis >Reporter: Uwe Schindler >Assignee: Dawid Weiss >Priority: Major > > When moving build system to gradle, we lost the following groovy script, > which is used to regenerate the UnicodeData.java file to be in line with the > actually used ICU4J version. The groovy script is still in the repository, > but it's no longer used by the build system: > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy > To execute it, we need to convert it to a Gradle task that depends on the > same version of ICU4J that we use as analysis/icu dependency Not sure how to > do this, maybe it's easy using palantir). > The file should also be hashed and put into the regenerated file hases: > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java > Old Ant task is here: > https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9901) UnicodeData.java has no regeneration task
[ https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313871#comment-17313871 ] Dawid Weiss commented on LUCENE-9901: - Thanks Uwe. I wonder if it's the only missing bit - there may be more! > UnicodeData.java has no regeneration task > - > > Key: LUCENE-9901 > URL: https://issues.apache.org/jira/browse/LUCENE-9901 > Project: Lucene - Core > Issue Type: Sub-task > Components: modules/analysis >Reporter: Uwe Schindler >Assignee: Dawid Weiss >Priority: Major > > When moving build system to gradle, we lost the following groovy script, > which is used to regenerate the UnicodeData.java file to be in line with the > actually used ICU4J version. The groovy script is still in the repository, > but it's no longer used by the build system: > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy > To execute it, we need to convert it to a Gradle task that depends on the > same version of ICU4J that we use as analysis/icu dependency Not sure how to > do this, maybe it's easy using palantir). > The file should also be hashed and put into the regenerated file hases: > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java > Old Ant task is here: > https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9901) UnicodeData.java has no regeneration task
[ https://issues.apache.org/jira/browse/LUCENE-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler updated LUCENE-9901: -- Description: When moving build system to gradle, we lost the following groovy script, which is used to regenerate the UnicodeData.java file to be in line with the actually used ICU4J version. The groovy script is still in the repository, but it's no longer used by the build system: https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy To execute it, we need to convert it to a Gradle task that depends on the same version of ICU4J that we use as analysis/icu dependency. Not sure how to do this (maybe it's easy using palantir). The file should also be hashed and put into the regenerated file hashes: https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java Old Ant task is here: https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94 was: When proting to gradle, we lost the following groovy script, which is used to regenerate the UnicodeData.java file to be in line with the actually used ICU4J version. The groovy script is still in the repository, but it's no longer used by the build system: https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy To execute it, we need to convert it to a Gradle task that depends on the same version of ICU4J that we use as analysis/icu dependency. 
The file should also be hashed and put into the regenerated file hases: https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java Old Ant task is here: https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94 > UnicodeData.java has no regeneration task > - > > Key: LUCENE-9901 > URL: https://issues.apache.org/jira/browse/LUCENE-9901 > Project: Lucene - Core > Issue Type: Sub-task > Components: modules/analysis >Reporter: Uwe Schindler >Priority: Major > > When moving build system to gradle, we lost the following groovy script, > which is used to regenerate the UnicodeData.java file to be in line with the > actually used ICU4J version. The groovy script is still in the repository, > but it's no longer used by the build system: > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy > To execute it, we need to convert it to a Gradle task that depends on the > same version of ICU4J that we use as analysis/icu dependency. Not sure how to > do this (maybe it's easy using palantir). > The file should also be hashed and put into the regenerated file hashes: > https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java > Old Ant task is here: > https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9901) UnicodeData.java has no regeneration task
Uwe Schindler created LUCENE-9901: - Summary: UnicodeData.java has no regeneration task Key: LUCENE-9901 URL: https://issues.apache.org/jira/browse/LUCENE-9901 Project: Lucene - Core Issue Type: Sub-task Components: modules/analysis Reporter: Uwe Schindler When porting to gradle, we lost the following groovy script, which is used to regenerate the UnicodeData.java file to be in line with the actually used ICU4J version. The groovy script is still in the repository, but it's no longer used by the build system: https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/tools/groovy/generate-unicode-data.groovy To execute it, we need to convert it to a Gradle task that depends on the same version of ICU4J that we use as analysis/icu dependency. The file should also be hashed and put into the regenerated file hashes: https://github.com/apache/lucene/blob/d5d6dc079395c47cd6d12dcce3bcfdd2c7d9dc63/lucene/analysis/common/src/java/org/apache/lucene/analysis/util/UnicodeProps.java Old Ant task is here: https://github.com/apache/lucene-solr/blob/branch_8x/lucene/analysis/common/build.xml#L91-L94 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9623) Add module descriptor (module-info.java) to lucene jars
[ https://issues.apache.org/jira/browse/LUCENE-9623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313763#comment-17313763 ] Tomoko Uchida commented on LUCENE-9623: --- bq. What packages are exported and what are kept private - and how do we make the decision or get consensus about it (upon such a large codebase)? Still I haven't come up with a good idea for that, but we might be able to get some statistics/numbers by searching OSS repositories that use Lucene public APIs. For example, GitHub provides APIs to programmatically search repositories that are hosted on it. Of course OSS code may be only a small fraction of the whole, but concrete real-world use cases could give us some clues for deciding what classes/packages should be kept public. Just a random thought. > Add module descriptor (module-info.java) to lucene jars > --- > > Key: LUCENE-9623 > URL: https://issues.apache.org/jira/browse/LUCENE-9623 > Project: Lucene - Core > Issue Type: Improvement > Components: general/build >Affects Versions: main (9.0) >Reporter: Tomoko Uchida >Priority: Major > Attachments: generate-all-module-info.sh > > > For a starter, module descriptors can be automatically generated by jdeps > utility. > There are two choices. > 1. generate "open" modules which allows reflective accesses with > --generate-open-module option > 2. generate non-open modules with --generate-module-info option > Which is the better - not fully sure, but maybe 2 (non-open modules)? > Also, we need to choose proper module names - just using gradle project path > for it is OK? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
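To make the open-vs-non-open question above concrete, a non-open module descriptor generated along the lines of `jdeps --generate-module-info` would look roughly like the fragment below. The module name and package names here are purely illustrative guesses, not a proposal:

```java
// Illustrative (non-open) module descriptor; all names are assumptions.
module org.apache.lucene.core {
  requires java.logging;            // whatever jdeps detects as dependencies
  exports org.apache.lucene.index;  // public API packages are exported...
  exports org.apache.lucene.search;
  // ...internal packages are simply omitted, keeping them module-private.
  // An "open" module (option 1) would instead be declared `open module`,
  // permitting deep reflective access into all of its packages.
}
```

This shows why the export decision matters: every package not listed becomes inaccessible to consumers on the module path, so the choice effectively draws the public-API boundary.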
[jira] [Resolved] (LUCENE-9900) Regenerate/ run ICU only if inputs changed
[ https://issues.apache.org/jira/browse/LUCENE-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-9900. - Fix Version/s: main (9.0) Resolution: Fixed > Regenerate/ run ICU only if inputs changed > -- > > Key: LUCENE-9900 > URL: https://issues.apache.org/jira/browse/LUCENE-9900 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: main (9.0) > > Time Spent: 0.5h > Remaining Estimate: 0h >
[jira] [Commented] (LUCENE-9900) Regenerate/ run ICU only if inputs changed
[ https://issues.apache.org/jira/browse/LUCENE-9900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313756#comment-17313756 ] ASF subversion and git services commented on LUCENE-9900: - Commit 010e3a1ba960ea0dcd7e5092073d2a55a21b0ac9 in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=010e3a1 ] LUCENE-9900: Regenerate/ run ICU only if inputs changed (#61) > Regenerate/ run ICU only if inputs changed > -- > > Key: LUCENE-9900 > URL: https://issues.apache.org/jira/browse/LUCENE-9900 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h >
[GitHub] [lucene] dweiss merged pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed
dweiss merged pull request #61: URL: https://github.com/apache/lucene/pull/61 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] noblepaul merged pull request #2479: SOLR-15288: Hardening NODEDOWN event event in PRS mode
noblepaul merged pull request #2479: URL: https://github.com/apache/lucene-solr/pull/2479
[GitHub] [lucene] dweiss commented on a change in pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed
dweiss commented on a change in pull request #61: URL: https://github.com/apache/lucene/pull/61#discussion_r606134465 ## File path: gradle/generation/icu.gradle ## @@ -49,6 +49,13 @@ configure(project(":lucene:analysis:icu")) { // May be undefined yet, so use a provider. dependsOn { sourceSets.tools.runtimeClasspath } +// gennorm generates file order-dependent output, so make it constant here. Review comment: @rmuir gennorm seems to be order-sensitive - I changed the order to sorted file names - that's why it's regenerated as part of this patch. Is this order somehow important? I can revert to what it was before but this way it's simpler.
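Since gennorm's output depends on the order of its input files, one way to pin that order is to sort the file names before handing them to the task. A minimal sketch, assuming the utr30 data directory referenced by icu.gradle and an illustrative variable name:

```groovy
// Hypothetical sketch: pin the input file order so gennorm output is stable
// across machines and file systems. 'nrmSources' is an illustrative name.
def nrmSources = fileTree(dir: "src/data/utr30", include: "*.txt")
    .files
    .sort { it.name }   // constant, name-sorted order instead of FS order
```

With a deterministic input list, the task's up-to-date checks also become reliable, since the inputs no longer vary with directory-listing order.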
[GitHub] [lucene] dweiss opened a new pull request #61: LUCENE-9900: Regenerate/ run ICU only if inputs changed
dweiss opened a new pull request #61: URL: https://github.com/apache/lucene/pull/61
[jira] [Created] (LUCENE-9900) Regenerate/ run ICU only if inputs changed
Dawid Weiss created LUCENE-9900: --- Summary: Regenerate/ run ICU only if inputs changed Key: LUCENE-9900 URL: https://issues.apache.org/jira/browse/LUCENE-9900 Project: Lucene - Core Issue Type: Sub-task Reporter: Dawid Weiss Assignee: Dawid Weiss
[jira] [Resolved] (LUCENE-9872) Make the most painful tasks in regenerate fully incremental
[ https://issues.apache.org/jira/browse/LUCENE-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-9872. - Fix Version/s: main (9.0) Resolution: Fixed > Make the most painful tasks in regenerate fully incremental > --- > > Key: LUCENE-9872 > URL: https://issues.apache.org/jira/browse/LUCENE-9872 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: main (9.0) >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: main (9.0) > > Time Spent: 20m > Remaining Estimate: 0h > > This is particularly important for that one jflex task that is currently > a mood-killer.
[jira] [Commented] (LUCENE-9872) Make the most painful tasks in regenerate fully incremental
[ https://issues.apache.org/jira/browse/LUCENE-9872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313690#comment-17313690 ] ASF subversion and git services commented on LUCENE-9872: - Commit e3ae57a3c1ac59a1b56a36d965000f45abbc43e5 in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e3ae57a ] LUCENE-9872: Make the most painful tasks in regenerate fully incremental (#60) > Make the most painful tasks in regenerate fully incremental > --- > > Key: LUCENE-9872 > URL: https://issues.apache.org/jira/browse/LUCENE-9872 > Project: Lucene - Core > Issue Type: Sub-task >Affects Versions: main (9.0) >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > This is particularly important for that one jflex task that is currently > a mood-killer.
[GitHub] [lucene] dweiss merged pull request #60: LUCENE-9872: Make the most painful tasks in regenerate fully incremental
dweiss merged pull request #60: URL: https://github.com/apache/lucene/pull/60