[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method
[ https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852876#comment-16852876 ]

Luca Cavanna edited comment on LUCENE-8796 at 5/31/19 10:08 AM:

I updated the PR and addressed all the comments, here are the latest benchmark results (with bitset optimization disabled on both ends):

{noformat}
Report after iter 19:
Task                  QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff
MedTerm                    1510.74   (6.8%)                  1457.20   (8.4%)     -3.5%  ( -17% - 12%)
Fuzzy1                       70.49   (8.5%)                    68.11   (9.8%)     -3.4%  ( -19% - 16%)
OrHighNotMed                650.57   (5.8%)                   629.81   (6.0%)     -3.2%  ( -14% - 9%)
OrHighLow                   447.13   (4.2%)                   433.05   (4.5%)     -3.2%  ( -11% - 5%)
OrNotHighMed                623.22   (6.3%)                   605.19   (6.1%)     -2.9%  ( -14% - 10%)
OrHighNotLow                720.89   (7.0%)                   701.26   (7.9%)     -2.7%  ( -16% - 13%)
OrNotHighHigh               558.43   (6.3%)                   544.82   (4.9%)     -2.4%  ( -12% - 9%)
LowTerm                    1279.34   (4.9%)                  1248.60   (5.2%)     -2.4%  ( -11% - 8%)
AndHighLow                  690.75   (4.0%)                   675.22   (5.3%)     -2.2%  ( -11% - 7%)
LowPhrase                   358.90   (2.3%)                   351.28   (4.0%)     -2.1%  ( -8% - 4%)
PKLookup                    139.97   (3.0%)                   137.32   (3.5%)     -1.9%  ( -8% - 4%)
OrNotHighLow                728.48   (6.8%)                   714.79   (6.5%)     -1.9%  ( -14% - 12%)
HighTerm                   1222.38   (6.3%)                  1199.77   (7.1%)     -1.8%  ( -14% - 12%)
AndHighHigh                  58.93   (6.2%)                    58.01   (5.8%)     -1.6%  ( -12% - 11%)
Prefix3                     152.21   (4.5%)                   150.00   (5.0%)     -1.5%  ( -10% - 8%)
IntNRQConjMedTerm            79.15  (10.7%)                    78.06  (10.5%)     -1.4%  ( -20% - 22%)
HighTermDayOfYearSort        95.28   (5.1%)                    94.10   (7.8%)     -1.2%  ( -13% - 12%)
Wildcard                     64.23   (2.3%)                    63.45   (2.3%)     -1.2%  ( -5% - 3%)
MedSpanNear                  81.15   (2.2%)                    80.19   (2.8%)     -1.2%  ( -6% - 3%)
HighSpanNear                 10.20   (3.9%)                    10.08   (4.2%)     -1.2%  ( -8% - 7%)
HighIntervalsOrdered          4.07   (1.8%)                     4.03   (2.2%)     -1.1%  ( -4% - 2%)
LowSpanNear                  41.62   (3.1%)                    41.20   (3.6%)     -1.0%  ( -7% - 5%)
IntNRQConjLowTerm            20.36   (4.1%)                    20.15   (4.5%)     -1.0%  ( -9% - 7%)
IntNRQConjHighTerm           64.84   (9.6%)                    64.21   (9.4%)     -1.0%  ( -18% - 19%)
AndHighMed                  229.08   (2.8%)                   227.00   (2.5%)     -0.9%  ( -6% - 4%)
MedPhrase                    18.73   (1.5%)                    18.57   (2.3%)     -0.8%  ( -4% - 2%)
LowSloppyPhrase             124.52   (2.3%)                   123.48   (2.6%)     -0.8%  ( -5% - 4%)
Respell                      69.26   (3.0%)                    68.68   (2.9%)     -0.8%  ( -6% - 5%)
HighPhrase                   12.98   (1.6%)                    12.88   (2.2%)     -0.7%  ( -4% - 3%)
PrefixConjLowTerm            42.11   (2.6%)                    41.81   (3.0%)     -0.7%  ( -6% - 5%)
OrHighNotHigh               680.34   (6.1%)                   676.16   (7.6%)     -0.6%  ( -13% - 13%)
MedSloppyPhrase              34.06   (4.9%)                    33.89   (4.5%)     -0.5%  ( -9% - 9%)
IntNRQ                       89.97  (12.4%)                    89.62  (12.0%)     -0.4%  ( -22% - 27%)
HighSloppyPhrase              8.28   (4.0%)                     8.25   (3.9%)     -0.3%  ( -7% - 7%)
WildcardConjLowTerm          36.35   (2.7%)                    36.26   (2.7%)     -0.3%  ( -5% - 5%)
OrHighHigh                   27.89   (2.6%)                    27.85   (3.1%)     -0.1%  ( -5% - 5%)
Fuzzy2                       44.19   (3.8%)                    44.17   (3.1%)     -0.1%  ( -6% - 7%)
OrHighMed                    90.42   (2.8%)                    90.57   (2.8%)      0.2%  ( -5% - 6%)
PrefixConjMedTerm            45.56   (2.8%)                    45.79   (2.9%)      0.5%  ( -5% - 6%)
WildcardConjHighTerm         33.08   (2.6%)                    33.47   (3.0%)      1.2%  ( -4% - 6%)
PrefixConjHighTerm           83.65   (2.6%)                    86.23   (3.7%)      3.1%  ( -3% - 9%)
HighTermMonthSort           130.35  (15.8%)                   135.08  (12.1%)      3.6%  ( -20% - 37%)
WildcardConjMedTerm          99.19   (3.6%)                   103.37   (4.1%)      4.2%  ( -3% - 12%)
{noformat}

was (Author: lucacavanna):
I updated the PR and addressed all the comments, here are the latest benchmark results: {noformat} Report after iter 19: TaskQPS baseline StdDevQPS
[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method
[ https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836467#comment-16836467 ]

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:21 PM:
--

I have updated the PR after applying Yonik's suggestion and re-run benchmarks a few times. The run with the least noise had these results (note that I disabled the bitset optimization on both sides):

{noformat}
Report after iter 19:
Task                  QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff
HighTerm                   1575.07   (5.9%)                  1541.27   (6.9%)     -2.1%  ( -14% - 11%)
MedTerm                    1363.22   (6.5%)                  1337.03   (7.0%)     -1.9%  ( -14% - 12%)
LowTerm                    1441.86   (4.2%)                  1420.77   (5.2%)     -1.5%  ( -10% - 8%)
IntNRQConjMedTerm           280.55   (4.0%)                   277.64   (4.1%)     -1.0%  ( -8% - 7%)
MedPhrase                   153.84   (3.5%)                   152.44   (3.3%)     -0.9%  ( -7% - 6%)
Prefix3                     224.92   (4.0%)                   223.13   (3.7%)     -0.8%  ( -8% - 7%)
HighSloppyPhrase             19.70   (3.7%)                    19.56   (4.5%)     -0.7%  ( -8% - 7%)
MedSloppyPhrase              18.23   (4.3%)                    18.11   (4.7%)     -0.7%  ( -9% - 8%)
OrNotHighMed                586.33   (3.4%)                   582.47   (4.9%)     -0.7%  ( -8% - 7%)
LowSloppyPhrase              18.56   (3.6%)                    18.46   (3.9%)     -0.5%  ( -7% - 7%)
HighPhrase                   22.64   (2.7%)                    22.54   (3.0%)     -0.4%  ( -6% - 5%)
LowPhrase                   144.10   (3.8%)                   143.55   (3.3%)     -0.4%  ( -7% - 6%)
AndHighLow                  539.26   (3.7%)                   537.25   (3.2%)     -0.4%  ( -7% - 6%)
PKLookup                    132.96   (3.0%)                   132.48   (4.6%)     -0.4%  ( -7% - 7%)
OrHighMed                   115.79   (2.7%)                   115.49   (3.5%)     -0.3%  ( -6% - 6%)
PrefixConjHighTerm           36.98   (2.8%)                    36.93   (3.4%)     -0.1%  ( -6% - 6%)
WildcardConjHighTerm         45.79   (3.0%)                    45.73   (3.1%)     -0.1%  ( -6% - 6%)
OrHighLow                   448.91   (3.7%)                   448.70   (6.3%)     -0.0%  ( -9% - 10%)
Wildcard                     78.89   (3.2%)                    78.95   (3.6%)      0.1%  ( -6% - 7%)
IntNRQConjHighTerm           78.35   (2.3%)                    78.48   (2.4%)      0.2%  ( -4% - 4%)
IntNRQ                      100.56   (2.7%)                   100.84   (2.8%)      0.3%  ( -5% - 5%)
OrHighNotLow                732.45   (2.8%)                   734.56   (5.3%)      0.3%  ( -7% - 8%)
OrHighNotHigh               544.87   (2.8%)                   546.47   (4.6%)      0.3%  ( -6% - 7%)
IntNRQConjLowTerm           249.20   (4.2%)                   249.99   (3.8%)      0.3%  ( -7% - 8%)
Respell                      73.05   (3.1%)                    73.28   (3.4%)      0.3%  ( -6% - 7%)
OrHighHigh                   35.56   (3.0%)                    35.68   (4.2%)      0.3%  ( -6% - 7%)
OrNotHighLow                695.41   (4.8%)                   697.88   (6.5%)      0.4%  ( -10% - 12%)
MedSpanNear                  59.99   (3.8%)                    60.30   (4.0%)      0.5%  ( -7% - 8%)
AndHighMed                  190.02   (3.1%)                   191.04   (3.6%)      0.5%  ( -5% - 7%)
LowSpanNear                  12.73   (3.9%)                    12.81   (4.2%)      0.6%  ( -7% - 8%)
HighTermDayOfYearSort        88.42   (7.0%)                    89.09   (7.1%)      0.8%  ( -12% - 15%)
PrefixConjLowTerm            54.95   (3.7%)                    55.43   (3.8%)      0.9%  ( -6% - 8%)
OrHighNotMed                628.44   (3.4%)                   634.02   (6.1%)      0.9%  ( -8% - 10%)
HighSpanNear                 28.86   (3.2%)                    29.11   (3.5%)      0.9%  ( -5% - 7%)
WildcardConjMedTerm          72.48   (3.4%)                    73.19   (4.8%)      1.0%  ( -7% - 9%)
Fuzzy2                       49.17   (9.9%)                    49.68  (11.7%)      1.0%  ( -18% - 25%)
AndHighHigh                  63.44   (3.8%)                    64.11   (3.8%)      1.1%  ( -6% - 9%)
Fuzzy1                       79.43   (9.9%)                    80.55   (9.7%)      1.4%  ( -16% - 23%)
OrNotHighHigh               574.89   (3.6%)                   584.43   (5.5%)      1.7%  ( -7% - 11%)
PrefixConjMedTerm            79.00   (3.2%)                    80.50   (3.6%)      1.9%  ( -4% - 8%)
WildcardConjLowTerm          90.67   (2.9%)                    92.49   (3.7%)      2.0%  ( -4% - 8%)
HighTermMonthSort            86.13  (11.8%)                    88.79  (12.4%)      3.1%  ( -18% - 30%)
{noformat}

I also ran benchmarks with the bitset optimization in place on both ends:

{noformat}
Report after iter 19:
Task                  QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff
IntNRQ                       63.46  (24.6%)                    62.28  (24.2%)     -1.9%  ( -40% - 62%)
[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method
[ https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835542#comment-16835542 ]

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:22 PM:
--

I have made the change and used luceneutil to run some benchmarks. I opened a PR here: [https://github.com/apache/lucene-solr/pull/667] .

Luceneutil does not currently benchmark the queries that should be affected by this change, hence I added benchmarks for numeric range queries, prefix queries and wildcard queries in conjunction with term queries (low, medium and high frequency). See the changes I made to my luceneutil fork: [https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions] .

Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to never call upgradeToBitSet (on both the baseline and the modified version), so that the updated code is exercised as much as possible during the benchmark runs. Otherwise, in many cases we would use bitsets instead and the changed code would not be exercised at all.

I ran the wikimedium10m benchmarks a few times, here is probably the run with the least noise, results show a little improvement for some queries, and no regressions in general:

{noformat}
Report after iter 19:
Task                  QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff
WildcardConjMedTerm          75.49   (2.2%)                    72.79   (2.0%)     -3.6%  ( -7% - 0%)
OrHighNotMed                607.01   (5.7%)                   593.10   (4.4%)     -2.3%  ( -11% - 8%)
WildcardConjHighTerm         64.00   (1.7%)                    62.55   (1.4%)     -2.3%  ( -5% - 0%)
Fuzzy2                       20.14   (3.4%)                    19.72   (4.6%)     -2.1%  ( -9% - 6%)
HighTerm                   1174.41   (4.7%)                  1150.11   (4.2%)     -2.1%  ( -10% - 7%)
OrHighLow                   483.40   (5.1%)                   473.69   (6.9%)     -2.0%  ( -13% - 10%)
OrNotHighLow                526.75   (3.6%)                   516.47   (3.6%)     -2.0%  ( -8% - 5%)
OrNotHighHigh               600.38   (4.9%)                   590.21   (3.7%)     -1.7%  ( -9% - 7%)
HighTermMonthSort           110.05  (11.7%)                   108.58  (11.5%)     -1.3%  ( -21% - 24%)
OrHighMed                   107.83   (2.6%)                   106.48   (4.7%)     -1.3%  ( -8% - 6%)
PrefixConjMedTerm            56.98   (2.5%)                    56.33   (1.7%)     -1.1%  ( -5% - 3%)
AndHighLow                  432.27   (3.6%)                   427.46   (3.2%)     -1.1%  ( -7% - 5%)
PrefixConjLowTerm            44.43   (2.8%)                    43.98   (1.8%)     -1.0%  ( -5% - 3%)
MedTerm                    1409.97   (5.5%)                  1396.33   (4.9%)     -1.0%  ( -10% - 9%)
HighSloppyPhrase             11.98   (4.3%)                    11.87   (5.1%)     -0.9%  ( -9% - 8%)
OrNotHighMed                614.19   (4.6%)                   608.74   (3.8%)     -0.9%  ( -8% - 7%)
Respell                      58.11   (2.4%)                    57.61   (2.4%)     -0.9%  ( -5% - 3%)
LowTerm                    1342.33   (4.8%)                  1330.86   (4.0%)     -0.9%  ( -9% - 8%)
PrefixConjHighTerm           68.50   (2.9%)                    67.93   (1.8%)     -0.8%  ( -5% - 3%)
OrHighNotHigh               566.30   (5.2%)                   561.88   (4.5%)     -0.8%  ( -9% - 9%)
WildcardConjLowTerm          32.75   (2.5%)                    32.56   (2.1%)     -0.6%  ( -5% - 4%)
PKLookup                    131.80   (2.4%)                   131.28   (2.3%)     -0.4%  ( -5% - 4%)
OrHighHigh                   29.90   (3.4%)                    29.79   (5.3%)     -0.4%  ( -8% - 8%)
OrHighNotLow                497.65   (6.6%)                   495.84   (5.2%)     -0.4%  ( -11% - 12%)
AndHighMed                  175.08   (3.5%)                   174.58   (3.0%)     -0.3%  ( -6% - 6%)
LowSpanNear                  15.17   (1.8%)                    15.13   (2.5%)     -0.2%  ( -4% - 4%)
Fuzzy1                       71.14   (5.9%)                    70.97   (6.3%)     -0.2%  ( -11% - 12%)
LowSloppyPhrase              35.23   (2.0%)                    35.16   (2.6%)     -0.2%  ( -4% - 4%)
LowPhrase                    74.10   (1.7%)                    73.98   (1.8%)     -0.2%  ( -3% - 3%)
HighPhrase                   34.18   (2.1%)                    34.13   (2.0%)     -0.1%  ( -4% - 3%)
Prefix3                      45.33   (2.3%)                    45.28   (2.1%)     -0.1%  ( -4% - 4%)
MedPhrase                    28.30   (2.1%)                    28.27   (1.7%)     -0.1%  ( -3% - 3%)
MedSloppyPhrase               6.80   (3.6%)                     6.80   (3.2%)     -0.0%  ( -6% - 6%)
AndHighHigh                  53.79   (3.9%)                    53.79   (4.0%)     -0.0%  ( -7% - 8%)
MedSpanNear                  61.78   (2.2%)                    61.83   (1.7%)      0.1%  ( -3% - 4%)
Wildcard                     37.83   (2.5%)                    37.91   (1.7%)      0.2%  ( -3% - 4%)
IntNRQConjHighTerm           20.17   (3.8%)                    20.24   (4.9%)      0.3%  ( -8% - 9%)
HighTermDayOfYearSort        53.55   (7.8%)                    53.76   (7.3%)      0.4%  ( -13% - 16%)
HighSpanNear                  5.39   (2.6%)                     5.42   (2.6%)      0.5%  ( -4% - 5%)
IntNRQConjLowTerm            19.69   (4.3%)                    19.86   (4.3%)      0.9%  ( -7% - 9%)
IntNRQConjMedTerm            15.93   (4.5%)                    16.12   (5.4%)      1.2%  ( -8% - 11%)
IntNRQ                      114.28  (10.3%)                   116.41  (14.0%)      1.9%  ( -20% - 29%)
{noformat}

was (Author: lucacavanna):
I have made the change and played with luceneutil to run some benchmark. I opened a PR here: [https://github.com/apache/lucene-solr/pull/667] . Luceneutil does not currently benchmark the queries that should be affected by this change, hence I added benchmarks for numeric range queries, prefix queries and wildcard queries in conjunction with term queries (low, medium and high frequency). See the changes I made to my luceneutil fork: [https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions] . Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to never call upgradeToBitSet (on both baseline and modified version), so that the updated code is exercised as much as possible during the benchmarks run, otherwise in many cases we would use bitsets instead and the changed code would not be exercised at all. I ran the wikimedium10m benchmarks a few times, here is probably the run with the least noise, results show a little improvement for some queries, and no regressions in
[jira] [Commented] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method
[ https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835628#comment-16835628 ]

Luca Cavanna commented on LUCENE-8796:
--

You are right [~ysee...@gmail.com], I will make that change and re-run the benchmarks.

> Use exponential search in IntArrayDocIdSet advance method
> -
>
>                 Key: LUCENE-8796
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8796
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Luca Cavanna
>            Priority: Minor
>
> In a chat, [~jpountz] suggested improving IntArrayDocIdSet by making
> its advance method use exponential search instead of binary search. This
> should help performance of queries including conjunctions: given that
> ConjunctionDISI uses leap frog, it advances through doc ids in small steps,
> hence exponential search should, on average, advance faster than
> binary search.
>

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
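To illustrate the "leap frog" behaviour mentioned in the issue description, here is a minimal sketch of intersecting two sorted doc id arrays. It is not Lucene's ConjunctionDISI code: the class name `LeapFrogSketch` and the methods `advance` and `intersect` are made up for this example, and `advance` is deliberately a plain linear scan so that the small-step pattern is easy to see.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from Lucene.
class LeapFrogSketch {

    // Returns the first index >= from whose value is >= target,
    // or docs.length if there is none. Assumes docs is sorted ascending.
    static int advance(int[] docs, int from, int target) {
        int i = from;
        while (i < docs.length && docs[i] < target) {
            i++;
        }
        return i;
    }

    // Leap-frog intersection: whichever side is behind advances to the
    // other side's current doc id, until both sides agree or one runs out.
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> result = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                result.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i = advance(a, i, b[j]);
            } else {
                j = advance(b, j, a[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7, 9, 11};
        int[] b = {3, 4, 9, 10, 11, 20};
        System.out.println(intersect(a, b)); // prints [3, 9, 11]
    }
}
```

Each call to `advance` starts from the current position and usually lands only a few slots further on, which is the access pattern the issue description argues should favour exponential search over a binary search across the whole remaining array.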
[jira] [Created] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method
Luca Cavanna created LUCENE-8796:

             Summary: Use exponential search in IntArrayDocIdSet advance method
                 Key: LUCENE-8796
                 URL: https://issues.apache.org/jira/browse/LUCENE-8796
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Luca Cavanna

In a chat, [~jpountz] suggested improving IntArrayDocIdSet by making its advance method use exponential search instead of binary search. This should help performance of queries including conjunctions: given that ConjunctionDISI uses leap frog, it advances through doc ids in small steps, hence exponential search should, on average, advance faster than binary search.
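For readers unfamiliar with the technique, the following is a minimal sketch of an exponential (galloping) search over a sorted int array. It is not the code from the PR: the class name `ExponentialAdvanceSketch`, the method `advanceIndex` and its parameters are assumptions made for this example. It also assumes strictly increasing values, as doc ids are, and ignores int overflow of the probe distance on arrays with more than about a billion entries.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the IntArrayDocIdSet implementation.
class ExponentialAdvanceSketch {

    // Returns the index of the first element >= target in docs[from, length),
    // or length if every remaining element is smaller than target.
    // Requires 0 <= from <= length <= docs.length and ascending values.
    static int advanceIndex(int[] docs, int from, int length, int target) {
        // Gallop: probe at distance 1, 2, 4, 8, ... from the current position
        // until the probe is out of range or no longer smaller than target.
        int bound = 1;
        while (bound < length - from && docs[from + bound] < target) {
            bound *= 2;
        }
        // The answer now lies between the last probe that was too small
        // (from + bound / 2) and the probe that stopped the loop (from + bound).
        int lo = from + bound / 2;
        int hi = Math.min(from + bound + 1, length);
        int idx = Arrays.binarySearch(docs, lo, hi, target);
        // binarySearch returns -(insertionPoint) - 1 when target is absent.
        return idx >= 0 ? idx : -1 - idx;
    }

    public static void main(String[] args) {
        int[] docs = {2, 5, 9, 14, 20, 31, 47};
        System.out.println(advanceIndex(docs, 0, docs.length, 14));  // prints 3
        System.out.println(advanceIndex(docs, 3, docs.length, 15));  // prints 4
        System.out.println(advanceIndex(docs, 0, docs.length, 100)); // prints 7
    }
}
```

When the target is close to the current position, the gallop stops after one or two probes and the final binary search covers only a handful of elements, whereas a plain binary search always costs on the order of log2 of the remaining length.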
[jira] [Resolved] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter
[ https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Cavanna resolved LUCENE-5718.
----------------------------------
    Resolution: Duplicate

This will be addressed with LUCENE-3401, which is being worked on.

> More flexible compound queries (containing mtq) support in postings highlighter
> -------------------------------------------------------------------------------
>
>                 Key: LUCENE-5718
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5718
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>    Affects Versions: 4.8.1
>            Reporter: Luca Cavanna
>            Priority: Major
>         Attachments: LUCENE-5718.patch
>
> The postings highlighter currently pulls the automata from multi term queries
> and doesn't require calling rewrite to make highlighting work. In order to do
> so it also needs to check whether the query is a compound one and, if so,
> extract its subqueries. This is currently done in the MultiTermHighlighting
> class and works well but has two potential problems:
> 1) not all the possible compound queries are necessarily supported, as we need
> to go over each of them one by one (see LUCENE-5717), and this requires
> keeping the "switch" up-to-date if new queries get added to Lucene
> 2) it doesn't support custom compound queries, only the set of queries
> available out-of-the-box
>
> I've been thinking about how this can be improved, and one of the ideas I came
> up with is to introduce a generic way to retrieve the subqueries from
> compound queries, for instance a new abstract base class with a getLeaves or
> getSubQueries method that all the compound queries extend. This method would
> return a flat array of all the leaf queries that the compound query is made of.
>
> Not sure whether this would be needed in other places in Lucene, but it
> doesn't seem like a small change and it would definitely affect (or benefit?)
> more than just the postings highlighter support for multi term queries.
> In particular the second problem (custom queries) seems hard to solve without
> a way to expose this info directly from the query, unless we want to make the
> MultiTermHighlighting#extractAutomata method extensible in some way.
>
> I would like to hear what people think and will work on this as soon as we
> have identified which direction we want to take.
[jira] [Commented] (LUCENE-8664) Add equals/hashcode to TotalHits
[ https://issues.apache.org/jira/browse/LUCENE-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755016#comment-16755016 ]

Luca Cavanna commented on LUCENE-8664:
--------------------------------------

I am not using TotalHits in a map. I would benefit from the equals method for
comparisons in tests. For instance, in Elasticsearch we return the Lucene
TotalHits to users as part of bigger objects that have their own equals method.
We end up wrapping TotalHits into another internal class that has its own
equals/hashCode (among others). Having equals/hashCode built into Lucene would
remove the need for a wrapper class and would make equality comparisons a
one-liner, especially when comparing multiple instances of objects holding
TotalHits. This is obviously a minor thing, but I would not expect it to be a
bug to treat two different TotalHits instances that have the same value and
relation as equal. I was chatting with [~jim.ferenczi] about this and we
thought we should propose adding it to Lucene. Happy to close this if you think
it should not be done.

> Add equals/hashcode to TotalHits
> --------------------------------
>
>                 Key: LUCENE-8664
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8664
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Luca Cavanna
>            Priority: Minor
>
> I think it would be convenient to add equals/hashcode methods to the
> TotalHits class. I opened a PR here:
> [https://github.com/apache/lucene-solr/pull/552] .
[jira] [Created] (LUCENE-8664) Add equals/hashcode to TotalHits
Luca Cavanna created LUCENE-8664:
------------------------------------

             Summary: Add equals/hashcode to TotalHits
                 Key: LUCENE-8664
                 URL: https://issues.apache.org/jira/browse/LUCENE-8664
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Luca Cavanna

I think it would be convenient to add equals/hashcode methods to the
TotalHits class. I opened a PR here:
[https://github.com/apache/lucene-solr/pull/552] .
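For readers who want to picture what the request amounts to, the following is a minimal sketch of value-based equality on a hits-count holder. `TotalHitsSketch` is a hypothetical stand-in written for this example, not the Lucene class and not the code in the PR; the two fields mirror the "value and relation" mentioned in the discussion.

```java
import java.util.Objects;

/** Hypothetical stand-in for a hit count plus its relation; not Lucene's TotalHits. */
final class TotalHitsSketch {

    enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

    final long value;
    final Relation relation;

    TotalHitsSketch(long value, Relation relation) {
        this.value = value;
        this.relation = Objects.requireNonNull(relation);
    }

    /** Two instances are equal when both the count and the relation match. */
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        TotalHitsSketch that = (TotalHitsSketch) o;
        return value == that.value && relation == that.relation;
    }

    /** Kept consistent with equals: derived from the same two fields. */
    @Override
    public int hashCode() {
        return Objects.hash(value, relation);
    }
}
```

With something like this in place, a test can compare two holder objects directly with `assertEquals` instead of unwrapping and comparing each field.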
[jira] [Created] (LUCENE-8591) LegacyBM25Similarity doesn't expose getDiscountOverlaps
Luca Cavanna created LUCENE-8591:
------------------------------------

             Summary: LegacyBM25Similarity doesn't expose getDiscountOverlaps
                 Key: LUCENE-8591
                 URL: https://issues.apache.org/jira/browse/LUCENE-8591
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Luca Cavanna
            Assignee: Luca Cavanna

When I worked on LUCENE-8563 I intended to expose all the needed public
methods that BM25Similarity exposes, but I forgot to add getDiscountOverlaps.
[jira] [Commented] (LUCENE-8591) LegacyBM25Similarity doesn't expose getDiscountOverlaps
[ https://issues.apache.org/jira/browse/LUCENE-8591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710228#comment-16710228 ]

Luca Cavanna commented on LUCENE-8591:
--------------------------------------

I just opened [https://github.com/apache/lucene-solr/pull/514]

> LegacyBM25Similarity doesn't expose getDiscountOverlaps
> -------------------------------------------------------
>
>                 Key: LUCENE-8591
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8591
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Luca Cavanna
>            Assignee: Luca Cavanna
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703382#comment-16703382 ]

Luca Cavanna commented on LUCENE-8563:
--------------------------------------

I updated the PR according to the latest comments, and deprecated the newly
introduced similarity as Robert suggested.

> Remove k1+1 from the numerator of BM25Similarity
> -------------------------------------------------
>
>                 Key: LUCENE-8563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify
> ordering. It is often omitted, and I found out that the paper "The
> Probabilistic Relevance Framework: BM25 and Beyond" by Robertson (BM25's
> author) and Zaragoza even describes adding (k1+1) to the numerator as a
> variant whose benefit is to be more comparable with Robertson/Sparck-Jones
> weighting, which we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
> numerator of the saturation function. This is the same for all
> terms, and therefore does not affect the ranking produced.
> The reason for including it was to make the final formula
> more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score
> contributions (e.g. via oal.document.FeatureField) would be a bit easier to
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%)
> rather than a term whose IDF is 3/(k1 + 1).
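The claim that the (k1+1) factor does not change ordering can be checked with a few lines of arithmetic. The sketch below evaluates the formula quoted in the issue with and without the factor; the class and method names are made up for this example, `norm` is simply treated as a given number as in the quoted formula, and the input values are arbitrary.

```java
/** Illustrative arithmetic only; not BM25Similarity. */
final class Bm25ConstantFactorSketch {

    /** The formula as quoted in the issue: boost * IDF * (k1+1) * tf / (tf + norm). */
    static double withFactor(double boost, double idf, double k1, double tf, double norm) {
        return boost * idf * (k1 + 1) * tf / (tf + norm);
    }

    /** The same formula with the constant (k1+1) removed from the numerator. */
    static double withoutFactor(double boost, double idf, double tf, double norm) {
        return boost * idf * tf / (tf + norm);
    }

    public static void main(String[] args) {
        double k1 = 1.2; // arbitrary value chosen for the example
        double[][] docs = {{2, 1.5}, {5, 1.5}, {1, 0.8}}; // {tf, norm} pairs, made up
        for (double[] d : docs) {
            double with = withFactor(1.0, 3.0, k1, d[0], d[1]);
            double without = withoutFactor(1.0, 3.0, d[0], d[1]);
            // The ratio is the same constant for every document, so ranking is unchanged.
            System.out.println(with + " " + without + " ratio=" + with / without);
        }
    }
}
```

Every score is scaled by the same constant, so only the absolute values change; this is also why a FeatureField weight becomes directly comparable to an IDF once the factor is dropped.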
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702294#comment-16702294 ]

Luca Cavanna commented on LUCENE-8563:
--------------------------------------

I opened [https://github.com/apache/lucene-solr/pull/511] .

> Remove k1+1 from the numerator of BM25Similarity
> -------------------------------------------------
>
>                 Key: LUCENE-8563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity
[ https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686370#comment-16686370 ]

Luca Cavanna commented on LUCENE-8563:
--------------------------------------

Hi folks, I would like to work on this issue.

> Remove k1+1 from the numerator of BM25Similarity
> -------------------------------------------------
>
>                 Key: LUCENE-8563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
[jira] [Created] (LUCENE-6485) Add a custom separator break iterator
Luca Cavanna created LUCENE-6485:
------------------------------------

             Summary: Add a custom separator break iterator
                 Key: LUCENE-6485
                 URL: https://issues.apache.org/jira/browse/LUCENE-6485
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Luca Cavanna

Lucene currently includes a WholeBreakIterator used to highlight entire fields
using the postings highlighter, without breaking their content into sentences.
I would like to contribute a CustomSeparatorBreakIterator that breaks when a
custom char separator is found in the text. This can be used for instance when
wanting to highlight entire fields, value per value. One can subclass
PostingsHighlighter and have getMultiValueSeparator return a control character,
like U+ , then use the custom break iterator to break on U+ so that one snippet
per value will be generated.
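The splitting behaviour described above can be pictured with a much simpler helper than a full `java.text.BreakIterator` subclass. The sketch below only computes the break offsets for a given separator character; the class name, the method, and the `'|'` separator used in the example are all assumptions for illustration and are not taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified illustration; the real proposal is a java.text.BreakIterator subclass. */
final class SeparatorBoundariesSketch {

    /**
     * Returns the boundary offsets of text when breaking right after each
     * occurrence of separator. The list always starts at 0 and ends at
     * text.length().
     */
    static List<Integer> boundaries(String text, char separator) {
        List<Integer> result = new ArrayList<>();
        result.add(0);
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == separator) {
                result.add(i + 1); // the break falls just after the separator
            }
        }
        // Close the last segment unless the text already ended on a separator.
        if (result.get(result.size() - 1) != text.length()) {
            result.add(text.length());
        }
        return result;
    }

    public static void main(String[] args) {
        // '|' stands in for whatever separator the highlighter is configured with.
        System.out.println(boundaries("red|green|blue", '|'));
    }
}
```

Each pair of consecutive offsets delimits one field value, which is what lets the highlighter produce one snippet per value.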
[jira] [Updated] (LUCENE-6485) Add a custom separator break iterator
[ https://issues.apache.org/jira/browse/LUCENE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Cavanna updated LUCENE-6485:
---------------------------------
    Attachment: LUCENE-6485.patch

Patch attached

Add a custom separator break iterator
-------------------------------------

                Key: LUCENE-6485
                URL: https://issues.apache.org/jira/browse/LUCENE-6485
            Project: Lucene - Core
         Issue Type: Improvement
           Reporter: Luca Cavanna
        Attachments: LUCENE-6485.patch
[jira] [Commented] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter
[ https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546095#comment-14546095 ]

Luca Cavanna commented on LUCENE-5718:
--------------------------------------

Was wondering if there is interest around this patch, or maybe there are better
solutions by now for the custom compound queries usecase? I can revive the
patch if needed, lemme know what you think.

More flexible compound queries (containing mtq) support in postings highlighter
-------------------------------------------------------------------------------

                Key: LUCENE-5718
                URL: https://issues.apache.org/jira/browse/LUCENE-5718
            Project: Lucene - Core
         Issue Type: Improvement
         Components: modules/highlighter
   Affects Versions: 4.8.1
           Reporter: Luca Cavanna
        Attachments: LUCENE-5718.patch
[jira] [Updated] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter
[ https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Cavanna updated LUCENE-5718:
---------------------------------
    Attachment: LUCENE-5718.patch

I like your extractQueries idea. I gave it a shot, patch attached. The main
difference compared to extractTerms is that it adds the query itself to the
list by default instead of throwing UnsupportedOperationException. Also, I
think this one doesn't necessarily require calling rewrite (not totally sure
though). I overrode the extractQueries method for all the queries that contain
one or more sub-queries, let's see if that's too many or if I missed any...you
tell me ;)

More flexible compound queries (containing mtq) support in postings highlighter
-------------------------------------------------------------------------------

                Key: LUCENE-5718
                URL: https://issues.apache.org/jira/browse/LUCENE-5718
            Project: Lucene - Core
         Issue Type: Improvement
         Components: modules/highlighter
   Affects Versions: 4.8.1
           Reporter: Luca Cavanna
        Attachments: LUCENE-5718.patch
[jira] [Commented] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter
[ https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014994#comment-14014994 ]

Luca Cavanna commented on LUCENE-5718:
--------------------------------------

Also worth mentioning that my patch addresses only the compound queries
usecase. It leaves the automaton related work for the different multi term
queries as it is (in MultiTermHighlighting).

More flexible compound queries (containing mtq) support in postings highlighter
-------------------------------------------------------------------------------

                Key: LUCENE-5718
                URL: https://issues.apache.org/jira/browse/LUCENE-5718
            Project: Lucene - Core
         Issue Type: Improvement
         Components: modules/highlighter
   Affects Versions: 4.8.1
           Reporter: Luca Cavanna
        Attachments: LUCENE-5718.patch
[jira] [Created] (LUCENE-5717) Postings highlighter support for multi term queries within filtered and constant score queries
Luca Cavanna created LUCENE-5717:
------------------------------------

             Summary: Postings highlighter support for multi term queries within filtered and constant score queries
                 Key: LUCENE-5717
                 URL: https://issues.apache.org/jira/browse/LUCENE-5717
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/highlighter
    Affects Versions: 4.8.1
            Reporter: Luca Cavanna

The automata extraction that is done to make multi term queries work with the
postings highlighter does support boolean queries but it should also support
other compound queries like filtered and constant score.
[jira] [Updated] (LUCENE-5717) Postings highlighter support for multi term queries within filtered and constant score queries
[ https://issues.apache.org/jira/browse/LUCENE-5717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Cavanna updated LUCENE-5717:
---------------------------------
    Attachment: LUCENE-5717.patch

First patch attached. At this time there's no generic way to retrieve
sub-queries from compound queries, so I could only add two more ifs to the
existing extractAutomata method. Maybe it's worth discussing in a separate
issue whether there's a way to make this more generic. Also not sure if there
are other compound queries that I missed.

Postings highlighter support for multi term queries within filtered and constant score queries
----------------------------------------------------------------------------------------------

                Key: LUCENE-5717
                URL: https://issues.apache.org/jira/browse/LUCENE-5717
            Project: Lucene - Core
         Issue Type: Improvement
         Components: modules/highlighter
   Affects Versions: 4.8.1
           Reporter: Luca Cavanna
        Attachments: LUCENE-5717.patch
[jira] [Created] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter
Luca Cavanna created LUCENE-5718:
------------------------------------

             Summary: More flexible compound queries (containing mtq) support in postings highlighter
                 Key: LUCENE-5718
                 URL: https://issues.apache.org/jira/browse/LUCENE-5718
             Project: Lucene - Core
          Issue Type: Improvement
          Components: modules/highlighter
    Affects Versions: 4.8.1
            Reporter: Luca Cavanna

The postings highlighter currently pulls the automata from multi term queries
and doesn't require calling rewrite to make highlighting work. In order to do
so it also needs to check whether the query is a compound one and, if so,
extract its subqueries. This is currently done in the MultiTermHighlighting
class and works well but has two potential problems:
1) not all the possible compound queries are necessarily supported, as we need
to go over each of them one by one (see LUCENE-5717), and this requires
keeping the "switch" up-to-date if new queries get added to Lucene
2) it doesn't support custom compound queries, only the set of queries
available out-of-the-box

I've been thinking about how this can be improved, and one of the ideas I came
up with is to introduce a generic way to retrieve the subqueries from
compound queries, for instance a new abstract base class with a getLeaves or
getSubQueries method that all the compound queries extend. This method would
return a flat array of all the leaf queries that the compound query is made of.

Not sure whether this would be needed in other places in Lucene, but it
doesn't seem like a small change and it would definitely affect (or benefit?)
more than just the postings highlighter support for multi term queries.
In particular the second problem (custom queries) seems hard to solve without
a way to expose this info directly from the query, unless we want to make the
MultiTermHighlighting#extractAutomata method extensible in some way.

I would like to hear what people think and will work on this as soon as we
have identified which direction we want to take.
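The "flat array of all the leaf queries" idea from this issue can be sketched independently of Lucene. In the example below, `Node`, `Leaf`, `Compound`, and `subQueries()` are hypothetical names invented for illustration; they do not correspond to Lucene's Query classes or to the attached patch, which may be structured differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of compound queries; none of these types exist in Lucene. */
final class LeafQueriesSketch {

    /** A query that can report its direct sub-queries (empty for a leaf). */
    interface Node {
        List<Node> subQueries();
    }

    static final class Leaf implements Node {
        final String name;
        Leaf(String name) { this.name = name; }
        @Override public List<Node> subQueries() { return Collections.emptyList(); }
        @Override public String toString() { return name; }
    }

    static final class Compound implements Node {
        final List<Node> children;
        Compound(Node... children) { this.children = Arrays.asList(children); }
        @Override public List<Node> subQueries() { return children; }
    }

    /** Flattens an arbitrarily nested query into the list of its leaves, in order. */
    static List<Node> leaves(Node root) {
        List<Node> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node node, List<Node> out) {
        List<Node> subs = node.subQueries();
        if (subs.isEmpty()) {
            out.add(node); // a node without sub-queries counts as a leaf here
            return;
        }
        for (Node sub : subs) {
            collect(sub, out);
        }
    }
}
```

Because the traversal relies only on the `subQueries()` contract, a custom compound query would be covered as soon as it implements that one method, which is the extensibility point the issue is asking for.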
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765400#comment-13765400 ]

Luca Cavanna commented on LUCENE-4906:
--------------------------------------

How about committing this? Would be great to have it with the next release!

PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
--------------------------------------------------------------------------------------

                Key: LUCENE-4906
                URL: https://issues.apache.org/jira/browse/LUCENE-4906
            Project: Lucene - Core
         Issue Type: Improvement
           Reporter: Michael McCandless
        Attachments: LUCENE-4906.patch, LUCENE-4906.patch

For example, in a server, I may want to render the highlight result to
JsonObject to send back to the front-end. Today since we render to string, I
have to render to JSON string and then re-parse to JsonObject, which is
inefficient...

Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls
terms from snippets instead, so you get proximity-influenced salient/expanded
terms, then perhaps that renders to just an array of tokens or fragments or
something from each snippet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
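One way to picture "rendering to arbitrary objects" is a formatter whose result type is a parameter rather than being fixed to String. The sketch below is an assumption-laden illustration: `Span`, `Formatter<T>`, and the two sample formatters are invented here, and using a generic type parameter is a choice made for this example, not necessarily how the attached patch changes PassageFormatter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; these types are not the postings highlighter API. */
final class PassageRenderingSketch {

    /** Stand-in for a highlighted passage: [start, end) offsets into the content. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** A formatter parameterized by its result type instead of always returning String. */
    interface Formatter<T> {
        T format(List<Span> spans, String content);
    }

    /** Classic behaviour: join the passages into a single string. */
    static final Formatter<String> AS_STRING = (spans, content) -> {
        StringBuilder sb = new StringBuilder();
        for (Span s : spans) {
            if (sb.length() > 0) {
                sb.append("... ");
            }
            sb.append(content, s.start, s.end);
        }
        return sb.toString();
    };

    /** Structured behaviour: return the fragments as a list, with no string re-parsing needed. */
    static final Formatter<List<String>> AS_LIST = (spans, content) -> {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            out.add(content.substring(s.start, s.end));
        }
        return out;
    };
}
```

A server could plug in a formatter that builds its JSON object model directly, which avoids the render-then-re-parse round trip described in the issue.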
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13765726#comment-13765726 ]

Luca Cavanna commented on LUCENE-4906:
--------------------------------------

Thanks Mike!

PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
--------------------------------------------------------------------------------------

                Key: LUCENE-4906
                URL: https://issues.apache.org/jira/browse/LUCENE-4906
            Project: Lucene - Core
         Issue Type: Improvement
           Reporter: Michael McCandless
           Assignee: Michael McCandless
            Fix For: 5.0, 4.6
        Attachments: LUCENE-4906.patch, LUCENE-4906.patch, LUCENE-4906.patch
[jira] [Commented] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13755677#comment-13755677 ]

Luca Cavanna commented on LUCENE-5057:

Hi Lukas, can you share your findings? In my case it seemed to be a dictionary problem, but I'm curious to hear what you experienced.

Hunspell stemmer generates multiple tokens

Key: LUCENE-5057
URL: https://issues.apache.org/jira/browse/LUCENE-5057
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna
Assignee: Adrien Grand

The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases, but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works, only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works.

Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming, and a boolean query is generated out of it. All is fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originating from exactly the same original word.

Example for the Dutch language I'm working with: fiets (means bicycle in Dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get only fiets indexed. When I query for "fietsen whatever" I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I also wonder if it can be a dictionary issue, since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue.

I would love to contribute on this and I'm looking forward to your feedback.
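The AND/OR behaviour described in this issue can be reproduced without Lucene using plain token sets. The sketch below is only an illustration: `analyze` is a hypothetical stand-in for the analysis chain, hard-coded so that "fietsen" yields the original token plus the stem "fiets", as reported above. It is not the hunspell filter itself.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class HunspellAndOrSketch {
    // Hypothetical analysis step (an assumption for illustration):
    // "fietsen" produces the original token plus its stem, other words only themselves.
    static Set<String> analyze(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            tokens.add(word);
            if (word.equals("fietsen")) {
                tokens.add("fiets");
            }
        }
        return tokens;
    }

    // SHOULD clauses (OR): at least one query term occurs in the document.
    static boolean matchesAny(Set<String> docTokens, Set<String> queryTerms) {
        return !Collections.disjoint(docTokens, queryTerms);
    }

    // MUST clauses (AND): every query term occurs in the document.
    static boolean matchesAll(Set<String> docTokens, Set<String> queryTerms) {
        return docTokens.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        Set<String> query = analyze("fietsen whatever");    // fietsen, fiets, whatever
        Set<String> pluralDoc = analyze("fietsen whatever");
        Set<String> singularDoc = analyze("fiets whatever"); // fiets, whatever

        System.out.println("OR  plural=" + matchesAny(pluralDoc, query)
                + " singular=" + matchesAny(singularDoc, query));
        System.out.println("AND plural=" + matchesAll(pluralDoc, query)
                + " singular=" + matchesAll(singularDoc, query));
    }
}
```

With must clauses the document that contained only "fiets" is missed because it never indexed the term "fietsen", which is the asymmetry the report complains about.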
[jira] [Commented] (LUCENE-5181) Passage knows its own docID
[ https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13747493#comment-13747493 ]

Luca Cavanna commented on LUCENE-5181:

True, having the doc id would be useful there. Why not add it directly to the Passage, to be able to know which document the Passage comes from?

Passage knows its own docID

Key: LUCENE-5181
URL: https://issues.apache.org/jira/browse/LUCENE-5181
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.4
Reporter: Jon Stewart
Priority: Minor

The new PostingsHighlight package allows for retrieval of term matches from a query if one creates a class that extends PassageFormatter and overrides format(). However, class Passage does not have a docID field, nor is this provided via PassageFormatter.format(). Therefore, it's very difficult to know which Document contains a given Passage. It would suffice for PassageFormatter.format() to be passed the docID as a parameter. From the code in PostingsHighlight, this seems like it would be easy.
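As a rough illustration of the suggestion above, the sketch below shows a passage object that carries its own document id, so a formatter can report where each passage came from. `SketchPassage`, `SketchPassageFormatter` and `DocIdTaggingFormatter` are hypothetical, simplified stand-ins written for this example; they are not the Lucene classes and do not reflect their actual fields or signatures.

```java
// Hypothetical, simplified stand-ins for illustration only (not the Lucene API).
final class SketchPassage {
    final int docId;        // the field this issue proposes to make available
    final int startOffset;  // start of the passage within the field content
    final int endOffset;    // end (exclusive) of the passage within the field content

    SketchPassage(int docId, int startOffset, int endOffset) {
        this.docId = docId;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

abstract class SketchPassageFormatter {
    abstract String format(SketchPassage[] passages, String content);
}

// A formatter that prefixes every passage with the id of its source document.
class DocIdTaggingFormatter extends SketchPassageFormatter {
    @Override
    String format(SketchPassage[] passages, String content) {
        StringBuilder sb = new StringBuilder();
        for (SketchPassage p : passages) {
            sb.append("doc=").append(p.docId).append(": ")
              .append(content, p.startOffset, p.endOffset)
              .append('\n');
        }
        return sb.toString();
    }
}
```

The alternative mentioned in the issue, passing the docID as an extra parameter to format(), would reach the same goal without changing the passage object.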
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13740107#comment-13740107 ]

Luca Cavanna commented on LUCENE-4906:

No problem, thanks Mike!

PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

Key: LUCENE-4906
URL: https://issues.apache.org/jira/browse/LUCENE-4906
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4906.patch, LUCENE-4906.patch

For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient...

Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet.
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736240#comment-13736240 ]

Luca Cavanna commented on LUCENE-4906:

Hi Mike,
I definitely agree that the highlighting API should be simple, and the postings highlighter is probably the only one that's really easy to use. On the other hand, I think it's good to make explicit that if you use a Formatter<YourObject>, YourObject is what you're going to get back from the highlighter. People using the string version wouldn't notice the change, while advanced users would have to extend the base class and would get type safety too, which in my opinion makes it clearer and easier. Using Object feels to me a little old-fashioned and bogus, but again that's probably me :)

I do trust your experience though. If you think the object version is better that's fine with me. What I care about is that this improvement gets committed soon, since it's a really useful one ;)

Thanks a lot for sharing your ideas.

PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

Key: LUCENE-4906
URL: https://issues.apache.org/jira/browse/LUCENE-4906
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4906.patch, LUCENE-4906.patch

For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient...

Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet.
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736242#comment-13736242 ]

Luca Cavanna commented on LUCENE-4906:

One more thing: re-reading Robert's previous comments, I also find interesting the idea he had about changing the API to return a proper object instead of the Map<String, String[]>, or the String[] for the simplest methods. I wonder if it's worth addressing this as well in this issue, or if the current API is clear enough in your opinion. Any thoughts?

PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

Key: LUCENE-4906
URL: https://issues.apache.org/jira/browse/LUCENE-4906
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4906.patch, LUCENE-4906.patch

For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient...

Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet.
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13734623#comment-13734623 ]

Luca Cavanna commented on LUCENE-4906:

Hi Mike,
I had a look at your patch, looks good to me. Being able to get back arbitrary objects is a great improvement.

The only thing I would love to improve here is the need to cast the returned Objects to the type that the custom PassageFormatter uses. We could work around this using generics, but the fact that the PassageFormatter can vary per field makes it harder. The only way I see to work around this is to prevent the PassageFormatter from returning different types of objects per field. That would mean that even though every field can have its own PassageFormatter, they all must return the same type. It kinda makes sense to me, since I wouldn't want to have heterogeneous types in the Map<Integer, Object>, but that is something that's currently possible. What do you think?

PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

Key: LUCENE-4906
URL: https://issues.apache.org/jira/browse/LUCENE-4906
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4906.patch

For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient...

Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet.
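The trade-off raised in this comment, casting Object results versus declaring the result type with generics, can be sketched in isolation. Everything below is hypothetical and simplified for illustration: `ObjectFormatter`, `TypedFormatter` and the two `highlight*` helpers are assumptions made up for this example and are not taken from the patch or from the Lucene API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical untyped formatter: callers get Object back and must cast.
abstract class ObjectFormatter {
    abstract Object format(String snippet);
}

// Hypothetical typed formatter: the result type is declared once, up front.
abstract class TypedFormatter<T> {
    abstract T format(String snippet);
}

class CastVersusGenerics {
    // Untyped variant: docId -> Object, so the value type is unknown to the compiler.
    static Map<Integer, Object> highlightUntyped(Map<Integer, String> snippetsByDoc,
                                                 ObjectFormatter formatter) {
        Map<Integer, Object> out = new TreeMap<>();
        for (Map.Entry<Integer, String> e : snippetsByDoc.entrySet()) {
            out.put(e.getKey(), formatter.format(e.getValue()));
        }
        return out;
    }

    // Typed variant: docId -> T, all formatters used here must agree on T.
    static <T> Map<Integer, T> highlightTyped(Map<Integer, String> snippetsByDoc,
                                              TypedFormatter<T> formatter) {
        Map<Integer, T> out = new TreeMap<>();
        for (Map.Entry<Integer, String> e : snippetsByDoc.entrySet()) {
            out.put(e.getKey(), formatter.format(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, String> snippets = new TreeMap<>();
        snippets.put(0, "foo bar");
        snippets.put(3, "baz");

        Map<Integer, Object> untyped = highlightUntyped(snippets, new ObjectFormatter() {
            @Override Object format(String s) { return Arrays.asList(s.split(" ")); }
        });
        @SuppressWarnings("unchecked")
        List<String> viaCast = (List<String>) untyped.get(0); // unchecked cast needed

        Map<Integer, List<String>> typed = highlightTyped(snippets, new TypedFormatter<List<String>>() {
            @Override List<String> format(String s) { return Arrays.asList(s.split(" ")); }
        });
        List<String> noCast = typed.get(0); // no cast needed

        System.out.println(viaCast + " " + noCast);
    }
}
```

The typed variant only works if every formatter involved returns the same type, which is the restriction the comment proposes.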
[jira] [Updated] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Cavanna updated LUCENE-4906:

Attachment: LUCENE-4906.patch

I don't see why adding generics would complicate or limit the API. To me it would make it simpler and nicer (not a big change in terms of the API itself though).

Attaching a patch with my thoughts to make more concrete what I had in mind, regardless of whether it will be integrated or not. It's backwards compatible (even though the class is marked experimental): we have an abstract postings highlighter class that does most of the work and returns arbitrary objects (it uses generics in order to do so). The PostingsHighlighter is its natural extension that returns String snippets.

I updated Mike's new test according to my changes. It should make it easier to understand what's needed to work with arbitrary objects in terms of code using this approach. If you want to extend the abstract class you have to declare what type the formatter is supposed to return, which makes the contract explicit and avoids any cast.

Limitations with this approach:
1) As mentioned before (to me it's more of a benefit), there cannot be heterogeneous types returned by the same highlighter instance.
2) Generics don't play well with arrays, thus all the highlight methods that returned arrays are still in the subclass that returns string snippets, to keep backwards compatibility. Moving them to the base class would most likely require returning List<FormattedPassage> instead (not backward compatible).

I haven't updated the javadoc yet, but if you like my approach I can go ahead with it. I would love to hear what you guys think about it. Generics can be scary... but useful sometimes too ;)

PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

Key: LUCENE-4906
URL: https://issues.apache.org/jira/browse/LUCENE-4906
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Attachments: LUCENE-4906.patch, LUCENE-4906.patch

For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient...

Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet.
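The layering described in this comment (a generic abstract base class that does the work, plus a String-returning subclass that keeps the array-based methods) might look roughly like the sketch below. The attached patch is not reproduced in this thread, so every name here (`SketchFormatter`, `AbstractSketchHighlighter`, `StringSketchHighlighter`, `highlightToArray`) is a hypothetical illustration rather than the patch's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical formatter type, parameterized by the object it produces.
abstract class SketchFormatter<T> {
    abstract T format(String snippet);
}

// Generic base class: does the shared work and returns List<T>,
// because a T[] cannot be instantiated without knowing the concrete type.
abstract class AbstractSketchHighlighter<T> {
    protected abstract SketchFormatter<T> getFormatter(String field);

    public List<T> highlight(String field, List<String> snippets) {
        SketchFormatter<T> formatter = getFormatter(field);
        List<T> out = new ArrayList<>(snippets.size());
        for (String snippet : snippets) {
            out.add(formatter.format(snippet));
        }
        return out;
    }
}

// String-returning subclass: the element type is known here,
// so the array-based convenience method can stay for backwards compatibility.
class StringSketchHighlighter extends AbstractSketchHighlighter<String> {
    @Override
    protected SketchFormatter<String> getFormatter(String field) {
        return new SketchFormatter<String>() {
            @Override
            String format(String snippet) {
                return "<b>" + snippet + "</b>";
            }
        };
    }

    public String[] highlightToArray(String field, List<String> snippets) {
        return highlight(field, snippets).toArray(new String[0]);
    }
}
```

Because the base class is bound to a single T, one highlighter instance cannot mix result types across fields, which matches limitation 1 above.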
[jira] [Commented] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13713487#comment-13713487 ]

Luca Cavanna commented on LUCENE-5057:

Thanks Adrien for looking into this, nice explanation!

Hunspell stemmer generates multiple tokens

Key: LUCENE-5057
URL: https://issues.apache.org/jira/browse/LUCENE-5057
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna
Assignee: Adrien Grand

The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases, but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works, only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works.

Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming, and a boolean query is generated out of it. All is fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originating from exactly the same original word.

Example for the Dutch language I'm working with: fiets (means bicycle in Dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get only fiets indexed. When I query for "fietsen whatever" I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I also wonder if it can be a dictionary issue, since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue.

I would love to contribute on this and I'm looking forward to your feedback.
[jira] [Created] (LUCENE-5057) Hunspell stemmer generates multiple tokens (original + stems)
Luca Cavanna created LUCENE-5057:

Summary: Hunspell stemmer generates multiple tokens (original + stems)
Key: LUCENE-5057
URL: https://issues.apache.org/jira/browse/LUCENE-5057
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna

The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases, but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming, and a boolean query is generated out of it. All is fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originating from exactly the same original word.

Example for the Dutch language I'm working with: fiets (means bicycle in Dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get only fiets indexed. When I query for "fietsen whatever" I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help deciding the name of the option and what the default behaviour should be.
[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Cavanna updated LUCENE-5057:

Summary: Hunspell stemmer generates multiple tokens (was: Hunspell stemmer generates multiple tokens (original + stems))

Hunspell stemmer generates multiple tokens

Key: LUCENE-5057
URL: https://issues.apache.org/jira/browse/LUCENE-5057
Project: Lucene - Core
Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna

The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases, but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming, and a boolean query is generated out of it. All is fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originating from exactly the same original word.

Example for the Dutch language I'm working with: fiets (means bicycle in Dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get only fiets indexed. When I query for "fietsen whatever" I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help deciding the name of the option and what the default behaviour should be.
[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated LUCENE-5057: - Description: The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed. When I look at how snowball works, only one token is indexed, the stem, and that works great. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by the exactly same original word. Example for the dutch language I'm working with: fiets (means bicycle in dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get the only fiets indexed. When I query for fietsen whatever I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? I would work out a patch, I'd just need some help deciding the name of the option and what the default behaviour should be. was: The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. 
It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by the exactly same original word. Example for the dutch language I'm working with: fiets (means bicycle in dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get the only fiets indexed. When I query for fietsen whatever I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? I would work out a patch, I'd just need some help deciding the name of the option and what the default behaviour should be. Hunspell stemmer generates multiple tokens -- Key: LUCENE-5057 URL: https://issues.apache.org/jira/browse/LUCENE-5057 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.3 Reporter: Luca Cavanna The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed. 
When I look at how snowball works, only one token is indexed, the stem, and that works great. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by the exactly same original word. Example for the dutch language I'm working with: fiets (means bicycle in dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get the only fiets indexed. When I query for fietsen whatever I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained
[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated LUCENE-5057: - Description: The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by the exactly same original word. Example for the dutch language I'm working with: fiets (means bicycle in dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get the only fiets indexed. When I query for fietsen whatever I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? I wonder if it can be a dictionary issue since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue. 
was: The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed. When I look at how snowball works, only one token is indexed, the stem, and that works great. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by the exactly same original word. Example for the dutch language I'm working with: fiets (means bicycle in dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get the only fiets indexed. When I query for fietsen whatever I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? I would work out a patch, I'd just need some help deciding the name of the option and what the default behaviour should be. Hunspell stemmer generates multiple tokens -- Key: LUCENE-5057 URL: https://issues.apache.org/jira/browse/LUCENE-5057 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.3 Reporter: Luca Cavanna The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. 
It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by the exactly same original word. Example for
[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens
[ https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated LUCENE-5057: - Description: The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by the exactly same original word. Example for the dutch language I'm working with: fiets (means bicycle in dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get the only fiets indexed. When I query for fietsen whatever I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? I also wonder if it can be a dictionary issue since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue. 
I would love to contribute on this and looking forward to your feedback. was: The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is generated out of it. All fine when adding all clauses as should (OR operator), but if I add all clauses as must (AND operator), then I can get back only the documents that contain the stem originated by the exactly same original word. Example for the dutch language I'm working with: fiets (means bicycle in dutch), its plural is fietsen. If I index fietsen I get both fietsen and fiets indexed, but if I index fiets I get the only fiets indexed. When I query for fietsen whatever I get the following boolean query: field:fiets field:fietsen field:whatever. If I apply the AND operator and use must clauses for each subquery, then I can only find the documents that originally contained fietsen, not the ones that originally contained fiets, which is not really what stemming is about. Any thoughts on this? I wonder if it can be a dictionary issue since I see that different words that have the word fiets as root don't get the same stems, and using the AND operator at query time is a big issue. 
Hunspell stemmer generates multiple tokens -- Key: LUCENE-5057 URL: https://issues.apache.org/jira/browse/LUCENE-5057 Project: Lucene - Core Issue Type: Improvement Affects Versions: 4.3 Reporter: Luca Cavanna The hunspell stemmer seems to be generating multiple tokens: the original token plus the available stems. It might be a good thing in some cases but it seems to be a different behaviour compared to the other stemmers and causes problems as well. I would rather have an option to decide whether it should output only the available stems, or the stems plus the original token. I'm not sure though if it's possible to have only a single stem indexed, which would be even better in my opinion. When I look at how snowball works only one token is indexed, the stem, and that works great. Probably there's something I'm missing in how hunspell works. Here is my issue: I have a query composed of multiple terms, which is analyzed using stemming and a boolean query is
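The AND-versus-OR behaviour reported in LUCENE-5057 can be reproduced with a toy model. The sketch below uses no Lucene classes: the stem table, the class name and the method names are all hypothetical, and the "analyzer" simply emits the original token plus its stems, which is the behaviour the report attributes to the hunspell stemmer.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model only; none of these names exist in Lucene.
final class HunspellAndQueryDemo {

    // Hypothetical stem table standing in for a Hunspell dictionary.
    static final Map<String, List<String>> STEMS = Map.of("fietsen", List.of("fiets"));

    // Emits the original token plus every available stem, as the report describes.
    static Set<String> analyze(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            tokens.add(word);
            tokens.addAll(STEMS.getOrDefault(word, List.of()));
        }
        return tokens;
    }

    // MUST (AND) semantics: every query token has to be present in the document.
    static boolean matchesAll(Set<String> docTokens, Set<String> queryTokens) {
        return docTokens.containsAll(queryTokens);
    }

    // SHOULD (OR) semantics: a single shared token is enough.
    static boolean matchesAny(Set<String> docTokens, Set<String> queryTokens) {
        for (String token : queryTokens) {
            if (docTokens.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> query = analyze("fietsen"); // fietsen, fiets
        System.out.println(matchesAll(analyze("fietsen"), query)); // document contained the plural
        System.out.println(matchesAll(analyze("fiets"), query));   // document contained the singular
        System.out.println(matchesAny(analyze("fiets"), query));   // same document, OR semantics
    }
}
```

In this model the document that contained only fiets is missed under AND because the query also requires the unstemmed token fietsen. Emitting only the stems on both sides would make the two token sets agree, which is the option the report asks for.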
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629961#comment-13629961 ] Luca Cavanna commented on LUCENE-4906: -- Sounds interesting. Is anybody working on this already? I'd like to volunteer. What do you have in mind exactly? Now the format method returns a string. What would you like to see as output instead? PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects -- Key: LUCENE-4906 URL: https://issues.apache.org/jira/browse/LUCENE-4906 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient... Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13629973#comment-13629973 ] Luca Cavanna commented on LUCENE-4906: -- I see! If you couldn't make it, how can I? :) But is the idea that you could have some kind of object as output instead of a string, for example an array of tokens plus maybe some more information? That would avoid having to parse the string output again and somehow re-analyze the text to get a snippet that we could provide as output directly? PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects -- Key: LUCENE-4906 URL: https://issues.apache.org/jira/browse/LUCENE-4906 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient... Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects
[ https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630227#comment-13630227 ] Luca Cavanna commented on LUCENE-4906: -- Sure, I hadn't seen that issue yet but I was about to propose the same looking at the code. Thanks for your insight! I thought about generics too, but then we'd have to be really careful otherwise the generics policeman jumps in :) I'll play around with some ideas and post the results here as soon as I have something. PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects -- Key: LUCENE-4906 URL: https://issues.apache.org/jira/browse/LUCENE-4906 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless For example, in a server, I may want to render the highlight result to JsonObject to send back to the front-end. Today since we render to string, I have to render to JSON string and then re-parse to JsonObject, which is inefficient... Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls terms from snippets instead, so you get proximity-influenced salient/expanded terms, then perhaps that renders to just an array of tokens or fragments or something from each snippet. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
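One way to picture the generics idea mentioned in this thread is a formatter parameterized on its output type. The sketch below is only an illustration under stated assumptions: SimplePassage and GenericPassageFormatter are hypothetical names invented here, not the Lucene Passage or PassageFormatter classes, and the patch eventually attached to the issue may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; these types are hypothetical and not part of Lucene.
final class GenericFormatterSketch {

    // Hypothetical stand-in for a passage: its offsets in the content plus match offsets.
    static final class SimplePassage {
        final int start;
        final int end;
        final int[][] matches; // {matchStart, matchEnd} pairs, sorted and non-overlapping

        SimplePassage(int start, int end, int[][] matches) {
            this.start = start;
            this.end = end;
            this.matches = matches;
        }
    }

    // T is whatever the caller wants back: a String, a list of tokens, a JSON object...
    interface GenericPassageFormatter<T> {
        T format(List<SimplePassage> passages, String content);
    }

    // Renders to a String, wrapping each match in <b> tags.
    static final class StringFormatter implements GenericPassageFormatter<String> {
        @Override
        public String format(List<SimplePassage> passages, String content) {
            StringBuilder sb = new StringBuilder();
            for (SimplePassage p : passages) {
                if (sb.length() > 0) {
                    sb.append("... ");
                }
                int pos = p.start;
                for (int[] m : p.matches) {
                    sb.append(content, pos, m[0]);
                    sb.append("<b>").append(content, m[0], m[1]).append("</b>");
                    pos = m[1];
                }
                sb.append(content, pos, p.end);
            }
            return sb.toString();
        }
    }

    // Renders to the list of matched fragments, with no markup to re-parse.
    static final class TokenListFormatter implements GenericPassageFormatter<List<String>> {
        @Override
        public List<String> format(List<SimplePassage> passages, String content) {
            List<String> tokens = new ArrayList<>();
            for (SimplePassage p : passages) {
                for (int[] m : p.matches) {
                    tokens.add(content.substring(m[0], m[1]));
                }
            }
            return tokens;
        }
    }
}
```

With this shape a caller that needs structured output picks a formatter whose T matches what it wants, so the string-then-reparse round trip described in the issue is avoided.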
[jira] [Updated] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class
[ https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated LUCENE-4896: - Attachment: LUCENE-4896.patch No problem. It just felt weird to write an abstract class that looked to me like an interface. Should be better now. I also made append protected. PostingsHighlighter should use a interface of PassageFormatter instead of a class - Key: LUCENE-4896 URL: https://issues.apache.org/jira/browse/LUCENE-4896 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 4.2 Environment: NA Reporter: Sebastien Dionne Labels: newdev Attachments: LUCENE-4896.patch, LUCENE-4896.patch In my project I need a custom PassageFormatter to use with PostingsHighlighter. I extended PassageFormatter to override format(...) but if I do that, I don't have access to the private variables. So instead of changing the scope to protected, it should be more usefull to use a interface for PassageFormatter. like public DefaultPassageFormatter implements PassageFormatter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class
[ https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630392#comment-13630392 ] Luca Cavanna commented on LUCENE-4896: -- Just read your last comment, going to add another patch :) PostingsHighlighter should use a interface of PassageFormatter instead of a class - Key: LUCENE-4896 URL: https://issues.apache.org/jira/browse/LUCENE-4896 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 4.2 Environment: NA Reporter: Sebastien Dionne Labels: newdev Attachments: LUCENE-4896.patch, LUCENE-4896.patch In my project I need a custom PassageFormatter to use with PostingsHighlighter. I extended PassageFormatter to override format(...) but if I do that, I don't have access to the private variables. So instead of changing the scope to protected, it should be more usefull to use a interface for PassageFormatter. like public DefaultPassageFormatter implements PassageFormatter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class
[ https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated LUCENE-4896: - Attachment: LUCENE-4896.patch Here it is. The lucene.experimental was already there, I left it only in the abstract class. Fixed format javadocs too. PostingsHighlighter should use a interface of PassageFormatter instead of a class - Key: LUCENE-4896 URL: https://issues.apache.org/jira/browse/LUCENE-4896 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 4.2 Environment: NA Reporter: Sebastien Dionne Labels: newdev Attachments: LUCENE-4896.patch, LUCENE-4896.patch, LUCENE-4896.patch In my project I need a custom PassageFormatter to use with PostingsHighlighter. I extended PassageFormatter to override format(...) but if I do that, I don't have access to the private variables. So instead of changing the scope to protected, it should be more usefull to use a interface for PassageFormatter. like public DefaultPassageFormatter implements PassageFormatter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class
[ https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630441#comment-13630441 ] Luca Cavanna commented on LUCENE-4896: -- I see what you mean Robert, thanks a lot for your explanation. I would have probably ended up with an interface + abstract class then ;) Let's see what I can come up with for LUCENE-4906... PostingsHighlighter should use a interface of PassageFormatter instead of a class - Key: LUCENE-4896 URL: https://issues.apache.org/jira/browse/LUCENE-4896 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 4.2 Environment: NA Reporter: Sebastien Dionne Labels: newdev Attachments: LUCENE-4896.patch, LUCENE-4896.patch, LUCENE-4896.patch In my project I need a custom PassageFormatter to use with PostingsHighlighter. I extended PassageFormatter to override format(...) but if I do that, I don't have access to the private variables. So instead of changing the scope to protected, it should be more usefull to use a interface for PassageFormatter. like public DefaultPassageFormatter implements PassageFormatter. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4825) PostingsHighlighter support for positional queries
[ https://issues.apache.org/jira/browse/LUCENE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13600918#comment-13600918 ] Luca Cavanna commented on LUCENE-4825: -- Hey Robert, sorry, but I don't quite understand why it would become an orange. :) I mean, the PostingsHighlighter does (among other things) two great things: 1) it reads offsets from the postings list, as its name says; 2) it summarizes the content, giving nice sentences as output. I think these two features are a great improvement and pretty much what everybody would like to have! I'm proposing to add support for positional queries as a third, optional feature. We would need to read the spans from the positional queries in order to highlight only the proper terms; otherwise the output is wrong from a user's perspective. Would this make it that much slower? I don't mean to reanalyze the text... Don't get me wrong, you must be right, but I would like to understand more. Are you saying that instead of adding 3) to 1) and 2) we should have another highlighter that does 1), 2) and 3)? PostingsHighlighter support for positional queries -- Key: LUCENE-4825 URL: https://issues.apache.org/jira/browse/LUCENE-4825 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 4.2 Reporter: Luca Cavanna I've been playing around with the brand new PostingsHighlighter. I'm really happy with the result in terms of quality of the snippets and performance. On the other hand, I noticed it doesn't support positional queries. If you make a span query, for example, all the single terms will be highlighted, even though they haven't contributed to the match. That reminds me of the difference between the QueryTermScorer and the QueryScorer (using the standard Highlighter). I've been trying to adapt what the QueryScorer does, especially the extraction of the query terms together with their positions (what WeightedSpanTermExtractor does). 
Next step would be to take that information into account within the formatter and highlight only the terms that actually contributed to the match. I'm not quite ready yet with a patch to contribute this back, but I certainly intend to do so. That's why I opened the issue and in the meantime I would like to hear what you guys think about it and discuss how best we can fix it. I think it would be a big improvement for this new highlighter, which is already great! -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-4825) PostingsHighlighter support for positional queries
Luca Cavanna created LUCENE-4825: Summary: PostingsHighlighter support for positional queries Key: LUCENE-4825 URL: https://issues.apache.org/jira/browse/LUCENE-4825 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 4.2 Reporter: Luca Cavanna I've been playing around with the brand new PostingsHighlighter. I'm really happy with the result in terms of quality of the snippets and performance. On the other hand, I noticed it doesn't support positional queries. If you make a span query, for example, all the single terms will be highlighted, even though they haven't contributed to the match. That reminds me of the difference between the QueryTermScorer and the QueryScorer (using the standard Highlighter). I've been trying to adapt what the QueryScorer does, especially the extraction of the query terms together with their positions (what WeightedSpanTermExtractor does). Next step would be to take that information into account within the formatter and highlight only the terms that actually contributed to the match. I'm not quite ready yet with a patch to contribute this back, but I certainly intend to do so. That's why I opened the issue and in the meantime I would like to hear what you guys think about it and discuss how best we can fix it. I think it would be a big improvement for this new highlighter, which is already great! -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
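The difference between term-level and position-aware highlighting described in LUCENE-4825 can be shown with a small stand-alone sketch. It works on plain token lists rather than on Lucene postings or spans, treats the positional query as an exact phrase, and every name in it is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Toy model only; it does not use PostingsHighlighter or any span query classes.
final class PhraseHighlightSketch {

    // Term-level highlighting: every position holding any query term is marked.
    static Set<Integer> termPositions(List<String> doc, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        for (int i = 0; i < doc.size(); i++) {
            if (phrase.contains(doc.get(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Position-aware highlighting: only positions inside a full, in-order phrase match.
    static Set<Integer> phrasePositions(List<String> doc, List<String> phrase) {
        Set<Integer> hits = new TreeSet<>();
        for (int i = 0; i + phrase.size() <= doc.size(); i++) {
            if (doc.subList(i, i + phrase.size()).equals(phrase)) {
                for (int j = 0; j < phrase.size(); j++) {
                    hits.add(i + j);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("new", "cars", "in", "new", "york");
        List<String> phrase = Arrays.asList("new", "york");
        System.out.println(termPositions(doc, phrase));   // [0, 3, 4]
        System.out.println(phrasePositions(doc, phrase)); // [3, 4]
    }
}
```

Position 0 is the kind of highlight the issue complains about: the term appears in the query, but that occurrence did not contribute to the match.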
[jira] [Commented] (LUCENE-4825) PostingsHighlighter support for positional queries
[ https://issues.apache.org/jira/browse/LUCENE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13600572#comment-13600572 ] Luca Cavanna commented on LUCENE-4825: -- Thanks for your inputs Robert! I see your point, even though from a user perspective I'd rather see only the complete phrase highlighted if I make a phrase query, not every single term. I think we can currently achieve this only like the old highlighter does, am I right? Maybe we can make this pluggable and have different implementations? PostingsHighlighter support for positional queries -- Key: LUCENE-4825 URL: https://issues.apache.org/jira/browse/LUCENE-4825 Project: Lucene - Core Issue Type: Improvement Components: modules/highlighter Affects Versions: 4.2 Reporter: Luca Cavanna I've been playing around with the brand new PostingsHighlighter. I'm really happy with the result in terms of quality of the snippets and performance. On the other hand, I noticed it doesn't support positional queries. If you make a span query, for example, all the single terms will be highlighted, even though they haven't contributed to the match. That reminds me of the difference between the QueryTermScorer and the QueryScorer (using the standard Highlighter). I've been trying to adapt what the QueryScorer does, especially the extraction of the query terms together with their positions (what WeightedSpanTermExtractor does). Next step would be to take that information into account within the formatter and highlight only the terms that actually contributed to the match. I'm not quite ready yet with a patch to contribute this back, but I certainly intend to do so. That's why I opened the issue and in the meantime I would like to hear what you guys think about it and discuss how best we can fix it. I think it would be a big improvement for this new highlighter, which is already great! -- This message is automatically generated by JIRA. 
[jira] [Commented] (LUCENE-3976) Improve error messages for unsupported Hunspell formats
[ https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13287260#comment-13287260 ] Luca Cavanna commented on LUCENE-3976: -- Hi Chris, I agree with you. On the other hand, with the affix rule mentioned, before LUCENE-4019 we had an AOE, so the additional catch would have been useful just to throw a nicer error message like "Error while parsing the affix file". That one has been solved at its source. For now I don't see any other possible errors, but I'm sure there are some, maybe plenty, since we support only a subset of the formats and features. It was just a way to introduce a generic error message, but I totally agree that the right approach would be fixing everything at the source. Improve error messages for unsupported Hunspell formats --- Key: LUCENE-3976 URL: https://issues.apache.org/jira/browse/LUCENE-3976 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3976.patch, LUCENE-3976.patch Our hunspell implementation is never going to be able to support the huge variety of formats that are out there, especially since our impl is based on papers written on the topic rather than being a pure port. Recently we ran into the following suffix rule: {noformat}SFX CA 0 /CaCp{noformat} Due to the missing regex conditional, an AOE was being thrown, which made it difficult to diagnose the problem. We should instead try to provide better error messages showing what we were unable to parse. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated LUCENE-4019: - Attachment: LUCENE-4019.patch Hi Chris, thanks for your feedback. Here is a new patch containing a new option in order to enable/disable the affix strict parsing, by default it is enabled. I updated the HunspellStemFilterFactory too in order to expose the new option to Solr. Parsing Hunspell affix rules without regexp condition - Key: LUCENE-4019 URL: https://issues.apache.org/jira/browse/LUCENE-4019 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6 Reporter: Luca Cavanna Assignee: Chris Male Attachments: LUCENE-4019.patch, LUCENE-4019.patch We found out that some recent Dutch hunspell dictionaries contain suffix or prefix rules like the following: {code} SFX Na N 1 SFX Na 0 ste {code} The rule on the second line doesn't contain the 5th parameter, which should be the condition (a regexp usually). You can usually see a '.' as condition, meaning always (for every character). As explained in LUCENE-3976 the readAffix method throws error. I wonder if we should treat the missing value as a kind of default value, like '.'. On the other hand I haven't found any information about this within the spec. Any thoughts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
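The strict/lenient switch discussed for LUCENE-4019 can be sketched roughly as follows. This is not the patch attached to the issue: the class and method names are hypothetical, the parser only looks at SFX/PFX lines, and carrying the line number in the ParseException error offset is an assumption made for this example.

```java
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;

// Rough sketch only; not the HunspellDictionary code from the patch.
final class AffixRuleParserSketch {

    // Returns the accepted rule lines, split into their elements.
    // strict = true: a rule with fewer than 5 elements raises ParseException,
    //                with the 1-based line number as its error offset (an assumption).
    // strict = false: such a rule is silently skipped.
    static List<String[]> parse(String affixData, boolean strict) throws ParseException {
        List<String[]> rules = new ArrayList<>();
        int remaining = 0; // rule lines still expected after the last header line
        int lineNumber = 0;
        for (String line : affixData.split("\r?\n")) {
            lineNumber++;
            String[] parts = line.trim().split("\\s+");
            if (!parts[0].equals("SFX") && !parts[0].equals("PFX")) {
                continue; // blank lines and other directives are out of scope here
            }
            if (remaining == 0) {
                // Header line: type, flag, cross-product marker, number of rules.
                if (parts.length < 4) {
                    throw new ParseException("Malformed affix header: " + line, lineNumber);
                }
                remaining = Integer.parseInt(parts[3]);
                continue;
            }
            remaining--;
            if (parts.length < 5) {
                // Rule line without the 5th element (the condition).
                if (strict) {
                    throw new ParseException("Affix rule has no condition: " + line, lineNumber);
                }
                continue;
            }
            rules.add(parts);
        }
        return rules;
    }
}
```

Fed the two lines quoted in the issue (SFX Na N 1 followed by SFX Na 0 ste), the strict call fails and reports line 2, while the lenient call returns an empty list and the rest of the dictionary stays usable.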
[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated LUCENE-4019: - Attachment: LUCENE-4019.patch Yeah, sorry for my mistakes, I corrected them. And I added the line number to the ParseException. Let me know if there's something more I can do! Parsing Hunspell affix rules without regexp condition - Key: LUCENE-4019 URL: https://issues.apache.org/jira/browse/LUCENE-4019 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6 Reporter: Luca Cavanna Assignee: Chris Male Attachments: LUCENE-4019.patch, LUCENE-4019.patch, LUCENE-4019.patch We found out that some recent Dutch hunspell dictionaries contain suffix or prefix rules like the following: {code} SFX Na N 1 SFX Na 0 ste {code} The rule on the second line doesn't contain the 5th parameter, which should be the condition (a regexp usually). You can usually see a '.' as condition, meaning always (for every character). As explained in LUCENE-3976 the readAffix method throws error. I wonder if we should treat the missing value as a kind of default value, like '.'. On the other hand I haven't found any information about this within the spec. Any thoughts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269516#comment-13269516 ] Luca Cavanna commented on LUCENE-4019: -- Thank you Robert for the explanation! In this specific case it's hard to understand the differences between hunspell and Lucene, since Lucene doesn't even parse the affix file. I've been in contact with the authors of those Dutch dictionaries, as well as with the hunspell author. It turned out that those affix rules are wrong and hunspell actually ignores them. I think it's better to ignore them in Lucene too, rather than throwing an exception, which makes it impossible to use those dictionaries at all. Parsing Hunspell affix rules without regexp condition - Key: LUCENE-4019 URL: https://issues.apache.org/jira/browse/LUCENE-4019 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6 Reporter: Luca Cavanna We found out that some recent Dutch hunspell dictionaries contain suffix or prefix rules like the following: {code} SFX Na N 1 SFX Na 0 ste {code} The rule on the second line doesn't contain the 5th parameter, which should be the condition (a regexp usually). You can usually see a '.' as condition, meaning always (for every character). As explained in LUCENE-3976 the readAffix method throws error. I wonder if we should treat the missing value as a kind of default value, like '.'. On the other hand I haven't found any information about this within the spec. Any thoughts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Luca Cavanna updated LUCENE-4019: - Attachment: LUCENE-4019.patch Small patch: affix rules with less than 5 elements are now ignored. I added a specific test with a new affix file containing an example of rule shorter than it should be. Let me know if you prefer to add a warning when a rule is skipped. Hunspell does that only with a specific command line option. Parsing Hunspell affix rules without regexp condition - Key: LUCENE-4019 URL: https://issues.apache.org/jira/browse/LUCENE-4019 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Affects Versions: 3.6 Reporter: Luca Cavanna Attachments: LUCENE-4019.patch We found out that some recent Dutch hunspell dictionaries contain suffix or prefix rules like the following: {code} SFX Na N 1 SFX Na 0 ste {code} The rule on the second line doesn't contain the 5th parameter, which should be the condition (a regexp usually). You can usually see a '.' as condition, meaning always (for every character). As explained in LUCENE-3976 the readAffix method throws error. I wonder if we should treat the missing value as a kind of default value, like '.'. On the other hand I haven't found any information about this within the spec. Any thoughts? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-3976) Improve error messages for unsupported Hunspell formats
[ https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269520#comment-13269520 ] Luca Cavanna commented on LUCENE-3976: -- The specific case of affix rule with less than 5 elements has been addressed in LUCENE-4019. Please ignore my first patch here since it's related to that specific case which is now handled in a different way in LUCENE-4019. I'm looking into improve error messages anyway, in a more generic way if possible. Improve error messages for unsupported Hunspell formats --- Key: LUCENE-3976 URL: https://issues.apache.org/jira/browse/LUCENE-3976 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3976.patch Our hunspell implementation is never going to be able to support the huge variety of formats that are out there, especially since our impl is based on papers written on the topic rather than being a pure port. Recently we ran into the following suffix rule: {noformat}SFX CA 0 /CaCp{noformat} Due to the missing regex conditional, an AOE was being thrown, which made it difficult to diagnose the problem. We should instead try to provide better error messages showing what we were unable to parse. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Issue Comment Edited] (LUCENE-3976) Improve error messages for unsupported Hunspell formats
[ https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13269520#comment-13269520 ] Luca Cavanna edited comment on LUCENE-3976 at 5/7/12 11:03 AM: --- The specific case of affix rule with less than 5 elements has been addressed in LUCENE-4019. Please ignore my first patch here since it's related to that specific case which is now handled in a different way in LUCENE-4019. I'm looking into improving error messages anyway, possibly in a more generic way. was (Author: lucacavanna): The specific case of affix rule with less than 5 elements has been addressed in LUCENE-4019. Please ignore my first patch here since it's related to that specific case which is now handled in a different way in LUCENE-4019. I'm looking into improve error messages anyway, in a more generic way if possible. Improve error messages for unsupported Hunspell formats --- Key: LUCENE-3976 URL: https://issues.apache.org/jira/browse/LUCENE-3976 Project: Lucene - Java Issue Type: Improvement Components: modules/analysis Reporter: Chris Male Attachments: LUCENE-3976.patch Our hunspell implementation is never going to be able to support the huge variety of formats that are out there, especially since our impl is based on papers written on the topic rather than being a pure port. Recently we ran into the following suffix rule: {noformat}SFX CA 0 /CaCp{noformat} Due to the missing regex conditional, an AOE was being thrown, which made it difficult to diagnose the problem. We should instead try to provide better error messages showing what we were unable to parse. -- This message is automatically generated by JIRA. 
[jira] [Issue Comment Edited] (LUCENE-3976) Improve error messages for unsupported Hunspell formats
[ https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269520#comment-13269520 ]

Luca Cavanna edited comment on LUCENE-3976 at 5/7/12 11:04 AM:
---
The specific case of affix rule with less than 5 elements has been addressed in LUCENE-4019. Please ignore my first patch here since it's related to that specific case which is now handled in a different way in LUCENE-4019. I'm looking into improving error messages anyway, possibly in a generic way.

was (Author: lucacavanna):
The specific case of affix rule with less than 5 elements has been addressed in LUCENE-4019. Please ignore my first patch here since it's related to that specific case which is now handled in a different way in LUCENE-4019. I'm looking into improving error messages anyway, possibly in a more generic way.

Improve error messages for unsupported Hunspell formats
---
Key: LUCENE-3976
URL: https://issues.apache.org/jira/browse/LUCENE-3976
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Reporter: Chris Male
Attachments: LUCENE-3976.patch

Our hunspell implementation is never going to be able to support the huge variety of formats that are out there, especially since our impl is based on papers written on the topic rather than being a pure port. Recently we ran into the following suffix rule:
{noformat}SFX CA 0 /CaCp{noformat}
Due to the missing regex conditional, an AOE was being thrown, which made it difficult to diagnose the problem. We should instead try to provide better error messages showing what we were unable to parse.
[jira] [Updated] (LUCENE-3976) Improve error messages for unsupported Hunspell formats
[ https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Cavanna updated LUCENE-3976:
---
Attachment: LUCENE-3976.patch

The patch tries to address unexpected errors while parsing affix files and dictionaries. I just added an outer try/catch that reports a generic "Error while parsing the affix/dictionary file" message, which in my opinion is better than eventually throwing some unchecked exception. Let me know if there's something else we can improve in the meantime.

Improve error messages for unsupported Hunspell formats
---
Key: LUCENE-3976
URL: https://issues.apache.org/jira/browse/LUCENE-3976
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Reporter: Chris Male
Attachments: LUCENE-3976.patch, LUCENE-3976.patch

Our hunspell implementation is never going to be able to support the huge variety of formats that are out there, especially since our impl is based on papers written on the topic rather than being a pure port. Recently we ran into the following suffix rule:
{noformat}SFX CA 0 /CaCp{noformat}
Due to the missing regex conditional, an AOE was being thrown, which made it difficult to diagnose the problem. We should instead try to provide better error messages showing what we were unable to parse.
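The outer try/catch described in this update can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: the class `SafeAffixParsing`, the method `readConditions`, and the exact message text are hypothetical, and the line number is passed in the `errorOffset` slot of `java.text.ParseException`.

```java
import java.text.ParseException;
import java.util.List;

// Hypothetical sketch, not Lucene's actual parser.
final class SafeAffixParsing {

    /** Returns the 5th column (the condition) of every rule line. */
    static String[] readConditions(List<String> lines) throws ParseException {
        int index = 0; // zero-based index of the line being parsed
        try {
            String[] conditions = new String[lines.size()];
            for (String line : lines) {
                String[] parts = line.trim().split("\\s+");
                // Throws ArrayIndexOutOfBoundsException when the rule has fewer than 5 elements.
                conditions[index] = parts[4];
                index++;
            }
            return conditions;
        } catch (RuntimeException e) {
            // Wrap any unchecked failure in one descriptive checked exception.
            // Assumption: the one-based line number is reported via errorOffset.
            ParseException wrapped = new ParseException(
                "Error while parsing the affix file at line " + (index + 1), index + 1);
            wrapped.initCause(e);
            throw wrapped;
        }
    }
}
```

With this shape the caller sees a single checked exception that names the offending line, while the original unchecked exception stays available as the cause.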
[jira] [Commented] (SOLR-3189) Removing a field using TemplateTransformer
[ https://issues.apache.org/jira/browse/SOLR-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269561#comment-13269561 ]

Luca Cavanna commented on SOLR-3189:
---
Guys, don't you think this feature could be useful? Is there something I can do to convince you? :)

Removing a field using TemplateTransformer
---
Key: SOLR-3189
URL: https://issues.apache.org/jira/browse/SOLR-3189
Project: Solr
Issue Type: Improvement
Components: contrib - DataImportHandler
Reporter: Luca Cavanna
Priority: Minor
Attachments: SOLR-3189.patch

While importing documents through DataImportHandler I need to remove some fields from the final SolrDocument before it's submitted to Solr. My use case: the import query returns an A column which I use to fill in the B field on the Solr instance. My Solr schema contains both the A and B fields, so they are both filled in through DIH. I'd like to force the deletion of A from the generated SolrDocument, since I need a value only in the B field and want to leave the A field empty. The only way I found is using ScriptTransformer, so I thought it could be useful to add this feature to the TemplateTransformer.
[jira] [Created] (SOLR-3443) Optimize hunspell dictionary loading with multiple cores
Luca Cavanna created SOLR-3443:
---
Summary: Optimize hunspell dictionary loading with multiple cores
Key: SOLR-3443
URL: https://issues.apache.org/jira/browse/SOLR-3443
Project: Solr
Issue Type: Improvement
Reporter: Luca Cavanna

The Hunspell dictionary is actually loaded into memory. Each core using hunspell loads its own dictionary, even if all the cores are using the same dictionary files. As a result, the same dictionary is loaded into memory multiple times, once for each core. I think we should share those dictionaries between all cores in order to optimize memory usage. In fact, let's say a dictionary takes 20MB in memory (this is what I measured): if you have 20 cores you are going to use 400MB only for dictionaries, which doesn't seem like a good idea to me.
[jira] [Commented] (SOLR-3443) Optimize hunspell dictionary loading with multiple cores
[ https://issues.apache.org/jira/browse/SOLR-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269573#comment-13269573 ]

Luca Cavanna commented on SOLR-3443:
---
The first thing I have in mind is a static map containing all loaded dictionaries, keyed by some kind of unique identifier, so that the same dictionary can be reused between cores. But my question is: is there a mechanism to share objects between cores in Solr? Is this the first time someone needs to share something between multiple cores? I'd like to hear your thoughts!

Optimize hunspell dictionary loading with multiple cores
---
Key: SOLR-3443
URL: https://issues.apache.org/jira/browse/SOLR-3443
Project: Solr
Issue Type: Improvement
Reporter: Luca Cavanna

The Hunspell dictionary is actually loaded into memory. Each core using hunspell loads its own dictionary, even if all the cores are using the same dictionary files. As a result, the same dictionary is loaded into memory multiple times, once for each core. I think we should share those dictionaries between all cores in order to optimize memory usage. In fact, let's say a dictionary takes 20MB in memory (this is what I measured): if you have 20 cores you are going to use 400MB only for dictionaries, which doesn't seem like a good idea to me.
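The "map of loaded dictionaries keyed by a unique identifier" idea from this comment could look roughly like the sketch below. Everything here is an assumption for illustration: `SharedDictionaryCache`, `getOrLoad`, and the choice of key are hypothetical names, not Solr or Lucene API, and `computeIfAbsent` requires Java 8 or later. An instance of this class would be held in a static field to be visible to all cores.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch of a dictionary cache shared between cores.
final class SharedDictionaryCache<D> {

    // Key: some unique identifier of the dictionary, e.g. the resolved
    // paths of the affix and dictionary files (an assumption).
    private final ConcurrentMap<String, D> cache = new ConcurrentHashMap<>();

    /** Returns the cached dictionary for the key, loading it on first request only. */
    D getOrLoad(String key, Function<String, D> loader) {
        return cache.computeIfAbsent(key, loader);
    }

    /** Number of distinct dictionaries currently held in memory. */
    int size() {
        return cache.size();
    }
}
```

Two cores asking for the same key would then receive the same instance, so the 20MB cost is paid once per distinct dictionary rather than once per core. The sketch leaves open how entries are released when the last core using a dictionary is unloaded.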
[jira] [Commented] (SOLR-3407) Field names with leading digits cause strange behavior when used with fl param
[ https://issues.apache.org/jira/browse/SOLR-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261605#comment-13261605 ]

Luca Cavanna commented on SOLR-3407:
---
I guess this is a side effect coming from SOLR-2444. Solr already had similar problems (SOLR-2719: a hyphen within the field name didn't work anymore). Some kind of validation would help (SOLR-3207), but it's not trivial to do since there are dynamic fields too; I think Solr should guarantee that allowed field names always work, or at least make clearer what restrictions there are.

Field names with leading digits cause strange behavior when used with fl param
---
Key: SOLR-3407
URL: https://issues.apache.org/jira/browse/SOLR-3407
Project: Solr
Issue Type: Bug
Components: search
Affects Versions: 4.0
Environment: apache-solr-4.0-2012-04-24_08-27-47
Reporter: Chris Bleakley

When specifying a field name that starts with a digit (or digits) in the fl parameter, solr returns both the field name and field value as those digits. For example, using nightly build apache-solr-4.0-2012-04-24_08-27-47 I run:
java -jar start.jar
and
java -jar post.jar solr.xml monitor.xml
If I then add a field to the field list that starts with a digit ( localhost:8983/solr/select?q=*:*&fl=24 ) the results look like:
...
<doc>
<long name="24">24</long>
</doc>
...
if I try fl=24_7 it looks like everything after the underscore is truncated
...
<doc>
<long name="24">24</long>
</doc>
...
and if I try fl=3test it looks like everything after the last digit is truncated
...
<doc>
<long name="3">3</long>
</doc>
...
If I have an actual value for that field (say I've indexed 24_7 to be true ) I get back that value as well as the behavior above.
...
<doc>
<bool name="24_7">true</bool>
<long name="24">24</long>
</doc>
...
[jira] [Updated] (LUCENE-3976) Improve error messages for unsupported Hunspell formats
[ https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Cavanna updated LUCENE-3976:
---
Attachment: LUCENE-3976.patch

First draft patch: I added a check for that specific problem with an understandable error. In fact, since we are going to read the first 5 elements from an array, it's better to check that there are at least 5 elements. Not sure how we can improve generic error handling. Let me know your thoughts.

Improve error messages for unsupported Hunspell formats
---
Key: LUCENE-3976
URL: https://issues.apache.org/jira/browse/LUCENE-3976
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Reporter: Chris Male
Attachments: LUCENE-3976.patch

Our hunspell implementation is never going to be able to support the huge variety of formats that are out there, especially since our impl is based on papers written on the topic rather than being a pure port. Recently we ran into the following suffix rule:
{noformat}SFX CA 0 /CaCp{noformat}
Due to the missing regex conditional, an AOE was being thrown, which made it difficult to diagnose the problem. We should instead try to provide better error messages showing what we were unable to parse.
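The check described in this update (verify that at least 5 elements exist before reading them) might be sketched as below. This is a minimal illustration, not the attached patch: `AffixRuleCheck`, `splitRule`, and the message wording are hypothetical, and the line number is carried in the `errorOffset` argument of `java.text.ParseException`.

```java
import java.text.ParseException;

// Hypothetical sketch, not Lucene's actual parser.
final class AffixRuleCheck {

    /**
     * Splits an affix rule line on whitespace and fails with a readable
     * message when fewer than 5 elements are present.
     */
    static String[] splitRule(String line, int lineNumber) throws ParseException {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 5) {
            throw new ParseException(
                "Affix rule has " + parts.length + " elements, expected at least 5: \"" + line + "\"",
                lineNumber);
        }
        return parts;
    }
}
```

Applied to the rule from the issue description, `SFX CA 0 /CaCp` splits into 4 elements, so the helper reports the offending line instead of failing later with an array index error.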
[jira] [Commented] (LUCENE-3976) Improve error messages for unsupported Hunspell formats
[ https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260490#comment-13260490 ]

Luca Cavanna commented on LUCENE-3976:
---
We found out that some recent Dutch dictionaries contain rules like the one mentioned (starting from version 2.00, if I'm correct). I'm going to look at that specific problem and see how we can parse those affix rules.

Improve error messages for unsupported Hunspell formats
---
Key: LUCENE-3976
URL: https://issues.apache.org/jira/browse/LUCENE-3976
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Reporter: Chris Male
Attachments: LUCENE-3976.patch

Our hunspell implementation is never going to be able to support the huge variety of formats that are out there, especially since our impl is based on papers written on the topic rather than being a pure port. Recently we ran into the following suffix rule:
{noformat}SFX CA 0 /CaCp{noformat}
Due to the missing regex conditional, an AOE was being thrown, which made it difficult to diagnose the problem. We should instead try to provide better error messages showing what we were unable to parse.
[jira] [Created] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
Luca Cavanna created LUCENE-4019:
---
Summary: Parsing Hunspell affix rules without regexp condition
Key: LUCENE-4019
URL: https://issues.apache.org/jira/browse/LUCENE-4019
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna

We found out that some recent Dutch hunspell dictionaries contain suffix or prefix rules like the following:
{code}
SFX Na N 1
SFX Na 0 ste
{code}
The rule on the second line doesn't contain the 5th parameter, which should be the condition (usually a regexp). You usually see a '.' as the condition, meaning always (for every character). As explained in LUCENE-3976, the readAffix method throws an error. I wonder if we should treat the missing value as a kind of default value, like '.'. On the other hand, I haven't found any information about this within the spec. Any thoughts?
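The proposal in this issue, treating a missing condition as the default '.', can be illustrated with the short sketch below. It is an assumption-laden example rather than the eventual fix: `AffixCondition` and `conditionOf` are hypothetical names, and whether Hunspell itself applies such a default is exactly the open question raised above.

```java
// Hypothetical sketch, not Lucene's readAffix method.
final class AffixCondition {

    /**
     * Returns the 5th element of an affix rule line (the condition),
     * or "." (always matches) when the rule has no condition.
     */
    static String conditionOf(String ruleLine) {
        String[] parts = ruleLine.trim().split("\\s+");
        return parts.length > 4 ? parts[4] : ".";
    }
}
```

For the rule `SFX Na 0 ste` the helper yields `.`, so the suffix would apply unconditionally, while a rule that carries an explicit condition keeps it unchanged.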
[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition
[ https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260612#comment-13260612 ]

Luca Cavanna commented on LUCENE-4019:
---
Robert, by spec I meant exactly your links :) It's clear that the affix header has 4 elements while each rule has at least 5 elements. I don't really know what hunspell does with that kind of malformed rule. Lucene just throws an error while loading the dictionary. Looking at the hunspell source code, I might be wrong, but I suspect it just skips that specific rule with some warning. But honestly it's hard to believe that at least 4 dictionaries I tried contain mistaken rules, isn't it? I'll investigate more, thanks!

Parsing Hunspell affix rules without regexp condition
---
Key: LUCENE-4019
URL: https://issues.apache.org/jira/browse/LUCENE-4019
Project: Lucene - Java
Issue Type: Improvement
Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna

We found out that some recent Dutch hunspell dictionaries contain suffix or prefix rules like the following:
{code}
SFX Na N 1
SFX Na 0 ste
{code}
The rule on the second line doesn't contain the 5th parameter, which should be the condition (usually a regexp). You usually see a '.' as the condition, meaning always (for every character). As explained in LUCENE-3976, the readAffix method throws an error. I wonder if we should treat the missing value as a kind of default value, like '.'. On the other hand, I haven't found any information about this within the spec. Any thoughts?