[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-31 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852876#comment-16852876
 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/31/19 10:08 AM:


I updated the PR and addressed all the comments. Here are the latest benchmark 
results (with the bitset optimization disabled on both ends):
{noformat}
Report after iter 19:
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff
MedTerm  1510.74  (6.8%)  1457.20  (8.4%)  -3.5% ( -17% - 12%)
Fuzzy1  70.49  (8.5%)  68.11  (9.8%)  -3.4% ( -19% - 16%)
OrHighNotMed  650.57  (5.8%)  629.81  (6.0%)  -3.2% ( -14% - 9%)
OrHighLow  447.13  (4.2%)  433.05  (4.5%)  -3.2% ( -11% - 5%)
OrNotHighMed  623.22  (6.3%)  605.19  (6.1%)  -2.9% ( -14% - 10%)
OrHighNotLow  720.89  (7.0%)  701.26  (7.9%)  -2.7% ( -16% - 13%)
OrNotHighHigh  558.43  (6.3%)  544.82  (4.9%)  -2.4% ( -12% - 9%)
LowTerm  1279.34  (4.9%)  1248.60  (5.2%)  -2.4% ( -11% - 8%)
AndHighLow  690.75  (4.0%)  675.22  (5.3%)  -2.2% ( -11% - 7%)
LowPhrase  358.90  (2.3%)  351.28  (4.0%)  -2.1% ( -8% - 4%)
PKLookup  139.97  (3.0%)  137.32  (3.5%)  -1.9% ( -8% - 4%)
OrNotHighLow  728.48  (6.8%)  714.79  (6.5%)  -1.9% ( -14% - 12%)
HighTerm  1222.38  (6.3%)  1199.77  (7.1%)  -1.8% ( -14% - 12%)
AndHighHigh  58.93  (6.2%)  58.01  (5.8%)  -1.6% ( -12% - 11%)
Prefix3  152.21  (4.5%)  150.00  (5.0%)  -1.5% ( -10% - 8%)
IntNRQConjMedTerm  79.15  (10.7%)  78.06  (10.5%)  -1.4% ( -20% - 22%)
HighTermDayOfYearSort  95.28  (5.1%)  94.10  (7.8%)  -1.2% ( -13% - 12%)
Wildcard  64.23  (2.3%)  63.45  (2.3%)  -1.2% ( -5% - 3%)
MedSpanNear  81.15  (2.2%)  80.19  (2.8%)  -1.2% ( -6% - 3%)
HighSpanNear  10.20  (3.9%)  10.08  (4.2%)  -1.2% ( -8% - 7%)
HighIntervalsOrdered  4.07  (1.8%)  4.03  (2.2%)  -1.1% ( -4% - 2%)
LowSpanNear  41.62  (3.1%)  41.20  (3.6%)  -1.0% ( -7% - 5%)
IntNRQConjLowTerm  20.36  (4.1%)  20.15  (4.5%)  -1.0% ( -9% - 7%)
IntNRQConjHighTerm  64.84  (9.6%)  64.21  (9.4%)  -1.0% ( -18% - 19%)
AndHighMed  229.08  (2.8%)  227.00  (2.5%)  -0.9% ( -6% - 4%)
MedPhrase  18.73  (1.5%)  18.57  (2.3%)  -0.8% ( -4% - 2%)
LowSloppyPhrase  124.52  (2.3%)  123.48  (2.6%)  -0.8% ( -5% - 4%)
Respell  69.26  (3.0%)  68.68  (2.9%)  -0.8% ( -6% - 5%)
HighPhrase  12.98  (1.6%)  12.88  (2.2%)  -0.7% ( -4% - 3%)
PrefixConjLowTerm  42.11  (2.6%)  41.81  (3.0%)  -0.7% ( -6% - 5%)
OrHighNotHigh  680.34  (6.1%)  676.16  (7.6%)  -0.6% ( -13% - 13%)
MedSloppyPhrase  34.06  (4.9%)  33.89  (4.5%)  -0.5% ( -9% - 9%)
IntNRQ  89.97  (12.4%)  89.62  (12.0%)  -0.4% ( -22% - 27%)
HighSloppyPhrase  8.28  (4.0%)  8.25  (3.9%)  -0.3% ( -7% - 7%)
WildcardConjLowTerm  36.35  (2.7%)  36.26  (2.7%)  -0.3% ( -5% - 5%)
OrHighHigh  27.89  (2.6%)  27.85  (3.1%)  -0.1% ( -5% - 5%)
Fuzzy2  44.19  (3.8%)  44.17  (3.1%)  -0.1% ( -6% - 7%)
OrHighMed  90.42  (2.8%)  90.57  (2.8%)  0.2% ( -5% - 6%)
PrefixConjMedTerm  45.56  (2.8%)  45.79  (2.9%)  0.5% ( -5% - 6%)
WildcardConjHighTerm  33.08  (2.6%)  33.47  (3.0%)  1.2% ( -4% - 6%)
PrefixConjHighTerm  83.65  (2.6%)  86.23  (3.7%)  3.1% ( -3% - 9%)
HighTermMonthSort  130.35  (15.8%)  135.08  (12.1%)  3.6% ( -20% - 37%)
WildcardConjMedTerm  99.19  (3.6%)  103.37  (4.1%)  4.2% ( -3% - 12%)
{noformat}
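
For readers unfamiliar with the report format, the Pct diff column can be checked by hand. The sketch below is an illustration only: `pctDiff` is plain arithmetic on the two QPS columns, while `range` is an assumed reading of the bracketed interval (baseline and modified QPS each shifted by one standard deviation in opposite directions). That reading reproduces the MedTerm row, but it is not taken from luceneutil's source, and the class and method names are hypothetical.

```java
public final class PctDiffSketch {

  /** Percentage change of the modified QPS relative to the baseline QPS. */
  public static double pctDiff(double baseQps, double cmpQps) {
    return 100.0 * (cmpQps - baseQps) / baseQps;
  }

  /**
   * ASSUMPTION (not from luceneutil): the bracketed range compares the worst
   * modified run against the best baseline run, and the best modified run
   * against the worst baseline run, each one standard deviation from the mean.
   * Returns {worst, best} as rounded percentages.
   */
  public static long[] range(double baseQps, double baseSdPct,
                             double cmpQps, double cmpSdPct) {
    double baseBest = baseQps * (1 + baseSdPct / 100.0);
    double baseWorst = baseQps * (1 - baseSdPct / 100.0);
    double cmpBest = cmpQps * (1 + cmpSdPct / 100.0);
    double cmpWorst = cmpQps * (1 - cmpSdPct / 100.0);
    long worst = Math.round(100.0 * (cmpWorst - baseBest) / baseBest);
    long best = Math.round(100.0 * (cmpBest - baseWorst) / baseWorst);
    return new long[] {worst, best};
  }

  public static void main(String[] args) {
    // MedTerm row from the report: 1510.74 (6.8%) vs 1457.20 (8.4%)
    System.out.println(pctDiff(1510.74, 1457.20));
    long[] r = range(1510.74, 6.8, 1457.20, 8.4);
    System.out.println(r[0] + "% to " + r[1] + "%");
  }
}
```

For the MedTerm row this gives roughly -3.5% with a range of about -17% to 12%, so under this reading the measured difference sits well inside the run-to-run noise.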



[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-09 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836467#comment-16836467
 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:21 PM:
--

I have updated the PR after applying Yonik's suggestion and re-ran the benchmarks a 
few times. The run with the least noise had these results (note that I disabled 
the bitset optimization on both sides):
{noformat}
Report after iter 19:
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff
HighTerm  1575.07  (5.9%)  1541.27  (6.9%)  -2.1% ( -14% - 11%)
MedTerm  1363.22  (6.5%)  1337.03  (7.0%)  -1.9% ( -14% - 12%)
LowTerm  1441.86  (4.2%)  1420.77  (5.2%)  -1.5% ( -10% - 8%)
IntNRQConjMedTerm  280.55  (4.0%)  277.64  (4.1%)  -1.0% ( -8% - 7%)
MedPhrase  153.84  (3.5%)  152.44  (3.3%)  -0.9% ( -7% - 6%)
Prefix3  224.92  (4.0%)  223.13  (3.7%)  -0.8% ( -8% - 7%)
HighSloppyPhrase  19.70  (3.7%)  19.56  (4.5%)  -0.7% ( -8% - 7%)
MedSloppyPhrase  18.23  (4.3%)  18.11  (4.7%)  -0.7% ( -9% - 8%)
OrNotHighMed  586.33  (3.4%)  582.47  (4.9%)  -0.7% ( -8% - 7%)
LowSloppyPhrase  18.56  (3.6%)  18.46  (3.9%)  -0.5% ( -7% - 7%)
HighPhrase  22.64  (2.7%)  22.54  (3.0%)  -0.4% ( -6% - 5%)
LowPhrase  144.10  (3.8%)  143.55  (3.3%)  -0.4% ( -7% - 6%)
AndHighLow  539.26  (3.7%)  537.25  (3.2%)  -0.4% ( -7% - 6%)
PKLookup  132.96  (3.0%)  132.48  (4.6%)  -0.4% ( -7% - 7%)
OrHighMed  115.79  (2.7%)  115.49  (3.5%)  -0.3% ( -6% - 6%)
PrefixConjHighTerm  36.98  (2.8%)  36.93  (3.4%)  -0.1% ( -6% - 6%)
WildcardConjHighTerm  45.79  (3.0%)  45.73  (3.1%)  -0.1% ( -6% - 6%)
OrHighLow  448.91  (3.7%)  448.70  (6.3%)  -0.0% ( -9% - 10%)
Wildcard  78.89  (3.2%)  78.95  (3.6%)  0.1% ( -6% - 7%)
IntNRQConjHighTerm  78.35  (2.3%)  78.48  (2.4%)  0.2% ( -4% - 4%)
IntNRQ  100.56  (2.7%)  100.84  (2.8%)  0.3% ( -5% - 5%)
OrHighNotLow  732.45  (2.8%)  734.56  (5.3%)  0.3% ( -7% - 8%)
OrHighNotHigh  544.87  (2.8%)  546.47  (4.6%)  0.3% ( -6% - 7%)
IntNRQConjLowTerm  249.20  (4.2%)  249.99  (3.8%)  0.3% ( -7% - 8%)
Respell  73.05  (3.1%)  73.28  (3.4%)  0.3% ( -6% - 7%)
OrHighHigh  35.56  (3.0%)  35.68  (4.2%)  0.3% ( -6% - 7%)
OrNotHighLow  695.41  (4.8%)  697.88  (6.5%)  0.4% ( -10% - 12%)
MedSpanNear  59.99  (3.8%)  60.30  (4.0%)  0.5% ( -7% - 8%)
AndHighMed  190.02  (3.1%)  191.04  (3.6%)  0.5% ( -5% - 7%)
LowSpanNear  12.73  (3.9%)  12.81  (4.2%)  0.6% ( -7% - 8%)
HighTermDayOfYearSort  88.42  (7.0%)  89.09  (7.1%)  0.8% ( -12% - 15%)
PrefixConjLowTerm  54.95  (3.7%)  55.43  (3.8%)  0.9% ( -6% - 8%)
OrHighNotMed  628.44  (3.4%)  634.02  (6.1%)  0.9% ( -8% - 10%)
HighSpanNear  28.86  (3.2%)  29.11  (3.5%)  0.9% ( -5% - 7%)
WildcardConjMedTerm  72.48  (3.4%)  73.19  (4.8%)  1.0% ( -7% - 9%)
Fuzzy2  49.17  (9.9%)  49.68  (11.7%)  1.0% ( -18% - 25%)
AndHighHigh  63.44  (3.8%)  64.11  (3.8%)  1.1% ( -6% - 9%)
Fuzzy1  79.43  (9.9%)  80.55  (9.7%)  1.4% ( -16% - 23%)
OrNotHighHigh  574.89  (3.6%)  584.43  (5.5%)  1.7% ( -7% - 11%)
PrefixConjMedTerm  79.00  (3.2%)  80.50  (3.6%)  1.9% ( -4% - 8%)
WildcardConjLowTerm  90.67  (2.9%)  92.49  (3.7%)  2.0% ( -4% - 8%)
HighTermMonthSort  86.13  (11.8%)  88.79  (12.4%)  3.1% ( -18% - 30%)
{noformat}
I also ran benchmarks with the bitset optimization in place on both ends:

{noformat}
Report after iter 19:
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff
IntNRQ  63.46  (24.6%)  62.28  (24.2%)  -1.9% ( -40% - 62%)
{noformat}
 

[jira] [Comment Edited] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-09 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835542#comment-16835542
 ] 

Luca Cavanna edited comment on LUCENE-8796 at 5/9/19 3:22 PM:
--

I have made the change and used luceneutil to run some benchmarks. I 
opened a PR here: [https://github.com/apache/lucene-solr/pull/667] .

Luceneutil does not currently benchmark the queries that should be affected by 
this change, so I added benchmarks for numeric range queries, prefix queries 
and wildcard queries in conjunction with term queries (low, medium and high 
frequency). See the changes I made to my luceneutil fork: 
[https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions]
 .  Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to 
never call upgradeToBitSet (on both the baseline and the modified version), so 
that the updated code is exercised as much as possible during the benchmark 
runs. Otherwise, in many cases bitsets would be used instead and the changed 
code would not be exercised at all.

I ran the wikimedium10m benchmarks a few times. Here is probably the run with 
the least noise; the results show a small improvement for some queries and no 
regressions in general:
  

 
{noformat}
 Report after iter 19:
 Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff
 WildcardConjMedTerm 75.49 (2.2%) 72.79 (2.0%) -3.6% ( -7% - 0%)
 OrHighNotMed 607.01 (5.7%) 593.10 (4.4%) -2.3% ( -11% - 8%)
 WildcardConjHighTerm 64.00 (1.7%) 62.55 (1.4%) -2.3% ( -5% - 0%)
 Fuzzy2 20.14 (3.4%) 19.72 (4.6%) -2.1% ( -9% - 6%)
 HighTerm 1174.41 (4.7%) 1150.11 (4.2%) -2.1% ( -10% - 7%)
 OrHighLow 483.40 (5.1%) 473.69 (6.9%) -2.0% ( -13% - 10%)
 OrNotHighLow 526.75 (3.6%) 516.47 (3.6%) -2.0% ( -8% - 5%)
 OrNotHighHigh 600.38 (4.9%) 590.21 (3.7%) -1.7% ( -9% - 7%)
 HighTermMonthSort 110.05 (11.7%) 108.58 (11.5%) -1.3% ( -21% - 24%)
 OrHighMed 107.83 (2.6%) 106.48 (4.7%) -1.3% ( -8% - 6%)
 PrefixConjMedTerm 56.98 (2.5%) 56.33 (1.7%) -1.1% ( -5% - 3%)
 AndHighLow 432.27 (3.6%) 427.46 (3.2%) -1.1% ( -7% - 5%)
 PrefixConjLowTerm 44.43 (2.8%) 43.98 (1.8%) -1.0% ( -5% - 3%)
 MedTerm 1409.97 (5.5%) 1396.33 (4.9%) -1.0% ( -10% - 9%)
 HighSloppyPhrase 11.98 (4.3%) 11.87 (5.1%) -0.9% ( -9% - 8%)
 OrNotHighMed 614.19 (4.6%) 608.74 (3.8%) -0.9% ( -8% - 7%)
 Respell 58.11 (2.4%) 57.61 (2.4%) -0.9% ( -5% - 3%)
 LowTerm 1342.33 (4.8%) 1330.86 (4.0%) -0.9% ( -9% - 8%)
 PrefixConjHighTerm 68.50 (2.9%) 67.93 (1.8%) -0.8% ( -5% - 3%)
 OrHighNotHigh 566.30 (5.2%) 561.88 (4.5%) -0.8% ( -9% - 9%)
 WildcardConjLowTerm 32.75 (2.5%) 32.56 (2.1%) -0.6% ( -5% - 4%)
 PKLookup 131.80 (2.4%) 131.28 (2.3%) -0.4% ( -5% - 4%)
 OrHighHigh 29.90 (3.4%) 29.79 (5.3%) -0.4% ( -8% - 8%)
 OrHighNotLow 497.65 (6.6%) 495.84 (5.2%) -0.4% ( -11% - 12%)
 AndHighMed 175.08 (3.5%) 174.58 (3.0%) -0.3% ( -6% - 6%)
 LowSpanNear 15.17 (1.8%) 15.13 (2.5%) -0.2% ( -4% - 4%)
 Fuzzy1 71.14 (5.9%) 70.97 (6.3%) -0.2% ( -11% - 12%)
 LowSloppyPhrase 35.23 (2.0%) 35.16 (2.6%) -0.2% ( -4% - 4%)
 LowPhrase 74.10 (1.7%) 73.98 (1.8%) -0.2% ( -3% - 3%)
 HighPhrase 34.18 (2.1%) 34.13 (2.0%) -0.1% ( -4% - 3%)
 Prefix3 45.33 (2.3%) 45.28 (2.1%) -0.1% ( -4% - 4%)
 MedPhrase 28.30 (2.1%) 28.27 (1.7%) -0.1% ( -3% - 3%)
 MedSloppyPhrase 6.80 (3.6%) 6.80 (3.2%) -0.0% ( -6% - 6%)
 AndHighHigh 53.79 (3.9%) 53.79 (4.0%) -0.0% ( -7% - 8%)
 MedSpanNear 61.78 (2.2%) 61.83 (1.7%) 0.1% ( -3% - 4%)
 Wildcard 37.83 (2.5%) 37.91 (1.7%) 0.2% ( -3% - 4%)
 IntNRQConjHighTerm 20.17 (3.8%) 20.24 (4.9%) 0.3% ( -8% - 9%)
 HighTermDayOfYearSort 53.55 (7.8%) 53.76 (7.3%) 0.4% ( -13% - 16%)
 HighSpanNear 5.39 (2.6%) 5.42 (2.6%) 0.5% ( -4% - 5%)
 IntNRQConjLowTerm 19.69 (4.3%) 19.86 (4.3%) 0.9% ( -7% - 9%)
 IntNRQConjMedTerm 15.93 (4.5%) 16.12 (5.4%) 1.2% ( -8% - 11%)
 IntNRQ 114.28 (10.3%) 116.41 (14.0%) 1.9% ( -20% - 29%)
 {noformat}
 



[jira] [Commented] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-08 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835628#comment-16835628
 ] 

Luca Cavanna commented on LUCENE-8796:
--

You are right, [~ysee...@gmail.com]. I will make that change and re-run the 
benchmarks.

> Use exponential search in IntArrayDocIdSet advance method
> -
>
> Key: LUCENE-8796
> URL: https://issues.apache.org/jira/browse/LUCENE-8796
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
>
> Chatting with [~jpountz] , he suggested improving IntArrayDocIdSet by making 
> its advance method use exponential search instead of binary search. This 
> should help the performance of queries that include conjunctions: since 
> ConjunctionDISI uses leap frog, it advances through doc ids in small steps, 
> so on average exponential search should advance faster than binary search.
>  
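
To make the idea in the description concrete, here is a minimal sketch of an exponential (galloping) search used as an advance step, together with a leap-frog intersection that calls it. It works on plain sorted `int[]` arrays rather than on Lucene's classes; `ExponentialAdvanceSketch`, `advanceIndex` and `intersect` are hypothetical names chosen for illustration, and this is not the code from the PR.

```java
import java.util.Arrays;

public final class ExponentialAdvanceSketch {

  /**
   * Returns the index of the first entry in docs[from, length) that is
   * >= target, or length if there is none. Assumes docs is sorted, has no
   * duplicates, and 0 <= from <= length. The doubling step is not guarded
   * against int overflow, which is fine for the small arrays used here.
   */
  public static int advanceIndex(int[] docs, int length, int from, int target) {
    int bound = 1;
    // Gallop: double the step until we reach or pass the target, or run off the end.
    while (from + bound < length && docs[from + bound] < target) {
      bound *= 2;
    }
    // The answer now lies between the last probe known to be too small
    // (from + bound / 2) and the probe that stopped the loop (from + bound).
    int lo = from + bound / 2;
    int hi = Math.min(from + bound + 1, length);
    int idx = Arrays.binarySearch(docs, lo, hi, target);
    return idx >= 0 ? idx : -1 - idx; // a negative result encodes the insertion point
  }

  /** Leap-frog intersection: each side repeatedly advances to the other's current doc. */
  public static int[] intersect(int[] a, int[] b) {
    int[] out = new int[Math.min(a.length, b.length)];
    int n = 0, i = 0, j = 0;
    while (i < a.length && j < b.length) {
      if (a[i] == b[j]) {
        out[n++] = a[i];
        i++;
        j++;
      } else if (a[i] < b[j]) {
        i = advanceIndex(a, a.length, i, b[j]);
      } else {
        j = advanceIndex(b, b.length, j, a[i]);
      }
    }
    return Arrays.copyOf(out, n);
  }

  public static void main(String[] args) {
    int[] a = {1, 3, 5, 7, 9, 11, 40, 41, 100};
    int[] b = {3, 4, 9, 40, 100, 200};
    System.out.println(Arrays.toString(intersect(a, b))); // prints [3, 9, 40, 100]
  }
}
```

When the target is only a few entries ahead of the current position, the doubling loop stops after one or two probes and the final binary search covers a tiny window, whereas a binary search over the whole remaining array costs about log2(n) probes regardless of how close the target is.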



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-08 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835542#comment-16835542
 ] 

Luca Cavanna commented on LUCENE-8796:
--

I have made the change and used luceneutil to run some benchmarks. I 
opened a PR here: https://github.com/apache/lucene-solr/pull/667 .

Luceneutil does not currently benchmark the queries that should be affected by 
this change, so I added benchmarks for numeric range queries, prefix queries 
and wildcard queries in conjunction with term queries (low, medium and high 
frequency). See the changes I made to my luceneutil fork: 
[https://github.com/mikemccand/luceneutil/compare/master...javanna:conjunctions]
 .  Also, for the benchmarks I temporarily modified DocIdSetBuilder#grow to 
never call upgradeToBitSet (on both the baseline and the modified version), so 
that the updated code is exercised as much as possible during the benchmark 
runs. Otherwise, in many cases bitsets would be used instead and the changed 
code would not be exercised at all.

I ran the wikimedium10m benchmarks a few times. Here is probably the run with 
the least noise; the results show a small improvement for some queries and no 
regressions in general:
 
Report after iter 19:
 Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff
 WildcardConjMedTerm 75.49 (2.2%) 72.79 (2.0%) -3.6% ( -7% - 0%)
 OrHighNotMed 607.01 (5.7%) 593.10 (4.4%) -2.3% ( -11% - 8%)
 WildcardConjHighTerm 64.00 (1.7%) 62.55 (1.4%) -2.3% ( -5% - 0%)
 Fuzzy2 20.14 (3.4%) 19.72 (4.6%) -2.1% ( -9% - 6%)
 HighTerm 1174.41 (4.7%) 1150.11 (4.2%) -2.1% ( -10% - 7%)
 OrHighLow 483.40 (5.1%) 473.69 (6.9%) -2.0% ( -13% - 10%)
 OrNotHighLow 526.75 (3.6%) 516.47 (3.6%) -2.0% ( -8% - 5%)
 OrNotHighHigh 600.38 (4.9%) 590.21 (3.7%) -1.7% ( -9% - 7%)
 HighTermMonthSort 110.05 (11.7%) 108.58 (11.5%) -1.3% ( -21% - 24%)
 OrHighMed 107.83 (2.6%) 106.48 (4.7%) -1.3% ( -8% - 6%)
 PrefixConjMedTerm 56.98 (2.5%) 56.33 (1.7%) -1.1% ( -5% - 3%)
 AndHighLow 432.27 (3.6%) 427.46 (3.2%) -1.1% ( -7% - 5%)
 PrefixConjLowTerm 44.43 (2.8%) 43.98 (1.8%) -1.0% ( -5% - 3%)
 MedTerm 1409.97 (5.5%) 1396.33 (4.9%) -1.0% ( -10% - 9%)
 HighSloppyPhrase 11.98 (4.3%) 11.87 (5.1%) -0.9% ( -9% - 8%)
 OrNotHighMed 614.19 (4.6%) 608.74 (3.8%) -0.9% ( -8% - 7%)
 Respell 58.11 (2.4%) 57.61 (2.4%) -0.9% ( -5% - 3%)
 LowTerm 1342.33 (4.8%) 1330.86 (4.0%) -0.9% ( -9% - 8%)
 PrefixConjHighTerm 68.50 (2.9%) 67.93 (1.8%) -0.8% ( -5% - 3%)
 OrHighNotHigh 566.30 (5.2%) 561.88 (4.5%) -0.8% ( -9% - 9%)
 WildcardConjLowTerm 32.75 (2.5%) 32.56 (2.1%) -0.6% ( -5% - 4%)
 PKLookup 131.80 (2.4%) 131.28 (2.3%) -0.4% ( -5% - 4%)
 OrHighHigh 29.90 (3.4%) 29.79 (5.3%) -0.4% ( -8% - 8%)
 OrHighNotLow 497.65 (6.6%) 495.84 (5.2%) -0.4% ( -11% - 12%)
 AndHighMed 175.08 (3.5%) 174.58 (3.0%) -0.3% ( -6% - 6%)
 LowSpanNear 15.17 (1.8%) 15.13 (2.5%) -0.2% ( -4% - 4%)
 Fuzzy1 71.14 (5.9%) 70.97 (6.3%) -0.2% ( -11% - 12%)
 LowSloppyPhrase 35.23 (2.0%) 35.16 (2.6%) -0.2% ( -4% - 4%)
 LowPhrase 74.10 (1.7%) 73.98 (1.8%) -0.2% ( -3% - 3%)
 HighPhrase 34.18 (2.1%) 34.13 (2.0%) -0.1% ( -4% - 3%)
 Prefix3 45.33 (2.3%) 45.28 (2.1%) -0.1% ( -4% - 4%)
 MedPhrase 28.30 (2.1%) 28.27 (1.7%) -0.1% ( -3% - 3%)
 MedSloppyPhrase 6.80 (3.6%) 6.80 (3.2%) -0.0% ( -6% - 6%)
 AndHighHigh 53.79 (3.9%) 53.79 (4.0%) -0.0% ( -7% - 8%)
 MedSpanNear 61.78 (2.2%) 61.83 (1.7%) 0.1% ( -3% - 4%)
 Wildcard 37.83 (2.5%) 37.91 (1.7%) 0.2% ( -3% - 4%)
 IntNRQConjHighTerm 20.17 (3.8%) 20.24 (4.9%) 0.3% ( -8% - 9%)
 HighTermDayOfYearSort 53.55 (7.8%) 53.76 (7.3%) 0.4% ( -13% - 16%)
 HighSpanNear 5.39 (2.6%) 5.42 (2.6%) 0.5% ( -4% - 5%)
 IntNRQConjLowTerm 19.69 (4.3%) 19.86 (4.3%) 0.9% ( -7% - 9%)
 IntNRQConjMedTerm 15.93 (4.5%) 16.12 (5.4%) 1.2% ( -8% - 11%)
 IntNRQ 114.28 (10.3%) 116.41 (14.0%) 1.9% ( -20% - 29%)

 

 

> Use exponential search in IntArrayDocIdSet advance method
> -
>
> Key: LUCENE-8796
> URL: https://issues.apache.org/jira/browse/LUCENE-8796
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
>
> In a chat, [~jpountz] suggested improving IntArrayDocIdSet by making its
> advance method use exponential search instead of binary search. This should
> help the performance of queries that include conjunctions: since
> ConjunctionDISI uses leapfrogging, it advances through doc ids in small
> steps, so exponential search should on average be faster than binary search
> when advancing.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8796) Use exponential search in IntArrayDocIdSet advance method

2019-05-08 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-8796:


 Summary: Use exponential search in IntArrayDocIdSet advance method
 Key: LUCENE-8796
 URL: https://issues.apache.org/jira/browse/LUCENE-8796
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna


In a chat, [~jpountz] suggested improving IntArrayDocIdSet by making its
advance method use exponential search instead of binary search. This should
help the performance of queries that include conjunctions: since
ConjunctionDISI uses leapfrogging, it advances through doc ids in small steps,
so exponential search should on average be faster than binary search when
advancing.
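
To illustrate the idea, here is a minimal sketch of an exponential-search advance over a sorted int array. It is not the code from the PR: the class name, the method signature and the explicit length parameter are assumptions made for the example. The search doubles its step from the current position until it overshoots the target, then finishes with a binary search over the last interval, so a short jump costs only a few comparisons.

```java
import java.util.Arrays;

final class ExponentialSearchSketch {

  /**
   * Returns the index of the first element in docs[from, length) that is
   * >= target, or length if there is none. docs must be sorted in ascending
   * order without duplicates, and from must be <= length.
   */
  static int advance(int[] docs, int length, int from, int target) {
    // Gallop: double the step until we reach an element >= target
    // or run past the end of the used part of the array.
    int bound = 1;
    while (from + bound < length && docs[from + bound] < target) {
      bound *= 2;
    }
    // The answer now lies between from + bound/2 and from + bound (inclusive),
    // clipped to length; finish with a binary search over that window.
    int lo = from + bound / 2;
    int hi = Math.min(from + bound + 1, length);
    int idx = Arrays.binarySearch(docs, lo, hi, target);
    // A negative result encodes the insertion point as -(insertionPoint) - 1.
    return idx >= 0 ? idx : -1 - idx;
  }

  public static void main(String[] args) {
    int[] docs = {1, 3, 5, 7, 9, 11};
    System.out.println(advance(docs, docs.length, 0, 7));
    System.out.println(advance(docs, docs.length, 3, 8));
  }
}
```

When the target is close to the current position, the galloping loop stops after one or two iterations, which is the case the leapfrogging argument above relies on. A plain binary search over the whole remaining array would always pay the full logarithmic cost.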

 






[jira] [Resolved] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2019-02-26 Thread Luca Cavanna (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna resolved LUCENE-5718.
--
Resolution: Duplicate

This will be addressed with LUCENE-3401, which is being worked on.

> More flexible compound queries (containing mtq) support in postings 
> highlighter
> ---
>
> Key: LUCENE-5718
> URL: https://issues.apache.org/jira/browse/LUCENE-5718
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Affects Versions: 4.8.1
>Reporter: Luca Cavanna
>Priority: Major
> Attachments: LUCENE-5718.patch
>
>
> The postings highlighter currently pulls the automata from multi term queries 
> and doesn't require calling rewrite to make highlighting work. In order to do 
> so it also needs to check whether the query is a compound one and, if so, 
> extract its subqueries. This is currently done in the MultiTermHighlighting 
> class and works well but has two potential problems:
> 1) not all the possible compound queries are necessarily supported, as we 
> need to go over each of them one by one (see LUCENE-5717), and this requires 
> keeping the "switch" up-to-date if new queries get added to Lucene
> 2) it doesn't support custom compound queries but only the set of queries 
> available out-of-the-box
> I've been thinking about how this can be improved and one of the ideas I came 
> up with is to introduce a generic way to retrieve the subqueries from 
> compound queries, for instance a new abstract base class with a getLeaves or 
> getSubQueries method that all the compound queries extend. This method would 
> return a flat array of all the leaf queries that the compound query is made 
> of. 
> Not sure whether this would be needed in other places in Lucene, but it 
> doesn't seem like a small change and it would definitely affect (or benefit?) 
> more than just the postings highlighter support for multi term queries.
> In particular the second problem (custom queries) seems hard to solve without 
> a way to expose this info directly from the query, unless we want to make the 
> MultiTermHighlighting#extractAutomata method extensible in some way.
> I would like to hear what people think, and to work on this as soon as we 
> have identified which direction we want to take.






[jira] [Commented] (LUCENE-8664) Add equals/hashcode to TotalHits

2019-01-29 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16755016#comment-16755016
 ] 

Luca Cavanna commented on LUCENE-8664:
--

I am not using TotalHits in a map. I would benefit from the equals method for 
comparisons in tests. For instance in Elasticsearch we return the lucene 
TotalHits to users as part of bigger objects that have their own equals method. 
We end up wrapping TotalHits into another internal class that has its own 
equals/hashcode (among others). Having equals/hashcode built into Lucene
would remove the need for a wrapper class and make equality comparisons a
one-liner, especially when comparing multiple instances of objects holding
TotalHits. This is obviously a minor thing, but I don't think it would be a
bug to consider two different TotalHits instances that have the same value and
relation equal. I was chatting with [~jim.ferenczi] about this and we thought
we should propose adding this to Lucene. Happy to close this if you think it
should not be done.
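
As an illustration of what is being proposed, here is a minimal sketch of value-based equals/hashCode on a small class holding a value and a relation. HitCountSketch and its Relation enum are hypothetical stand-ins written for this example so that it runs without Lucene on the classpath. It is not the code from the PR.

```java
import java.util.Objects;

final class HitCountSketch {

  /** Hypothetical stand-in for the relation between the value and the real hit count. */
  enum Relation { EQUAL_TO, GREATER_THAN_OR_EQUAL_TO }

  final long value;
  final Relation relation;

  HitCountSketch(long value, Relation relation) {
    this.value = value;
    this.relation = Objects.requireNonNull(relation);
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    HitCountSketch that = (HitCountSketch) o;
    // Two instances are equal when both the value and the relation match.
    return value == that.value && relation == that.relation;
  }

  @Override
  public int hashCode() {
    // Consistent with equals: built from the same two fields.
    return Objects.hash(value, relation);
  }

  public static void main(String[] args) {
    HitCountSketch a = new HitCountSketch(10, Relation.EQUAL_TO);
    HitCountSketch b = new HitCountSketch(10, Relation.EQUAL_TO);
    System.out.println(a.equals(b));
  }
}
```

With something like this in place, an object that holds such a field can compare it with a single equals call in its own equals method, which is the one-liner mentioned above.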

> Add equals/hashcode to TotalHits
> 
>
> Key: LUCENE-8664
> URL: https://issues.apache.org/jira/browse/LUCENE-8664
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Priority: Minor
>
> I think it would be convenient to add equals/hashcode methods to the 
> TotalHits class. I opened a PR here: 
> [https://github.com/apache/lucene-solr/pull/552] .






[jira] [Created] (LUCENE-8664) Add equals/hashcode to TotalHits

2019-01-29 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-8664:


 Summary: Add equals/hashcode to TotalHits
 Key: LUCENE-8664
 URL: https://issues.apache.org/jira/browse/LUCENE-8664
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna


I think it would be convenient to add equals/hashcode methods to the TotalHits 
class. I opened a PR here: [https://github.com/apache/lucene-solr/pull/552] .






[jira] [Created] (LUCENE-8591) LegacyBM25Similarity doesn't expose getDiscountOverlaps

2018-12-05 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-8591:


 Summary: LegacyBM25Similarity doesn't expose getDiscountOverlaps
 Key: LUCENE-8591
 URL: https://issues.apache.org/jira/browse/LUCENE-8591
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna
Assignee: Luca Cavanna


When I worked on LUCENE-8563 I intended to expose all the needed public methods 
that BM25Similarity exposes, but I forgot to add getDiscountOverlaps.






[jira] [Commented] (LUCENE-8591) LegacyBM25Similarity doesn't expose getDiscountOverlaps

2018-12-05 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710228#comment-16710228
 ] 

Luca Cavanna commented on LUCENE-8591:
--

I just opened [https://github.com/apache/lucene-solr/pull/514] 

> LegacyBM25Similarity doesn't expose getDiscountOverlaps
> ---
>
> Key: LUCENE-8591
> URL: https://issues.apache.org/jira/browse/LUCENE-8591
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Luca Cavanna
>Assignee: Luca Cavanna
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I worked on LUCENE-8563 I intended to expose all the needed public 
> methods that BM25Similarity exposes, but I forgot to add getDiscountOverlaps.






[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-29 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16703382#comment-16703382
 ] 

Luca Cavanna commented on LUCENE-8563:
--

I updated the PR according to the latest comments, and deprecated the newly 
introduced similarity as Robert suggested.

> Remove k1+1 from the numerator of  BM25Similarity
> -
>
> Key: LUCENE-8563
> URL: https://issues.apache.org/jira/browse/LUCENE-8563
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Our current implementation of BM25 does
> {code:java}
> boost * IDF * (k1+1) * tf / (tf + norm)
> {code}
> As (k1+1) is a constant, it is the same for every term and doesn't modify 
> ordering. It is often omitted, and I found out that the paper "The 
> Probabilistic Relevance Framework: BM25 and Beyond" by Robertson (BM25's 
> author) and Zaragoza even describes adding (k1+1) to the numerator as a 
> variant whose benefit is to be more comparable with Robertson/Sparck-Jones 
> weighting, which we don't care about.
> {quote}A common variant is to add a (k1 + 1) component to the
>  numerator of the saturation function. This is the same for all
>  terms, and therefore does not affect the ranking produced.
>  The reason for including it was to make the final formula
>  more compatible with the RSJ weight used on its own
> {quote}
> Should we remove it from BM25Similarity as well?
> A side-effect that I'm interested in is that integrating other score 
> contributions (eg. via oal.document.FeatureField) would be a bit easier to 
> reason about. For instance a weight of 3 in FeatureField#newSaturationQuery 
> would have a similar impact as a term whose IDF is 3 (and thus docFreq ~= 5%) 
> rather than a term whose IDF is 3/(k1 + 1).
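
To make the argument in the description concrete, here is a small sketch that evaluates the quoted formula with and without the (k1+1) factor. The class and method names are made up for the example and boost is taken to be 1. The docFreq estimate additionally assumes the idf formula ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) and a large docCount. That formula is not stated in the issue text, so treat it as an assumption.

```java
final class Bm25NumeratorSketch {

  /** Term score as in the quoted formula, with boost assumed to be 1. */
  static double scoreWithK1(double idf, double tf, double norm, double k1) {
    return idf * (k1 + 1) * tf / (tf + norm);
  }

  /** The same score with the constant (k1 + 1) factor dropped. */
  static double scoreWithoutK1(double idf, double tf, double norm) {
    return idf * tf / (tf + norm);
  }

  /**
   * Approximate docFreq / docCount for a given idf, assuming
   * idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
   * and a docCount large enough that the 0.5 and 1 terms are negligible.
   */
  static double approxDocFreqRatio(double idf) {
    return Math.exp(-idf);
  }

  public static void main(String[] args) {
    double k1 = 1.2;
    double norm = 1.5;
    // Two documents with different term frequencies for the same term.
    double highTf = scoreWithoutK1(2.0, 3.0, norm);
    double lowTf = scoreWithoutK1(2.0, 1.0, norm);
    // The two variants differ only by the constant factor (k1 + 1),
    // so the relative order of the two documents is the same.
    System.out.println(highTf > lowTf);
    System.out.println(scoreWithK1(2.0, 3.0, norm, k1) > scoreWithK1(2.0, 1.0, norm, k1));
    // Share of documents containing a term whose idf is 3.
    System.out.println(approxDocFreqRatio(3.0));
  }
}
```

Since every term score is multiplied by the same positive constant, dropping it rescales all scores without reordering them. Under the stated idf assumption, an idf of 3 corresponds to exp(-3), which is roughly 5% of documents.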






[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-28 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702294#comment-16702294
 ] 

Luca Cavanna commented on LUCENE-8563:
--

I opened [https://github.com/apache/lucene-solr/pull/511] . 




[jira] [Commented] (LUCENE-8563) Remove k1+1 from the numerator of BM25Similarity

2018-11-14 Thread Luca Cavanna (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686370#comment-16686370
 ] 

Luca Cavanna commented on LUCENE-8563:
--

Hi folks, I would like to work on this issue.




[jira] [Created] (LUCENE-6485) Add a custom separator break iterator

2015-05-15 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-6485:


 Summary: Add a custom separator break iterator
 Key: LUCENE-6485
 URL: https://issues.apache.org/jira/browse/LUCENE-6485
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna


Lucene currently includes a WholeBreakIterator used to highlight entire fields 
using the postings highlighter, without breaking their content into sentences.

I would like to contribute a CustomSeparatorBreakIterator that breaks when a 
custom char separator is found in the text. This can be used for instance when 
wanting to highlight entire fields, value per value. One can subclass 
PostingsHighlighter and have getMultiValueSeparator return a control character, 
like U+ , then use the custom break iterator to break on U+ so that one 
snippet per value will be generated.
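
As a rough illustration of the breaking rule only, the sketch below computes the boundary offsets that such an iterator would report: the start of the text, the position right after each separator, and the end of the text. It is not the attached patch and does not extend java.text.BreakIterator. The class and method names are made up, and the '|' separator in main is just a readable placeholder for whatever control character is actually used.

```java
import java.util.ArrayList;
import java.util.List;

final class SeparatorBoundariesSketch {

  /**
   * Returns the boundary offsets for text: 0, the offset just after every
   * occurrence of separator, and text.length() if not already included.
   */
  static List<Integer> boundaries(String text, char separator) {
    List<Integer> result = new ArrayList<>();
    result.add(0);
    for (int i = 0; i < text.length(); i++) {
      if (text.charAt(i) == separator) {
        // A passage ends right after the separator character.
        result.add(i + 1);
      }
    }
    // Close the last passage unless the text ended with a separator.
    int last = result.get(result.size() - 1);
    if (last != text.length()) {
      result.add(text.length());
    }
    return result;
  }

  public static void main(String[] args) {
    // Three values joined by the placeholder separator.
    System.out.println(boundaries("a|bb|c", '|'));
  }
}
```

Each pair of consecutive offsets delimits one value, so a highlighter that treats these spans as passages would produce one snippet per value.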






[jira] [Updated] (LUCENE-6485) Add a custom separator break iterator

2015-05-15 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-6485:
-
Attachment: LUCENE-6485.patch

Patch attached

 Add a custom separator break iterator
 -

 Key: LUCENE-6485
 URL: https://issues.apache.org/jira/browse/LUCENE-6485
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Luca Cavanna
 Attachments: LUCENE-6485.patch


 Lucene currently includes a WholeBreakIterator used to highlight entire 
 fields using the postings highlighter, without breaking their content into 
 sentences.
 I would like to contribute a CustomSeparatorBreakIterator that breaks when a 
 custom char separator is found in the text. This can be used for instance 
 when wanting to highlight entire fields, value per value. One can subclass 
 PostingsHighlighter and have getMultiValueSeparator return a control 
 character, like U+ , then use the custom break iterator to break on 
 U+ so that one snippet per value will be generated.






[jira] [Commented] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2015-05-15 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14546095#comment-14546095
 ] 

Luca Cavanna commented on LUCENE-5718:
--

Was wondering if there is interest around this patch, or maybe there are better 
solutions by now for the custom compound queries usecase? I can revive the 
patch if needed, lemme know what you think.




[jira] [Updated] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2014-06-01 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5718:
-

Attachment: LUCENE-5718.patch

I like your extractQueries idea. I gave it a shot, patch attached.

The main difference compared to extractTerms is that it adds the query itself 
to the list by default instead of throwing UnsupportedOperationException. Also, 
I think this one doesn't necessarily require calling rewrite (not totally sure 
though). I overrode the extractQueries method for all the queries that contain 
one or more sub-queries. Let's see if that's too many or if I missed any... you 
tell me ;)
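
For readers without the patch at hand, the shape of the idea can be sketched as follows. This is not the attached patch: Query and CompoundQuery here are hypothetical stand-in classes, not Lucene's, and only the default-adds-itself behaviour described above is modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExtractQueriesSketch {

  /** Hypothetical stand-in for a query base class. */
  static class Query {
    final String name;

    Query(String name) {
      this.name = name;
    }

    /** Default behaviour: a query without sub-queries adds itself. */
    void extractQueries(List<Query> out) {
      out.add(this);
    }
  }

  /** Hypothetical compound query that delegates to its clauses. */
  static class CompoundQuery extends Query {
    final List<Query> clauses;

    CompoundQuery(String name, Query... clauses) {
      super(name);
      this.clauses = Arrays.asList(clauses);
    }

    @Override
    void extractQueries(List<Query> out) {
      // Recurse so that nested compound queries are flattened too.
      for (Query clause : clauses) {
        clause.extractQueries(out);
      }
    }
  }

  /** Collects the names of all leaf queries, in clause order. */
  static List<String> leafNames(Query query) {
    List<Query> leaves = new ArrayList<>();
    query.extractQueries(leaves);
    List<String> names = new ArrayList<>();
    for (Query leaf : leaves) {
      names.add(leaf.name);
    }
    return names;
  }

  public static void main(String[] args) {
    Query nested = new CompoundQuery("outer",
        new Query("a"),
        new CompoundQuery("inner", new Query("b"), new Query("c")));
    System.out.println(leafNames(nested));
  }
}
```

Because the default implementation adds the query itself, a custom compound query only has to override one method to take part, and a caller no longer needs a hard-coded list of known compound query types.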




[jira] [Commented] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2014-06-01 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14014994#comment-14014994
 ] 

Luca Cavanna commented on LUCENE-5718:
--

Also worth mentioning that my patch addresses only the compound queries 
usecase. It leaves the automaton related work for the different multi term 
queries as it is (in MultiTermHighlighting).




[jira] [Created] (LUCENE-5717) Postings highlighter support for multi term queries within filtered and constant score queries

2014-05-30 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-5717:


 Summary: Postings highlighter support for multi term queries 
within filtered and constant score queries
 Key: LUCENE-5717
 URL: https://issues.apache.org/jira/browse/LUCENE-5717
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna


The automata extraction that is done to make multi term queries work with the 
postings highlighter does support boolean queries but it should also support 
other compound queries like filtered and constant score.






[jira] [Updated] (LUCENE-5717) Postings highlighter support for multi term queries within filtered and constant score queries

2014-05-30 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5717:
-

Attachment: LUCENE-5717.patch

First patch attached. At this time there's no generic way to retrieve 
sub-queries from compound queries, thus I could only add two more ifs to the 
existing extractAutomata method. Maybe it's worth discussing if there's a way 
to make this more generic in a separate issue. I'm also not sure whether there 
are other compound queries that I missed.

 Postings highlighter support for multi term queries within filtered and 
 constant score queries
 --

 Key: LUCENE-5717
 URL: https://issues.apache.org/jira/browse/LUCENE-5717
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna
 Attachments: LUCENE-5717.patch


 The automata extraction that is done to make multi term queries work with the 
 postings highlighter does support boolean queries but it should also support 
 other compound queries like filtered and constant score.






[jira] [Created] (LUCENE-5718) More flexible compound queries (containing mtq) support in postings highlighter

2014-05-30 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-5718:


 Summary: More flexible compound queries (containing mtq) support 
in postings highlighter
 Key: LUCENE-5718
 URL: https://issues.apache.org/jira/browse/LUCENE-5718
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.8.1
Reporter: Luca Cavanna


The postings highlighter currently pulls the automata from multi term queries 
and doesn't require calling rewrite to make highlighting work. In order to do 
so it also needs to check whether the query is a compound one and, if so, 
extract its subqueries. This is currently done in the MultiTermHighlighting 
class and works well but has two potential problems:

1) not all the possible compound queries are necessarily supported, as we need 
to go over each of them one by one (see LUCENE-5717), and this requires keeping 
the switch up-to-date if new queries get added to Lucene
2) it doesn't support custom compound queries but only the set of queries 
available out-of-the-box

I've been thinking about how this can be improved and one of the ideas I came 
up with is to introduce a generic way to retrieve the subqueries from compound 
queries, for instance a new abstract base class with a getLeaves or 
getSubQueries method that all the compound queries extend. This method would 
return a flat array of all the leaf queries that the compound query is made of. 

Not sure whether this would be needed in other places in Lucene, but it doesn't 
seem like a small change and it would definitely affect (or benefit?) more than 
just the postings highlighter support for multi term queries.

In particular the second problem (custom queries) seems hard to solve without a 
way to expose this info directly from the query, unless we want to make the 
MultiTermHighlighting#extractAutomata method extensible in some way.

I would like to hear what people think, and to work on this as soon as we have 
identified which direction we want to take.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-09-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765400#comment-13765400
 ] 

Luca Cavanna commented on LUCENE-4906:
--

How about committing this? Would be great to have it with the next release!

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch


 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-09-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765726#comment-13765726
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Thanks Mike!

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 5.0, 4.6

 Attachments: LUCENE-4906.patch, LUCENE-4906.patch, LUCENE-4906.patch





[jira] [Commented] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-09-01 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755677#comment-13755677
 ] 

Luca Cavanna commented on LUCENE-5057:
--

Hi Lukas,
can you share your findings? In my case it seemed to be a dictionary problem, 
but I'm curious to hear what you experienced.




 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna
Assignee: Adrien Grand

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token. I'm not sure though if it's 
 possible to have only a single stem indexed, which would be even better in my 
 opinion. When I look at how snowball works only one token is indexed, the 
 stem, and that works great. Probably there's something I'm missing in how 
 hunspell works.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming, and a boolean query is generated out of it. All is 
 fine when adding all clauses as should (OR operator), but if I add all 
 clauses as must (AND operator), then I get back only the documents that 
 contain the stem originating from exactly the same original word.
 Example for the Dutch language I'm working with: fiets (means bicycle in 
 Dutch), its plural is fietsen.
 If I index fietsen I get both fietsen and fiets indexed, but if I index 
 fiets I get only fiets indexed.
 When I query for "fietsen whatever" I get the following boolean query: 
 field:fiets field:fietsen field:whatever.
 If I apply the AND operator and use must clauses for each subquery, then I 
 can only find the documents that originally contained fietsen, not the ones 
 that originally contained fiets, which is not really what stemming is about.
 Any thoughts on this? I also wonder if it can be a dictionary issue, since I 
 see that different words that have the word fiets as root don't get the 
 same stems, and using the AND operator at query time is a big issue.
 I would love to contribute to this and am looking forward to your feedback.




[jira] [Commented] (LUCENE-5181) Passage knows its own docID

2013-08-22 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747493#comment-13747493
 ] 

Luca Cavanna commented on LUCENE-5181:
--

True, having the doc id would be useful there. Why not add it directly to 
the Passage, to be able to know which document the Passage comes from?
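
A rough sketch of what a doc-aware passage could look like. SketchPassage and 
the format helper are simplified, hypothetical stand-ins and do not reproduce 
the actual Passage or PassageFormatter classes.

```java
// Hypothetical sketch: simplified stand-ins, not the Lucene Passage API.
final class PassageDocIdSketch {

    /** A passage that remembers which document it was extracted from. */
    static final class SketchPassage {
        final int docId;
        final int startOffset;
        final int endOffset;

        SketchPassage(int docId, int startOffset, int endOffset) {
            this.docId = docId;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** A formatter can then label each snippet with its source document. */
    static String format(SketchPassage[] passages, String content) {
        StringBuilder out = new StringBuilder();
        for (SketchPassage passage : passages) {
            out.append("doc ").append(passage.docId).append(": ")
               .append(content, passage.startOffset, passage.endOffset)
               .append('\n');
        }
        return out.toString();
    }
}
```

Carrying the doc id on the passage itself would leave the format signature 
unchanged, which is the appeal compared to adding a new parameter.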

 Passage knows its own docID
 ---

 Key: LUCENE-5181
 URL: https://issues.apache.org/jira/browse/LUCENE-5181
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.4
Reporter: Jon Stewart
Priority: Minor

 The new PostingsHighlight package allows for retrieval of term matches from a 
 query if one creates a class that extends PassageFormatter and overrides 
 format(). However, class Passage does not have a docID field, nor is this 
 provided via PassageFormatter.format(). Therefore, it's very difficult to 
 know which Document contains a given Passage.
 It would suffice for PassageFormatter.format() to be passed the docID as a 
 parameter. From the code in PostingsHighlight, this seems like it would be 
 easy.




[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-14 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740107#comment-13740107
 ] 

Luca Cavanna commented on LUCENE-4906:
--

No problem, thanks Mike!

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch





[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-11 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736240#comment-13736240
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Hi Mike,
I definitely agree that the highlighting API should be simple, and the postings 
highlighter is probably the only one that's really easy to use.

On the other hand, I think it's good to make explicit that if you use a 
Formatter<YourObject>, YourObject is what you're going to get back from the 
highlighter. People using the String version wouldn't notice the change, while 
advanced users would have to extend the base class and would get type safety 
too, which in my opinion makes it clearer and easier. Using Object feels a 
little old-fashioned and bogus to me, but again that's probably just me :)

I do trust your experience though. If you think the object version is better 
that's fine with me. What I care about is that this improvement gets committed 
soon, since it's a really useful one ;)

Thanks a lot for sharing your ideas

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch





[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-11 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736242#comment-13736242
 ] 

Luca Cavanna commented on LUCENE-4906:
--

One more thing: re-reading Robert's previous comments, I also find interesting 
the idea he had about changing the API to return a proper object instead of the 
Map<String, String[]>, or the String[] for the simplest methods. I wonder if 
it's worth addressing this as well in this issue, or if the current API is 
clear enough in your opinion. Any thoughts?

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch





[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-09 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734623#comment-13734623
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Hi Mike,
I had a look at your patch, looks good to me. Being able to get back arbitrary 
objects is a great improvement.

The only thing I would love to improve here is the need to cast the returned 
Objects to the type that the custom PassageFormatter uses.

We could work around this using generics, but the fact that the 
PassageFormatter can vary per field makes it harder. The only way I see to work 
around this is to prevent the PassageFormatter from returning different types 
of objects per field. That would mean that even though every field can have its 
own PassageFormatter, they all must return the same type. It kind of makes 
sense to me, since I wouldn't want to have heterogeneous types in the 
Map<Integer, Object>, but that is something that's currently possible. What do 
you think?
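
To make the constraint concrete, here is a minimal sketch in which the 
highlighter is parameterized by the rendered type, so every per-field 
formatter has to return the same T. SketchPassage, SketchFormatter and 
SketchHighlighter are hypothetical names, not the classes in the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these types only mirror the idea, not the Lucene API.
final class GenericFormatterSketch {

    /** Stand-in for a highlighted passage. */
    static final class SketchPassage {
        final String text;

        SketchPassage(String text) {
            this.text = text;
        }
    }

    /** A formatter that declares the type it renders to. */
    interface SketchFormatter<T> {
        T format(SketchPassage passage);
    }

    /** Formatters may differ per field, but all of them render to T. */
    static final class SketchHighlighter<T> {
        private final Map<String, SketchFormatter<T>> formatters = new HashMap<>();

        void setFormatter(String field, SketchFormatter<T> formatter) {
            formatters.put(field, formatter);
        }

        /** Returns T directly, so callers need no cast. */
        T highlight(String field, SketchPassage passage) {
            SketchFormatter<T> formatter = formatters.get(field);
            if (formatter == null) {
                throw new IllegalArgumentException("no formatter for field " + field);
            }
            return formatter.format(passage);
        }
    }
}
```

The trade-off is the one described above: a single highlighter instance can no 
longer mix result types across fields.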

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch





[jira] [Updated] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-08-09 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4906:
-

Attachment: LUCENE-4906.patch

I don't see why adding generics would complicate or limit the API. To me it 
would make it simpler and nicer (not a big change in terms of api itself 
though).

Attaching a patch with my thoughts to make more concrete what I had in mind, 
regardless of whether it will be integrated or not.

It's backwards compatible (even though the class is marked experimental): we 
have an abstract postings highlighter class that does most of the work and 
returns arbitrary objects (uses generics in order to do so). The 
PostingsHighlighter is its natural extension that returns String snippets.
 
I updated Mike's new test according to my changes. It should make it easier to 
understand what's needed to work with arbitrary objects in terms of code using 
this approach.

If you want to extend the abstract one, you have to declare what type the 
formatter is supposed to return, which makes the contract explicit and avoids 
any cast.

Limitations with this approach: 
1) as mentioned before (to me it's more of a benefit) there cannot be 
heterogeneous types returned by the same highlighter instance.
2) generics don't play well with arrays, thus all the highlight methods that 
return arrays are still in the subclass that returns String snippets, to keep 
backwards compatibility. Moving them to the base class would most likely 
require returning List<FormattedPassage> instead (not backwards compatible).
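
The second limitation can be illustrated with a small sketch. BaseHighlighter 
and StringHighlighter are hypothetical names chosen for this example, and the 
rendering logic is a placeholder, not what the attached patch does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class names are illustrative, not the ones in the patch.
final class AbstractHighlighterSketch {

    /** The base class does the work and is generic in the rendered type. */
    static abstract class BaseHighlighter<T> {
        /** Subclasses decide how a raw snippet is rendered. */
        protected abstract T render(String snippet);

        /** A List<T> is easy to return; a T[] cannot be created generically. */
        public final List<T> highlightAll(List<String> snippets) {
            List<T> rendered = new ArrayList<>(snippets.size());
            for (String snippet : snippets) {
                rendered.add(render(snippet));
            }
            // T[] out = new T[snippets.size()];  // would not compile: generic array creation
            return rendered;
        }
    }

    /** The String flavour keeps an array-returning method for compatibility. */
    static final class StringHighlighter extends BaseHighlighter<String> {
        @Override
        protected String render(String snippet) {
            return "<b>" + snippet + "</b>";
        }

        public String[] highlight(List<String> snippets) {
            return highlightAll(snippets).toArray(new String[0]);
        }
    }
}
```

Only the concrete subclass knows that T is String, which is why the 
array-returning methods can live there but not in the generic base class.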

I haven't updated the javadoc yet, but if you like my approach I can go ahead 
with it.

I would love to hear what you guys think about it. Generics can be scary... but 
useful sometimes too ;)

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless
 Attachments: LUCENE-4906.patch, LUCENE-4906.patch





[jira] [Commented] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-07-19 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13713487#comment-13713487
 ] 

Luca Cavanna commented on LUCENE-5057:
--

Thanks Adrien for looking into this, nice explanation!

 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna
Assignee: Adrien Grand




[jira] [Created] (LUCENE-5057) Hunspell stemmer generates multiple tokens (original + stems)

2013-06-14 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-5057:


 Summary: Hunspell stemmer generates multiple tokens (original + 
stems)
 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna


The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming, and a boolean query is generated out of it. All is fine when 
adding all clauses as should (OR operator), but if I add all clauses as must 
(AND operator), then I get back only the documents that contain the stem 
originating from exactly the same original word.

Example for the Dutch language I'm working with: fiets (means bicycle in 
Dutch), its plural is fietsen.

If I index fietsen I get both fietsen and fiets indexed, but if I index 
fiets I get only fiets indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained fietsen, not the ones that 
originally contained fiets, which is not really what stemming is about.
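
The mismatch can be reproduced with a toy model of the indexed terms. This is 
only a sketch with plain sets, not Lucene code, and it assumes for 
illustration that both documents also contain the word "whatever".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: a toy model of the indexed terms, not Lucene code.
final class StemmingMismatchSketch {

    /** Terms indexed for a document containing "fietsen": original plus stem. */
    static Set<String> docWithPlural() {
        return new HashSet<>(Arrays.asList("fietsen", "fiets", "whatever"));
    }

    /** Terms indexed for a document containing "fiets": only that token. */
    static Set<String> docWithSingular() {
        return new HashSet<>(Arrays.asList("fiets", "whatever"));
    }

    /** Terms the analyzer produces for the query "fietsen whatever". */
    static Set<String> queryTerms() {
        return new HashSet<>(Arrays.asList("fiets", "fietsen", "whatever"));
    }

    /** AND semantics: every query term must be present (all clauses are must). */
    static boolean matchesAll(Set<String> doc, Set<String> query) {
        return doc.containsAll(query);
    }

    /** OR semantics: one matching term is enough (all clauses are should). */
    static boolean matchesAny(Set<String> doc, Set<String> query) {
        for (String term : query) {
            if (doc.contains(term)) {
                return true;
            }
        }
        return false;
    }
}
```

Under AND semantics the document that contained only fiets fails because the 
term fietsen was never indexed for it, while under OR semantics both documents 
match.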

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.




[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-06-14 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5057:
-

Summary: Hunspell stemmer generates multiple tokens  (was: Hunspell stemmer 
generates multiple tokens (original + stems))

 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna





[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-06-14 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5057:
-

Description: 
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed.

When I look at how snowball works, only one token is indexed, the stem, and 
that works great.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming, and a boolean query is generated out of it. All is fine when 
adding all clauses as should (OR operator), but if I add all clauses as must 
(AND operator), then I get back only the documents that contain the stem 
originating from exactly the same original word.

Example for the Dutch language I'm working with: fiets (means bicycle in 
Dutch), its plural is fietsen.

If I index fietsen I get both fietsen and fiets indexed, but if I index 
fiets I get only fiets indexed.

When I query for "fietsen whatever" I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained fietsen, not the ones that 
originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.

  was:
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: fiets (means bicycle in 
dutch), its plural is fietsen.

If I index fietsen I get both fietsen and fiets indexed, but if I index 
fiets I get the only fiets indexed.

When I query for fietsen whatever I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained fietsen, not the ones that 
originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.


 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna


[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-06-14 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5057:
-

Description: 
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed, which would be even better in my opinion. When I 
look at how snowball works only one token is indexed, the stem, and that works 
great. Probably there's something I'm missing in how hunspell works.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming, and a boolean query is generated out of it. All is fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originating from exactly the same original word.

Example for the Dutch language I'm working with: fiets (means bicycle in 
Dutch), its plural is fietsen.

If I index fietsen I get both fietsen and fiets indexed, but if I index 
fiets I get only fiets indexed.

When I query for fietsen whatever I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained fietsen, not the ones that 
originally contained fiets, which is not really what stemming is about.
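
To make the mismatch concrete, here is a minimal, self-contained sketch in plain Java (no Lucene classes). The analyze method is a hypothetical stand-in that hard-codes the behaviour described above for fiets and fietsen; it is not the real hunspell stemmer, and all class and method names are made up for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class HunspellAndQuerySketch {

    // Hypothetical stand-in for the behaviour described in this issue:
    // the original token is emitted together with its stem, if there is one.
    static Set<String> analyze(String word) {
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(word);
        if (word.equals("fietsen")) {
            tokens.add("fiets"); // stem emitted in addition to the original token
        }
        return tokens;
    }

    // All clauses as must (AND): the document needs every query term.
    static boolean matchesAll(Set<String> indexedTokens, Set<String> queryTerms) {
        return indexedTokens.containsAll(queryTerms);
    }

    // All clauses as should (OR): one shared term is enough.
    static boolean matchesAny(Set<String> indexedTokens, Set<String> queryTerms) {
        return !Collections.disjoint(indexedTokens, queryTerms);
    }

    public static void main(String[] args) {
        Set<String> docFiets = analyze("fiets");     // [fiets]
        Set<String> docFietsen = analyze("fietsen"); // [fietsen, fiets]
        Set<String> query = analyze("fietsen");      // [fietsen, fiets]

        System.out.println(matchesAny(docFiets, query));   // true
        System.out.println(matchesAll(docFiets, query));   // false
        System.out.println(matchesAll(docFietsen, query)); // true
    }
}
```

With must clauses, the document that contained only fiets does not match the query for fietsen, because the query also requires the unstemmed term fietsen.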

Any thoughts on this? I wonder if it can be a dictionary issue since I see that 
different words that have the word fiets as root don't get the same stems, 
and using the AND operator at query time is a big issue.



  was:
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed.

When I look at how snowball works, only one token is indexed, the stem, and 
that works great.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: fiets (means bicycle in 
dutch), its plural is fietsen.

If I index fietsen I get both fietsen and fiets indexed, but if I index 
fiets I get the only fiets indexed.

When I query for fietsen whatever I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained fietsen, not the ones that 
originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I would work out a patch, I'd just need some help 
deciding the name of the option and what the default behaviour should be.


 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token. I'm not sure though if it's 
 possible to have only a single stem indexed, which would be even better in my 
 opinion. When I look at how snowball works only one token is indexed, the 
 stem, and that works great. Probably there's something I'm missing in how 
 hunspell works.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming and a boolean query is generated out of it. All fine 
 when adding all clauses as should (OR operator), but if I add all clauses as 
 must (AND operator), then I can get back only the documents that contain the 
 stem originated by the exactly same original word.
 Example for 

[jira] [Updated] (LUCENE-5057) Hunspell stemmer generates multiple tokens

2013-06-14 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-5057:
-

Description: 
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed, which would be even better in my opinion. When I 
look at how snowball works only one token is indexed, the stem, and that works 
great. Probably there's something I'm missing in how hunspell works.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming, and a boolean query is generated out of it. All is fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originating from exactly the same original word.

Example for the Dutch language I'm working with: fiets (means bicycle in 
Dutch), its plural is fietsen.

If I index fietsen I get both fietsen and fiets indexed, but if I index 
fiets I get only fiets indexed.

When I query for fietsen whatever I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained fietsen, not the ones that 
originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I also wonder if it can be a dictionary issue since I see 
that different words that have the word fiets as root don't get the same 
stems, and using the AND operator at query time is a big issue.

I would love to contribute to this and am looking forward to your feedback.



  was:
The hunspell stemmer seems to be generating multiple tokens: the original token 
plus the available stems.

It might be a good thing in some cases but it seems to be a different behaviour 
compared to the other stemmers and causes problems as well. I would rather have 
an option to decide whether it should output only the available stems, or the 
stems plus the original token. I'm not sure though if it's possible to have 
only a single stem indexed, which would be even better in my opinion. When I 
look at how snowball works only one token is indexed, the stem, and that works 
great. Probably there's something I'm missing in how hunspell works.

Here is my issue: I have a query composed of multiple terms, which is analyzed 
using stemming and a boolean query is generated out of it. All fine when adding 
all clauses as should (OR operator), but if I add all clauses as must (AND 
operator), then I can get back only the documents that contain the stem 
originated by the exactly same original word.

Example for the dutch language I'm working with: fiets (means bicycle in 
dutch), its plural is fietsen.

If I index fietsen I get both fietsen and fiets indexed, but if I index 
fiets I get the only fiets indexed.

When I query for fietsen whatever I get the following boolean query: 
field:fiets field:fietsen field:whatever.

If I apply the AND operator and use must clauses for each subquery, then I can 
only find the documents that originally contained fietsen, not the ones that 
originally contained fiets, which is not really what stemming is about.

Any thoughts on this? I wonder if it can be a dictionary issue since I see that 
different words that have the word fiets as root don't get the same stems, 
and using the AND operator at query time is a big issue.




 Hunspell stemmer generates multiple tokens
 --

 Key: LUCENE-5057
 URL: https://issues.apache.org/jira/browse/LUCENE-5057
 Project: Lucene - Core
  Issue Type: Improvement
Affects Versions: 4.3
Reporter: Luca Cavanna

 The hunspell stemmer seems to be generating multiple tokens: the original 
 token plus the available stems.
 It might be a good thing in some cases but it seems to be a different 
 behaviour compared to the other stemmers and causes problems as well. I would 
 rather have an option to decide whether it should output only the available 
 stems, or the stems plus the original token. I'm not sure though if it's 
 possible to have only a single stem indexed, which would be even better in my 
 opinion. When I look at how snowball works only one token is indexed, the 
 stem, and that works great. Probably there's something I'm missing in how 
 hunspell works.
 Here is my issue: I have a query composed of multiple terms, which is 
 analyzed using stemming and a boolean query is 

[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629961#comment-13629961
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Sounds interesting. Is anybody working on this already? I'd like to volunteer. 
What do you have in mind exactly? Now the format method returns a string. What 
would you like to see as output instead?

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless

 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629973#comment-13629973
 ] 

Luca Cavanna commented on LUCENE-4906:
--

I see! If you couldn't make it how can I make it? :)

But the idea is that you could have some kind of object as output instead of a 
string, like for example an array of tokens plus maybe some more information?

It would avoid having to parse the string output again and somehow re-analyze 
the text, since the snippet could be provided as output directly?
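
A rough sketch of what I have in mind, in plain Java with no Lucene classes. Passage and Formatter here are hypothetical, simplified types made up for illustration; they are not the actual PostingsHighlighter API, and the real signatures may well differ. The type parameter T is the rendered output, so one formatter can return a string and another a list of fragments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GenericFormatterSketch {

    // Hypothetical minimal passage: start/end character offsets into the content.
    static final class Passage {
        final int start;
        final int end;

        Passage(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical generic formatter: T is whatever the caller wants rendered.
    interface Formatter<T> {
        T format(List<Passage> passages, String content);
    }

    // Renders to a single string, roughly what happens today.
    static final class StringFormatter implements Formatter<String> {
        @Override
        public String format(List<Passage> passages, String content) {
            StringBuilder sb = new StringBuilder();
            for (Passage p : passages) {
                if (sb.length() > 0) {
                    sb.append(" ... ");
                }
                sb.append(content, p.start, p.end);
            }
            return sb.toString();
        }
    }

    // Renders to a list of fragments, so nothing has to be re-parsed afterwards.
    static final class ListFormatter implements Formatter<List<String>> {
        @Override
        public List<String> format(List<Passage> passages, String content) {
            List<String> fragments = new ArrayList<>();
            for (Passage p : passages) {
                fragments.add(content.substring(p.start, p.end));
            }
            return fragments;
        }
    }

    public static void main(String[] args) {
        String content = "Lucene is fast. Lucene is fun.";
        List<Passage> passages = Arrays.asList(new Passage(0, 15), new Passage(16, 30));
        System.out.println(new StringFormatter().format(passages, content));
        System.out.println(new ListFormatter().format(passages, content));
    }
}
```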

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless

 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Commented] (LUCENE-4906) PostingsHighlighter's PassageFormatter should allow for rendering to arbitrary objects

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630227#comment-13630227
 ] 

Luca Cavanna commented on LUCENE-4906:
--

Sure, I hadn't seen that issue yet but I was about to propose the same looking 
at the code.

Thanks for your insight!
I thought about generics too, but then we'd have to be really careful otherwise 
the generics policeman jumps in :) 

I'll play around with some ideas and post the results here as soon as I have 
something.

 PostingsHighlighter's PassageFormatter should allow for rendering to 
 arbitrary objects
 --

 Key: LUCENE-4906
 URL: https://issues.apache.org/jira/browse/LUCENE-4906
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless

 For example, in a server, I may want to render the highlight result to 
 JsonObject to send back to the front-end. Today since we render to string, I 
 have to render to JSON string and then re-parse to JsonObject, which is 
 inefficient...
 Or, if (Rob's idea:) we make a query that's like MoreLikeThis but it pulls 
 terms from snippets instead, so you get proximity-influenced salient/expanded 
 terms, then perhaps that renders to just an array of tokens or fragments or 
 something from each snippet.




[jira] [Updated] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class

2013-04-12 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4896:
-

Attachment: LUCENE-4896.patch

No problem. It just felt weird to write an abstract class that looked to me 
like an interface.

Should be better now. I also made append protected.

 PostingsHighlighter should use a interface of PassageFormatter instead of a 
 class
 -

 Key: LUCENE-4896
 URL: https://issues.apache.org/jira/browse/LUCENE-4896
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
 Environment: NA
Reporter: Sebastien Dionne
  Labels: newdev
 Attachments: LUCENE-4896.patch, LUCENE-4896.patch


 In my project I need a custom PassageFormatter to use with 
 PostingsHighlighter.  I extended PassageFormatter  to override format(...)
 but if I do that, I don't have access to the private variables.  So instead 
 of changing the scope to protected, it would be more useful to use an 
 interface for PassageFormatter,
 like public DefaultPassageFormatter implements PassageFormatter.




[jira] [Commented] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630392#comment-13630392
 ] 

Luca Cavanna commented on LUCENE-4896:
--

Just read your last comment, going to add another patch :)

 PostingsHighlighter should use a interface of PassageFormatter instead of a 
 class
 -

 Key: LUCENE-4896
 URL: https://issues.apache.org/jira/browse/LUCENE-4896
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
 Environment: NA
Reporter: Sebastien Dionne
  Labels: newdev
 Attachments: LUCENE-4896.patch, LUCENE-4896.patch


 In my project I need a custom PassageFormatter to use with 
 PostingsHighlighter.  I extended PassageFormatter  to override format(...)
 but if I do that, I don't have access to the private variables.  So instead 
 of changing the scope to protected, it would be more useful to use an 
 interface for PassageFormatter,
 like public DefaultPassageFormatter implements PassageFormatter.




[jira] [Updated] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class

2013-04-12 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4896:
-

Attachment: LUCENE-4896.patch

Here it is.
The lucene.experimental was already there, I left it only in the abstract 
class. 

Fixed format javadocs too.

 PostingsHighlighter should use a interface of PassageFormatter instead of a 
 class
 -

 Key: LUCENE-4896
 URL: https://issues.apache.org/jira/browse/LUCENE-4896
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
 Environment: NA
Reporter: Sebastien Dionne
  Labels: newdev
 Attachments: LUCENE-4896.patch, LUCENE-4896.patch, LUCENE-4896.patch


 In my project I need a custom PassageFormatter to use with 
 PostingsHighlighter.  I extended PassageFormatter  to override format(...)
 but if I do that, I don't have access to the private variables.  So instead 
 of changing the scope to protected, it would be more useful to use an 
 interface for PassageFormatter,
 like public DefaultPassageFormatter implements PassageFormatter.




[jira] [Commented] (LUCENE-4896) PostingsHighlighter should use a interface of PassageFormatter instead of a class

2013-04-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630441#comment-13630441
 ] 

Luca Cavanna commented on LUCENE-4896:
--

I see what you mean Robert, thanks a lot for your explanation. I would have 
probably ended up with an interface + abstract class then ;)

Let's see what I can come up with for LUCENE-4906...





 PostingsHighlighter should use a interface of PassageFormatter instead of a 
 class
 -

 Key: LUCENE-4896
 URL: https://issues.apache.org/jira/browse/LUCENE-4896
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
 Environment: NA
Reporter: Sebastien Dionne
  Labels: newdev
 Attachments: LUCENE-4896.patch, LUCENE-4896.patch, LUCENE-4896.patch


 In my project I need a custom PassageFormatter to use with 
 PostingsHighlighter.  I extended PassageFormatter  to override format(...)
 but if I do that, I don't have access to the private variables.  So instead 
 of changing the scope to protected, it would be more useful to use an 
 interface for PassageFormatter,
 like public DefaultPassageFormatter implements PassageFormatter.




[jira] [Commented] (LUCENE-4825) PostingsHighlighter support for positional queries

2013-03-13 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600918#comment-13600918
 ] 

Luca Cavanna commented on LUCENE-4825:
--

Hey Robert,
sorry but I don't quite understand why it would become an orange? :)

I mean, the PostingsHighlighter does (among others) two great things:
1) reads offsets from the postings list, as its name says
2) summarizes the content giving nice sentences as output

I think the two above features are a great improvement and pretty much what 
everybody would like to have!

I'm proposing to add support for positional queries, as a third optional 
feature. We would need to read the spans from the positional queries in order 
to highlight only the proper terms, otherwise the output is wrong from a user 
perspective. Would this make it that much slower? I don't mean to reanalyze the 
text...

Don't get me wrong, you must be right, but I would like to understand more. 

You're saying that instead of adding 3) to 2) and 1) we should have another 
highlighter that does 1) 2) and 3)?





 PostingsHighlighter support for positional queries
 --

 Key: LUCENE-4825
 URL: https://issues.apache.org/jira/browse/LUCENE-4825
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
Reporter: Luca Cavanna

 I've been playing around with the brand new PostingsHighlighter. I'm really 
 happy with the result in terms of quality of the snippets and performance.
 On the other hand, I noticed it doesn't support positional queries. If you 
 make a span query, for example, all the single terms will be highlighted, 
 even though they haven't contributed to the match. That reminds me of the 
 difference between the QueryTermScorer and the QueryScorer (using the 
 standard Highlighter).
 I've been trying to adapt what the QueryScorer does, especially the 
 extraction of the query terms together with their positions (what 
 WeightedSpanTermExtractor does). Next step would be to take that information 
 into account within the formatter and highlight only the terms that actually 
 contributed to the match. I'm not quite ready yet with a patch to contribute 
 this back, but I certainly intend to do so. That's why I opened the issue and 
 in the meantime I would like to hear what you guys think about it and  
 discuss how best we can fix it. I think it would be a big improvement for 
 this new highlighter, which is already great!




[jira] [Created] (LUCENE-4825) PostingsHighlighter support for positional queries

2013-03-12 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-4825:


 Summary: PostingsHighlighter support for positional queries
 Key: LUCENE-4825
 URL: https://issues.apache.org/jira/browse/LUCENE-4825
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
Reporter: Luca Cavanna


I've been playing around with the brand new PostingsHighlighter. I'm really 
happy with the result in terms of quality of the snippets and performance.
On the other hand, I noticed it doesn't support positional queries. If you make 
a span query, for example, all the single terms will be highlighted, even 
though they haven't contributed to the match. That reminds me of the difference 
between the QueryTermScorer and the QueryScorer (using the standard 
Highlighter).

I've been trying to adapt what the QueryScorer does, especially the extraction 
of the query terms together with their positions (what 
WeightedSpanTermExtractor does). Next step would be to take that information 
into account within the formatter and highlight only the terms that actually 
contributed to the match. I'm not quite ready yet with a patch to contribute 
this back, but I certainly intend to do so. That's why I opened the issue and 
in the meantime I would like to hear what you guys think about it and  discuss 
how best we can fix it. I think it would be a big improvement for this new 
highlighter, which is already great!
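
To illustrate the difference, here is a small self-contained sketch in plain Java. It does not use any Lucene classes and is not how QueryScorer or WeightedSpanTermExtractor are implemented; it only contrasts term-based highlighting with a position-aware variant for an exact phrase. All names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class PhraseHighlightSketch {

    // Term-based: every position whose token is one of the query terms is highlighted.
    static List<Integer> termPositions(String[] tokens, String[] phrase) {
        Set<String> terms = new HashSet<>(Arrays.asList(phrase));
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (terms.contains(tokens[i])) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Position-aware: only positions that belong to an exact, in-order
    // occurrence of the phrase are highlighted.
    static List<Integer> phrasePositions(String[] tokens, String[] phrase) {
        Set<Integer> hits = new TreeSet<>(); // sorted, no duplicates on overlapping matches
        for (int i = 0; i + phrase.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < phrase.length; j++) {
                if (!tokens[i + j].equals(phrase[j])) {
                    match = false;
                    break;
                }
            }
            if (match) {
                for (int j = 0; j < phrase.length; j++) {
                    hits.add(i + j);
                }
            }
        }
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        String[] tokens = "the quick fox saw a quick brown fox".split(" ");
        String[] phrase = {"quick", "fox"};
        System.out.println(termPositions(tokens, phrase));   // [1, 2, 5, 7]
        System.out.println(phrasePositions(tokens, phrase)); // [1, 2]
    }
}
```

The term-based variant also highlights positions 5 and 7, even though those occurrences did not contribute to the phrase match.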






[jira] [Commented] (LUCENE-4825) PostingsHighlighter support for positional queries

2013-03-12 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600572#comment-13600572
 ] 

Luca Cavanna commented on LUCENE-4825:
--

Thanks for your input, Robert!

I see your point, even though from a user perspective I'd rather see only the 
complete phrase highlighted if I make a phrase query, not every single term. I 
think we can currently achieve this only like the old highlighter does, am I 
right? 
Maybe we can make this pluggable and have different implementations?




 PostingsHighlighter support for positional queries
 --

 Key: LUCENE-4825
 URL: https://issues.apache.org/jira/browse/LUCENE-4825
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/highlighter
Affects Versions: 4.2
Reporter: Luca Cavanna

 I've been playing around with the brand new PostingsHighlighter. I'm really 
 happy with the result in terms of quality of the snippets and performance.
 On the other hand, I noticed it doesn't support positional queries. If you 
 make a span query, for example, all the single terms will be highlighted, 
 even though they haven't contributed to the match. That reminds me of the 
 difference between the QueryTermScorer and the QueryScorer (using the 
 standard Highlighter).
 I've been trying to adapt what the QueryScorer does, especially the 
 extraction of the query terms together with their positions (what 
 WeightedSpanTermExtractor does). Next step would be to take that information 
 into account within the formatter and highlight only the terms that actually 
 contributed to the match. I'm not quite ready yet with a patch to contribute 
 this back, but I certainly intend to do so. That's why I opened the issue and 
 in the meantime I would like to hear what you guys think about it and  
 discuss how best we can fix it. I think it would be a big improvement for 
 this new highlighter, which is already great!




[jira] [Commented] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-06-01 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287260#comment-13287260
 ] 

Luca Cavanna commented on LUCENE-3976:
--

Hi Chris, 
I agree with you. On the other hand, with the affix rule mentioned, before 
LUCENE-4019 we had an AOE, so the additional catch would have been useful just 
to throw a nicer error message like "Error while parsing the affix file". That 
one has been solved at its source. For now I don't see any other possible 
errors, but I'm sure there are some, maybe plenty, since we support only a subset 
of the formats and features.
It was just a way to introduce a generic error message, but I totally agree that 
the right approach would be fixing everything at the source.

 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch, LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-31 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4019:
-

Attachment: LUCENE-4019.patch

Hi Chris, 
thanks for your feedback. Here is a new patch containing a new option to 
enable or disable strict affix parsing; it is enabled by default. I also updated 
the HunspellStemFilterFactory to expose the new option to Solr.
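
The idea behind the option can be sketched as follows, in plain Java. This is not the code from the patch: the class, the parseRule method and the strict parameter name are hypothetical, and passing the line number as the ParseException offset is an assumption made for illustration. With strict parsing a rule without a condition is rejected; otherwise the missing condition falls back to '.'.

```java
import java.text.ParseException;

public class AffixRuleSketch {

    // Hypothetical, simplified view of one affix rule line.
    static final class Rule {
        final String flag;
        final String strip;
        final String affix;
        final String condition;

        Rule(String flag, String strip, String affix, String condition) {
            this.flag = flag;
            this.strip = strip;
            this.affix = affix;
            this.condition = condition;
        }
    }

    // Parses a rule line such as "SFX Na 0 ste" or "SFX Na 0 ste .".
    // Header lines such as "SFX Na N 1" are assumed to be handled by the caller.
    static Rule parseRule(String line, int lineNumber, boolean strict) throws ParseException {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 4) {
            throw new ParseException("Not an affix rule: " + line, lineNumber);
        }
        String condition;
        if (parts.length >= 5) {
            condition = parts[4];
        } else if (strict) {
            // Strict parsing: a missing condition is an error.
            throw new ParseException("Affix rule without condition: " + line, lineNumber);
        } else {
            // Lenient parsing: treat the missing condition as '.', meaning always.
            condition = ".";
        }
        return new Rule(parts[1], parts[2], parts[3], condition);
    }

    public static void main(String[] args) throws ParseException {
        Rule lenient = parseRule("SFX Na 0 ste", 2, false);
        System.out.println(lenient.affix + " " + lenient.condition); // ste .
        try {
            parseRule("SFX Na 0 ste", 2, true);
        } catch (ParseException e) {
            System.out.println(e.getMessage() + " (line " + e.getErrorOffset() + ")");
        }
    }
}
```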

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna
Assignee: Chris Male
 Attachments: LUCENE-4019.patch, LUCENE-4019.patch


 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976, the 
 readAffix method throws an error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?




[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-31 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4019:
-

Attachment: LUCENE-4019.patch

Yeah, sorry for my mistakes, I corrected them.
And I added the line number to the ParseException.
Let me know if there's something more I can do!
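
One way to carry the line number in a ParseException is sketched below, under stated assumptions: the class and method names are hypothetical, and using the exception's "error offset" argument to hold the line number is a choice made for this sketch, not necessarily what the patch does.

```java
import java.io.IOException;
import java.io.LineNumberReader;
import java.io.StringReader;
import java.text.ParseException;

/** Hypothetical sketch; not the code from the LUCENE-4019 patch. */
final class LineNumberErrorSketch {

  /** Counts well-formed rule lines and reports the 1-based line number of the first malformed one. */
  static int countRules(String affixText) throws IOException, ParseException {
    LineNumberReader reader = new LineNumberReader(new StringReader(affixText));
    int count = 0;
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.trim().isEmpty()) {
        continue; // blank lines are not rules
      }
      if (line.trim().split("\\s+").length < 5) {
        // getLineNumber() returns the number of lines read so far,
        // i.e. the 1-based number of the line just returned by readLine().
        throw new ParseException("Malformed affix rule: " + line, reader.getLineNumber());
      }
      count++;
    }
    return count;
  }
}
```

A caller can then report `e.getErrorOffset()` next to the file name, so the offending rule is easy to find in a large affix file.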

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna
Assignee: Chris Male
 Attachments: LUCENE-4019.patch, LUCENE-4019.patch, LUCENE-4019.patch


 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976, the 
 readAffix method throws an error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?




[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269516#comment-13269516
 ] 

Luca Cavanna commented on LUCENE-4019:
--

Thank you Robert for the explanation!
In this specific case it's hard to understand the differences between hunspell 
and Lucene, since Lucene doesn't even parse the affix file.
I've been in contact with the authors of those Dutch dictionaries, as well as 
with the hunspell author. It turned out that those affix rules are wrong and 
hunspell actually ignores them. I think it's better to ignore them in Lucene 
too, rather than throwing an exception, which makes it impossible to use those 
dictionaries at all.

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna

 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976, the 
 readAffix method throws an error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?




[jira] [Updated] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-05-07 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-4019:
-

Attachment: LUCENE-4019.patch

Small patch: affix rules with fewer than 5 elements are now ignored. I added a 
specific test with a new affix file containing an example of a rule that is 
shorter than it should be. Let me know if you prefer to add a warning when a 
rule is skipped. Hunspell does that only with a specific command line option.

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna
 Attachments: LUCENE-4019.patch


 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976, the 
 readAffix method throws an error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?




[jira] [Commented] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269520#comment-13269520
 ] 

Luca Cavanna commented on LUCENE-3976:
--

The specific case of affix rule with less than 5 elements has been addressed in 
LUCENE-4019. Please ignore my first patch here since it's related to that 
specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improve error messages anyway, in a more generic way if 
possible.

 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Issue Comment Edited] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269520#comment-13269520
 ] 

Luca Cavanna edited comment on LUCENE-3976 at 5/7/12 11:03 AM:
---

The specific case of affix rule with less than 5 elements has been addressed in 
LUCENE-4019. Please ignore my first patch here since it's related to that 
specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improving error messages anyway, possibly in a more generic 
way.

  was (Author: lucacavanna):
The specific case of affix rule with less than 5 elements has been 
addressed in LUCENE-4019. Please ignore my first patch here since it's related 
to that specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improve error messages anyway, in a more generic way if 
possible.
  
 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Issue Comment Edited] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269520#comment-13269520
 ] 

Luca Cavanna edited comment on LUCENE-3976 at 5/7/12 11:04 AM:
---

The specific case of affix rule with less than 5 elements has been addressed in 
LUCENE-4019. Please ignore my first patch here since it's related to that 
specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improving error messages anyway, possibly in a generic way.

  was (Author: lucacavanna):
The specific case of affix rule with less than 5 elements has been 
addressed in LUCENE-4019. Please ignore my first patch here since it's related 
to that specific case which is now handled in a different way in LUCENE-4019.
I'm looking into improving error messages anyway, possibly in a more generic 
way.
  
 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Updated] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-05-07 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-3976:
-

Attachment: LUCENE-3976.patch

The patch tries to address unexpected errors while parsing affix files and 
dictionaries. I just added an outer try/catch that reports a generic "Error 
while parsing the affix/dictionary file", which in my opinion is better than 
letting some unchecked exception escape. Let me know if there's something else 
we can improve in the meantime.
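
The idea of the outer try/catch can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and wrapping into an IOException is an assumption, since the exception type used by the patch is not stated here.

```java
import java.io.IOException;

/** Hypothetical sketch; not the code from the LUCENE-3976 patch. */
final class WrappedParseSketch {

  /** Stand-in for a parser step that can fail with an unchecked exception. */
  static String conditionOf(String ruleLine) {
    // Throws ArrayIndexOutOfBoundsException when the rule has fewer than 5 fields.
    return ruleLine.trim().split("\\s+")[4];
  }

  /** Outer try/catch: turn any unchecked failure into one descriptive checked error. */
  static String parse(String fileName, String ruleLine) throws IOException {
    try {
      return conditionOf(ruleLine);
    } catch (RuntimeException e) {
      // Keep the original exception as the cause so the stack trace is not lost.
      throw new IOException("Error while parsing " + fileName + ": " + ruleLine, e);
    }
  }
}
```

The message names the file and the offending line, which is the information that was missing when only the raw unchecked exception surfaced.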

 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch, LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Commented] (SOLR-3189) Removing a field using TemplateTransformer

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269561#comment-13269561
 ] 

Luca Cavanna commented on SOLR-3189:


Guys, don't you think this feature could be useful? Is there something I can do 
to convince you? :)

 Removing a field using TemplateTransformer
 --

 Key: SOLR-3189
 URL: https://issues.apache.org/jira/browse/SOLR-3189
 Project: Solr
  Issue Type: Improvement
  Components: contrib - DataImportHandler
Reporter: Luca Cavanna
Priority: Minor
 Attachments: SOLR-3189.patch


 While importing documents through DataImportHandler I need to remove some 
 fields from the final SolrDocument before it's submitted to Solr.
 My use case: the import query returns an A column which I use to fill in the B 
 field on the Solr instance. My Solr schema contains both the A and B fields, 
 so they are both filled in through DIH. I'd like to force the deletion of A 
 from the generated SolrDocument since I need a value only in the B field and 
 want to leave the A field empty. The only way I found is using 
 ScriptTransformer, so I thought it could be useful to add this feature to the 
 TemplateTransformer.
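
 The requested behavior can be illustrated on a plain map standing in for an imported row. This is a sketch only: it does not use the DataImportHandler API, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; a plain Map stands in for an imported row. */
final class RemoveFieldSketch {

  /** Copies column `from` into field `to`, then drops `from` from the resulting row. */
  static Map<String, Object> moveColumn(Map<String, Object> row, String from, String to) {
    Map<String, Object> result = new HashMap<>(row); // leave the input row untouched
    Object value = result.remove(from);
    if (value != null) {
      result.put(to, value);
    }
    return result;
  }
}
```

 After the call the row carries a value for B only, so the A field stays empty in the indexed document even though the schema defines it.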




[jira] [Created] (SOLR-3443) Optimize hunspell dictionary loading with multiple cores

2012-05-07 Thread Luca Cavanna (JIRA)
Luca Cavanna created SOLR-3443:
--

 Summary: Optimize hunspell dictionary loading with multiple cores
 Key: SOLR-3443
 URL: https://issues.apache.org/jira/browse/SOLR-3443
 Project: Solr
  Issue Type: Improvement
Reporter: Luca Cavanna


The Hunspell dictionary is loaded into memory. Each core using 
hunspell loads its own dictionary, even if all the cores are using the 
same dictionary files. As a result, the same dictionary is loaded into memory 
multiple times, once for each core. I think we should share those dictionaries 
between all cores in order to optimize memory usage. For example, if a 
dictionary takes 20MB in memory (this is what I measured) and you have 20 
cores, you are going to use 400MB for dictionaries alone, which doesn't seem 
like a good idea to me.




[jira] [Commented] (SOLR-3443) Optimize hunspell dictionary loading with multiple cores

2012-05-07 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269573#comment-13269573
 ] 

Luca Cavanna commented on SOLR-3443:


The first thing I have in mind is a static map containing all loaded 
dictionaries with some kind of unique identifier, so that the same dictionary 
can be reused between cores.
But my question is: is there a mechanism to share objects between cores in Solr? 
Is this the first time someone needs to share something between multiple cores?
I'd like to hear your thoughts!
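
The map idea could be sketched roughly as below. This is an illustration, not an existing Solr mechanism: the class name, the generic dictionary type D and the choice of key are assumptions. Holding one instance in a static field would give the static map described above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical sketch of a cache that lets several cores reuse one loaded dictionary. */
final class SharedDictionaryCache<D> {

  private final ConcurrentMap<String, D> cache = new ConcurrentHashMap<>();

  /**
   * Returns the dictionary registered under the key, loading it at most once.
   * The key should uniquely identify the dictionary, for example the
   * affix and dictionary file paths joined together.
   */
  D getOrLoad(String key, Function<String, D> loader) {
    return cache.computeIfAbsent(key, loader);
  }

  /** Number of distinct dictionaries currently held. */
  int size() {
    return cache.size();
  }
}
```

With such a cache, 20 cores using the same files would hold one dictionary instead of 20. The sketch leaves open when an entry may be evicted, for example after the last core using it is closed.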

 Optimize hunspell dictionary loading with multiple cores
 

 Key: SOLR-3443
 URL: https://issues.apache.org/jira/browse/SOLR-3443
 Project: Solr
  Issue Type: Improvement
Reporter: Luca Cavanna

 The Hunspell dictionary is loaded into memory. Each core using 
 hunspell loads its own dictionary, even if all the cores are using the 
 same dictionary files. As a result, the same dictionary is loaded into memory 
 multiple times, once for each core. I think we should share those 
 dictionaries between all cores in order to optimize memory usage. For 
 example, if a dictionary takes 20MB in memory (this is what I 
 measured) and you have 20 cores, you are going to use 400MB for 
 dictionaries alone, which doesn't seem like a good idea to me.




[jira] [Commented] (SOLR-3407) Field names with leading digits cause strange behavior when used with fl param

2012-04-25 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13261605#comment-13261605
 ] 

Luca Cavanna commented on SOLR-3407:


I guess this is a side effect coming from SOLR-2444. Solr already had similar 
problems (SOLR-2719: hyphen within the field name didn't work anymore). Some 
kind of validation would help (SOLR-3207) but it's not trivial to do since 
there are dynamic fields too; I think Solr should guarantee that allowed field 
names always work, or at least make clearer what restrictions there are.

 Field names with leading digits cause strange behavior when used with fl 
 param
 

 Key: SOLR-3407
 URL: https://issues.apache.org/jira/browse/SOLR-3407
 Project: Solr
  Issue Type: Bug
  Components: search
Affects Versions: 4.0
 Environment: apache-solr-4.0-2012-04-24_08-27-47
Reporter: Chris Bleakley

 When specifying a field name that starts with a digit (or digits) in the fl 
 parameter, Solr returns both the field name and field value as those 
 digits. For example, using nightly build 
 apache-solr-4.0-2012-04-24_08-27-47 I run: 
 java -jar start.jar 
 and 
 java -jar post.jar solr.xml monitor.xml 
 If I then add a field to the field list that starts with a digit ( 
 localhost:8983/solr/select?q=*:*&fl=24 ) the results look like: 
 ... 
 <doc>
 <long name="24">24</long>
 </doc>
 ... 
 If I try fl=24_7 it looks like everything after the underscore is truncated: 
 ... 
 <doc>
 <long name="24">24</long>
 </doc>
 ... 
 and if I try fl=3test it looks like everything after the last digit is 
 truncated: 
 ... 
 <doc>
 <long name="3">3</long>
 </doc>
 ... 
 If I have an actual value for that field (say I've indexed 24_7 to be true) 
 I get back that value as well as the behavior above. 
 ... 
 <doc>
 <bool name="24_7">true</bool>
 <long name="24">24</long>
 </doc>
 ... 




[jira] [Updated] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-04-24 Thread Luca Cavanna (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Cavanna updated LUCENE-3976:
-

Attachment: LUCENE-3976.patch

First draft patch: I added a check for that specific problem with an 
understandable error. Since we are going to read the first 5 elements 
from an array, it is better to check that there are at least 5 elements.
I'm not sure how we can improve generic error handling. Let me know your thoughts.

 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Commented] (LUCENE-3976) Improve error messages for unsupported Hunspell formats

2012-04-24 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260490#comment-13260490
 ] 

Luca Cavanna commented on LUCENE-3976:
--

We found out that some recent Dutch dictionaries contain rules like the one 
mentioned (starting from version 2.00, if I'm correct). I'm going to look at 
that specific problem and see how we can parse those affix rules.

 Improve error messages for unsupported Hunspell formats
 ---

 Key: LUCENE-3976
 URL: https://issues.apache.org/jira/browse/LUCENE-3976
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Chris Male
 Attachments: LUCENE-3976.patch


 Our hunspell implementation is never going to be able to support the huge 
 variety of formats that are out there, especially since our impl is based on 
 papers written on the topic rather than being a pure port.
 Recently we ran into the following suffix rule:
 {noformat}SFX CA 0 /CaCp{noformat}
 Due to the missing regex conditional, an AOE was being thrown, which made it 
 difficult to diagnose the problem.
 We should instead try to provide better error messages showing what we were 
 unable to parse.




[jira] [Created] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-04-24 Thread Luca Cavanna (JIRA)
Luca Cavanna created LUCENE-4019:


 Summary: Parsing Hunspell affix rules without regexp condition
 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna


We found out that some recent Dutch hunspell dictionaries contain suffix or 
prefix rules like the following:

{code} 
SFX Na N 1
SFX Na 0 ste
{code}

The rule on the second line doesn't contain the 5th parameter, which should be 
the condition (a regexp usually). You can usually see a '.' as condition, 
meaning always (for every character). As explained in LUCENE-3976, the readAffix 
method throws an error. I wonder if we should treat the missing value as a kind of 
default value, like '.'.  On the other hand I haven't found any information 
about this within the spec. Any thoughts?




[jira] [Commented] (LUCENE-4019) Parsing Hunspell affix rules without regexp condition

2012-04-24 Thread Luca Cavanna (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13260612#comment-13260612
 ] 

Luca Cavanna commented on LUCENE-4019:
--

Robert, by spec I meant exactly your links :)
Actually it's clear that the affix header has 4 elements while each rule has at 
least 5 elements. I don't really know what hunspell does with that kind of 
malformed rule. Lucene just throws an error while loading the dictionary. 
Looking at the hunspell source code, I might be wrong but I suspect it just 
skips that specific rule with some warning. But honestly it's hard to believe 
that at least 4 dictionaries I tried contain mistaken rules, isn't it? I'll 
investigate more, thanks!

 Parsing Hunspell affix rules without regexp condition
 -

 Key: LUCENE-4019
 URL: https://issues.apache.org/jira/browse/LUCENE-4019
 Project: Lucene - Java
  Issue Type: Improvement
  Components: modules/analysis
Affects Versions: 3.6
Reporter: Luca Cavanna

 We found out that some recent Dutch hunspell dictionaries contain suffix or 
 prefix rules like the following:
 {code} 
 SFX Na N 1
 SFX Na 0 ste
 {code}
 The rule on the second line doesn't contain the 5th parameter, which should 
 be the condition (a regexp usually). You can usually see a '.' as condition, 
 meaning always (for every character). As explained in LUCENE-3976, the 
 readAffix method throws an error. I wonder if we should treat the missing value 
 as a kind of default value, like '.'.  On the other hand I haven't found any 
 information about this within the spec. Any thoughts?
