[jira] [Commented] (LUCENE-9418) Ordered intervals can give inaccurate hits on interleaved terms

2020-09-04 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190630#comment-17190630
 ] 

Alan Woodward commented on LUCENE-9418:
---

Hi [~Brain2000], I think you have a different problem there; this issue 
concerns Interval queries, whereas you appear to have a problem with a sorting 
collector. Can you open a new issue, with a reproducible test failure if 
possible?

> Ordered intervals can give inaccurate hits on interleaved terms
> ---
>
> Key: LUCENE-9418
> URL: https://issues.apache.org/jira/browse/LUCENE-9418
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Fix For: 8.6
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Given the text 'A B A C', an ordered interval over 'A B C' will return the 
> inaccurate interval [2, 3], due to the way minimization is handled after 
> matches are found.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9418) Ordered intervals can give inaccurate hits on interleaved terms

2020-09-03 Thread Brian Coverstone (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190420#comment-17190420
 ] 

Brian Coverstone commented on LUCENE-9418:
--

I believe this may still be an issue in 8.6.0, as I'm finding that the last 
slot of a result page often holds an incorrect record.

I found a workaround: always request one more hit than needed.

Here is some pseudo code to demonstrate:
{quote}ComplexPhraseQueryParser cpqp = new ComplexPhraseQueryParser("somefield", analyzer);
Query query = cpqp.parse("somevalue");

int pageSize = 10;
int pageNum = 1;
int requestedRecords = pageSize * pageNum + 1; // +1 workaround
int startOffset = (pageNum - 1) * pageSize;

// StringValComparatorIgnoreCase is a custom comparator (not part of Lucene).
FieldComparatorSource fsc = new FieldComparatorSource() {
    @Override
    public FieldComparator<?> newComparator(String fieldname, int numhits, int sortPos, boolean reversed) {
        return new StringValComparatorIgnoreCase(numhits, fieldname);
    }
};

Sort sort = new Sort(new SortField("firstname", fsc, false));
IndexSearcher searcher = new IndexSearcher(reader);
TopFieldCollector tfcollector = TopFieldCollector.create(sort, requestedRecords + 1, Integer.MAX_VALUE);
searcher.search(query, tfcollector);
ScoreDoc[] hits = tfcollector.topDocs(startOffset, pageSize).scoreDocs;
{quote}
At this point "hits" is correct. However, if I remove the "+1" from the 
requestedRecords above, the last item in "hits" is often incorrect.
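
For a reproducible test it may help to have an independent reference for what a page should contain. The following is a minimal sketch in plain Java, with no Lucene dependency; the class and method names (PagingSketch, expectedPage) are hypothetical, and it assumes the sort is a plain case-insensitive string ordering. It uses the same paging arithmetic as the pseudo code above, but without the +1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: computes the page a caller would
// expect from a case-insensitive sort, for comparison with collector output.
class PagingSketch {
    static List<String> expectedPage(List<String> values, int pageNum, int pageSize) {
        List<String> sorted = new ArrayList<>(values);
        sorted.sort(String.CASE_INSENSITIVE_ORDER);

        int numHits = pageSize * pageNum;            // hits needed to fill the page, no +1
        int startOffset = (pageNum - 1) * pageSize;  // first slot of the requested page
        int end = Math.min(numHits, sorted.size());

        if (startOffset >= end) {
            return new ArrayList<>();                // page lies past the end of the results
        }
        return new ArrayList<>(sorted.subList(startOffset, end));
    }
}
```

Comparing the last entry of this list with the last entry of "hits" for the same page would show whether the final slot differs from a straightforward sort.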

 




[jira] [Commented] (LUCENE-9418) Ordered intervals can give inaccurate hits on interleaved terms

2020-06-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148440#comment-17148440
 ] 

ASF subversion and git services commented on LUCENE-9418:
-

Commit 1ec78ac39410c97aa397e5392a60051c04596efc in lucene-solr's branch 
refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1ec78ac ]

LUCENE-9418: Add CHANGES entry





[jira] [Commented] (LUCENE-9418) Ordered intervals can give inaccurate hits on interleaved terms

2020-06-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148438#comment-17148438
 ] 

ASF subversion and git services commented on LUCENE-9418:
-

Commit 3a42716cdb06ba650ccb2cbc9953c05c9a8a6abc in lucene-solr's branch 
refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3a42716 ]

LUCENE-9418: Fix ordered intervals over interleaved terms (#1618)

Given the input text 'A B A C', an ordered interval 'A B C' will currently
return an incorrect interval [2, 3] in addition to the correct [0, 3]
interval. This is due to a bug in the ORDERED algorithm, where we assume that
after the first interval is returned, the sub-intervals are always in order.
This assumption only holds during minimization, as minimizing an interval may
move the earlier terms beyond the trailing terms.

For example, after the initial [0, 3] interval is found above, the algorithm
will attempt to minimize it by advancing A to [2, 2]. Because this is still
before C at [3, 3], but after B at [1, 1], we then try advancing B, leaving it
at [Inf, Inf]. Minimization has failed, so we return the original interval of
[0, 3]. However, when we come to retrieve the next interval, our sub-intervals
look like this: A[2, 2], B[Inf, Inf], C[3, 3]. The assumption that they are in
order is broken. The algorithm sees that A is before B, assumes that therefore
all subsequent sub-intervals are in order, and returns the new interval.

This commit fixes things by changing the assumption of ordering to only hold
during minimization. When first finding a candidate interval, the algorithm
now checks that all sub-intervals appear in order.
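
The difference between the two checks can be illustrated with a minimal sketch in plain Java. This is not the Lucene implementation: the class and method names (OrderedIntervalSketch, firstPairInOrder, allInOrder, minimalOrderedIntervals) are hypothetical, each term is assumed to match a single position, and the term list is assumed to be non-empty. The sketch contrasts checking only the first pair of sub-intervals with checking all of them, and enumerates the minimal ordered intervals by brute force as a reference.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; names and structure are assumptions, not Lucene code.
class OrderedIntervalSketch {
    // Stands in for an exhausted sub-interval, written [Inf,Inf] in the commit message.
    static final int NO_MORE = Integer.MAX_VALUE;

    // The broken assumption: only the first pair of sub-intervals is compared.
    static boolean firstPairInOrder(int[] positions) {
        return positions.length < 2 || positions[0] < positions[1];
    }

    // The stricter check: no sub-interval is exhausted and each lies after its predecessor.
    static boolean allInOrder(int[] positions) {
        for (int i = 0; i < positions.length; i++) {
            if (positions[i] == NO_MORE) return false;
            if (i > 0 && positions[i - 1] >= positions[i]) return false;
        }
        return true;
    }

    // Earliest end of an ordered match that begins at 'start', or -1 if there is none.
    static int earliestEnd(String[] tokens, String[] terms, int start) {
        int pos = start;
        for (int t = 1; t < terms.length; t++) {
            pos++;
            while (pos < tokens.length && !tokens[pos].equals(terms[t])) pos++;
            if (pos >= tokens.length) return -1;
        }
        return pos;
    }

    // Brute-force reference: all minimal ordered intervals as {start, end} pairs.
    static List<int[]> minimalOrderedIntervals(String[] tokens, String[] terms) {
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            if (!tokens[start].equals(terms[0])) continue;
            int end = earliestEnd(tokens, terms, start);
            if (end < 0) continue;
            // A later start with the same end is nested inside the previous
            // candidate, so the previous candidate was not minimal.
            if (!result.isEmpty() && result.get(result.size() - 1)[1] == end) {
                result.remove(result.size() - 1);
            }
            result.add(new int[] {start, end});
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"A", "B", "A", "C"};
        String[] terms = {"A", "B", "C"};
        for (int[] hit : minimalOrderedIntervals(tokens, terms)) {
            System.out.println("[" + hit[0] + ", " + hit[1] + "]");
        }
        int[] state = {2, NO_MORE, 3}; // A[2,2], B[Inf,Inf], C[3,3]
        System.out.println("first pair only: " + firstPairInOrder(state));
        System.out.println("all in order:    " + allInOrder(state));
    }
}
```

For 'A B A C' the brute-force enumeration yields only [0, 3]. The state A[2,2], B[Inf,Inf], C[3,3] passes the first-pair check but fails the full check, which corresponds to the spurious [2, 3] hit described above.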




[jira] [Commented] (LUCENE-9418) Ordered intervals can give inaccurate hits on interleaved terms

2020-06-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148439#comment-17148439
 ] 

ASF subversion and git services commented on LUCENE-9418:
-

Commit 480b0f5395004c71a023222b8389f9dc5c19a9bb in lucene-solr's branch 
refs/heads/branch_8x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=480b0f5 ]

LUCENE-9418: Add CHANGES entry





[jira] [Commented] (LUCENE-9418) Ordered intervals can give inaccurate hits on interleaved terms

2020-06-30 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17148415#comment-17148415
 ] 

ASF subversion and git services commented on LUCENE-9418:
-

Commit 3ff331072a7435e971e35c2e28c38a90ca70802b in lucene-solr's branch 
refs/heads/master from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3ff3310 ]

LUCENE-9418: Fix ordered intervals over interleaved terms (#1618)





[jira] [Commented] (LUCENE-9418) Ordered intervals can give inaccurate hits on interleaved terms

2020-06-27 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146870#comment-17146870
 ] 

Alan Woodward commented on LUCENE-9418:
---

This was uncovered by an elasticsearch user and reported here: 
https://github.com/elastic/elasticsearch/issues/58576
