[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot resolved LUCENE-7580.
--
Resolution: Won't Fix
Resolved: not enough interest. I'll keep the github branches available for now.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: 7.0
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: Elschot20170326Counting.pdf, LUCENE-7580.patch,
> LUCENE-7580.patch, LUCENE-7580.patch, LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[
https://issues.apache.org/jira/browse/LUCENE-6596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot closed LUCENE-6596.
Resolution: Fixed
Closing, not enough interest.
> Make width of unordered near spans consistent with ordered
> --
>
> Key: LUCENE-6596
> URL: https://issues.apache.org/jira/browse/LUCENE-6596
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
>Affects Versions: 6.0
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.0
>
> Attachments: LUCENE-6596.patch, LUCENE-6596.patch
>
>
> Use actual slop for width in NearSpansUnordered.
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942393#comment-15942393
]
Paul Elschot edited comment on LUCENE-7580 at 3/26/17 6:51 PM:
---
I just pushed two branches to github, pullable as:
git pull https://github.com/PaulElschot/lucene-solr lucene7580-20170326
and
git pull https://github.com/PaulElschot/lucene-solr lucene7580report-20170326
The lucene7580-20170326 branch is an update of the previous pull request with a
few minor improvements. Most notable is putting SpansTreeWeight into its own
source file.
The lucene7580report-20170326 branch is on top of the lucene7580-20170326
branch, with the addition of the tex sources for a report on this issue.
I'll attach the pdf shortly here.
was (Author: paul.elsc...@xs4all.nl):
I just pushed to branches to github, pullable as:
git pull https://github.com/PaulElschot/lucene-solr lucene7580-20170326
and
git pull https://github.com/PaulElschot/lucene-solr lucene7580report-20170326
The lucene7580-20170326 branch is an update of the previous pull request with a
few minor improvements. Most notable is putting SpansTreeWeight into its own
source file.
The lucene7580report-20170326 branch is on top of the lucene7580-20170326
branch, with the addition of the tex sources for a report on this issue.
I'll attach the pdf shortly here.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: Elschot20170326Counting.pdf, LUCENE-7580.patch,
> LUCENE-7580.patch, LUCENE-7580.patch, LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942393#comment-15942393
]
Paul Elschot edited comment on LUCENE-7580 at 3/26/17 6:51 PM:
---
I just pushed two branches to github, pullable as:
git pull https://github.com/PaulElschot/lucene-solr lucene7580-20170326
and
git pull https://github.com/PaulElschot/lucene-solr lucene7580report-20170326
The lucene7580-20170326 branch is an update of the previous pull request with a
few minor improvements. Most notable is putting SpansTreeWeight into its own
source file.
The lucene7580report-20170326 branch is on top of the lucene7580-20170326
branch, with the addition of the tex sources for a report on this issue.
I'll attach the pdf shortly here.
was (Author: paul.elsc...@xs4all.nl):
I just pushed to branches to github, pullable as:
git pull https://github.com/PaulElschot/lucene-solr lucene7580-20170326
and
git pull https://github.com/PaulElschot/lucene-solr lucene7580report-20170326
The lucene7580-20170326 branch an update of the previous pull request with a
few minor improvements.
Most notable is putting SpansTreeWeight into its own source file.
The lucene7580report-20170326 is on top of the lucene7580-20170326 branch,
with the addition of the tex sources for a report on this issue.
I'll attach the pdf shortly here.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: Elschot20170326Counting.pdf, LUCENE-7580.patch,
> LUCENE-7580.patch, LUCENE-7580.patch, LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7580:
-
Attachment: Elschot20170326Counting.pdf
Report of 26 March 2017, generated from the lucene7580report-20170326 branch
and renamed to include the full date in the name.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: Elschot20170326Counting.pdf, LUCENE-7580.patch,
> LUCENE-7580.patch, LUCENE-7580.patch, LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942393#comment-15942393
]
Paul Elschot commented on LUCENE-7580:
--
I just pushed two branches to github, pullable as:
git pull https://github.com/PaulElschot/lucene-solr lucene7580-20170326
and
git pull https://github.com/PaulElschot/lucene-solr lucene7580report-20170326
The lucene7580-20170326 branch is an update of the previous pull request with a
few minor improvements.
Most notable is putting SpansTreeWeight into its own source file.
The lucene7580report-20170326 branch is on top of the lucene7580-20170326 branch,
with the addition of the tex sources for a report on this issue.
I'll attach the pdf shortly here.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: LUCENE-7580.patch, LUCENE-7580.patch, LUCENE-7580.patch,
> LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892972#comment-15892972
]
Paul Elschot edited comment on LUCENE-7398 at 3/2/17 8:54 PM:
--
One way to view the problem is that when span end positions are used to
determine the slop, it becomes impossible to determine an order for moving the
subspans to a next position.
So one direction out of this could be: use NearSpans that determines the slop
only by the start positions of the subspans. That leaves only the cases in
which the subspans can start (and maybe also end) at the same position.
To make sure that all the subspans move forward after a match we could move
them all forward until after the current match, and while doing that also
count/collect them for scoring/highlighting as long as they are within the
match. That should solve the bug reported here, which is about scoring a missed
matching occurrence.
This limits the required slop to using only the starting positions of the
subspans. Could this work?
was (Author: paul.elsc...@xs4all.nl):
On way to view the problem is that when span end positions are used to
determine the slop, it becomes impossible to determine an order for moving the
subspans to a next position.
So one direction out of this could be: use NearSpans that determines the slop
only by the start positions of the subspans. That leaves only the cases in
which the subspans can start (and maybe also end) at the same position.
To make sure that all the subspans move forward after a match we could move
them all forward until after the current match, and while doing that also
count/collect them for scoring/highlighting as long as they are within the
match. That should solve the bug reported here, which is about scoring a missed
matching occurrence.
This limits the required slop to using only the starting positions of the
subspans. Could this work?
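The proposal above can be sketched in plain Java. This is only an illustration under stated assumptions, not code from any patch: each sub-span is modelled as a sorted array of start positions, a match is assumed to run from the smallest to the largest current start position, and the class and method names (StartPositionNear, match) are hypothetical. Sub-spans that start at the same position are not treated specially here.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: sub-spans are modelled as sorted arrays of start positions. */
final class StartPositionNear {

    /**
     * Returns one list per match, holding every start position collected
     * inside that match. The slop is computed from start positions only.
     */
    static List<List<Integer>> match(int[][] starts, int slop) {
        int n = starts.length;
        List<List<Integer>> matches = new ArrayList<>();
        if (n == 0) {
            return matches;
        }
        int[] idx = new int[n];                 // current occurrence per sub-span
        while (true) {
            int min = Integer.MAX_VALUE;
            int max = Integer.MIN_VALUE;
            int minSub = -1;
            for (int i = 0; i < n; i++) {
                if (idx[i] >= starts[i].length) {
                    return matches;             // a sub-span is exhausted
                }
                int p = starts[i][idx[i]];
                if (p < min) { min = p; minSub = i; }
                if (p > max) { max = p; }
            }
            // Slop from start positions only: the gaps that remain after
            // packing the n sub-spans next to each other.
            if (max - min - (n - 1) <= slop) {
                List<Integer> collected = new ArrayList<>();
                for (int i = 0; i < n; i++) {
                    // Move every sub-span past the match, collecting on the way.
                    while (idx[i] < starts[i].length && starts[i][idx[i]] <= max) {
                        collected.add(starts[i][idx[i]]);
                        idx[i]++;
                    }
                }
                matches.add(collected);
            } else {
                idx[minSub]++;                  // no match: advance the first sub-span
            }
        }
    }
}
```

For example, match(new int[][] {{0, 1}, {2}}, 1) reports a single match that collects positions 0, 1 and 2, so both occurrences of the first term are available for scoring or highlighting, and every sub-span ends up after the match.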
> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch,
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch,
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping],
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as
> "coordinate gene research". It does not match "coordinate gene mapping
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It
> probably stopped working with the changes on SpanQueries in 5.3. I will
> attach a unit test that shows the problem.
[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892972#comment-15892972
]
Paul Elschot commented on LUCENE-7398:
--
One way to view the problem is that when span end positions are used to
determine the slop, it becomes impossible to determine an order for moving the
subspans to a next position.
So one direction out of this could be: use NearSpans that determines the slop
only by the start positions of the subspans. That leaves only the cases in
which the subspans can start (and maybe also end) at the same position.
To make sure that all the subspans move forward after a match we could move
them all forward until after the current match, and while doing that also
count/collect them for scoring/highlighting as long as they are within the
match. That should solve the bug reported here, which is about scoring a missed
matching occurrence.
This limits the required slop to using only the starting positions of the
subspans. Could this work?
> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch,
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch,
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping],
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as
> "coordinate gene research". It does not match "coordinate gene mapping
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It
> probably stopped working with the changes on SpanQueries in 5.3. I will
> attach a unit test that shows the problem.
[
https://issues.apache.org/jira/browse/LUCENE-7715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888703#comment-15888703
]
Paul Elschot commented on LUCENE-7715:
--
bq. ... how it deals with the initial state that all sub spans have a start
position of -1.
There is no need for that, the intermediate data structure is a priority queue
that is not a Spans itself.
If the names of this priority queue (SpanTotalLengthEndPositionWindow) and its
methods (startDocument/nextPosition) are misleading, they need to be improved.
The core search tests and precommit pass.
> Simplify NearSpansUnordered
> ---
>
> Key: LUCENE-7715
> URL: https://issues.apache.org/jira/browse/LUCENE-7715
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: LUCENE-7715.patch
>
>
> {code}
> git diff --stat master...
> .../spans/NearSpansUnordered.java | 211 -
> 1 file changed, 59 insertions(+), 152 deletions(-)
> {code}
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720077#comment-15720077
]
Paul Elschot edited comment on LUCENE-7580 at 2/26/17 8:08 PM:
---
What SpansTreeQuery does not do, and some rough edges:
The SpansDocScorer objects do the match recording and scoring, and there is one
for each Spans.
These SpansDocScorer objects might be merged into their Spans to reduce the
number of objects.
Related: how to deal with the same term occurring in more than one subquery?
See also LUCENE-7398.
Normally the term frequency score has a diminishing contribution for extra
occurrences.
In the patch the slop factors for a term are applied in decreasing order on
these diminished contributions.
This requires sorting of the slop factors.
Sorting the slop factors could be avoided if an actual score of a single term
occurrence were available.
In that case the given slop factor could be used as a weight on that score.
It might be possible to estimate an actual score for a single term occurrence
from the distances to other occurrences of the same term.
Similarly, the decreasing term frequency contributions can be seen as a
proximity weighting for the same term (or subquery):
the closer a term occurs to itself, the smaller its contribution.
This might be refined by using the actual distances to the other term
occurrences (or subquery occurrences)
to provide a weight for each term occurrence. This is unusual because the
weight decreases for smaller distances.
The slop factor from the Similarity may need to be adapted because of the way
it is combined here
with diminishing term contributions.
Another use of a score of each term occurrence could be to use the absolute
term position
to influence the score, possibly in combination with the field length.
There is an assert in TermSpansDocScorer.docScore() that verifies that
the smallest occurring slop factor is at least as large as the non matching
slop factor.
This condition is necessary for consistency.
Instead of using this assert, this condition might be enforced by somehow
automatically determining the non matching slop factor.
This is a prototype. No profiling has been done, it will take more CPU, but I
have no idea how much.
Garbage collection might be affected by the reference cycles between the
SpansDocScorers
and their Spans.
Since this allows weighting of subqueries, it might be possible to implement
synonym scoring
in SpanOrQuery by providing good subweights, and wrapping the whole thing in
SpansTreeQuery.
The only thing that might still be needed then is a SpansDocScorer that applies
the SimScorer.score()
over the total term frequency of the synonyms in a document.
SpansTreeScorer multiplies the slop factor for nested near queries at each
level.
Alternatively a minimum distance could be passed down.
This would need to change recordMatch(float slopFactor) to recordMatch(int
minDistance).
Would minDistance make sense, or is there a better distance?
What is a good way to test whether the score values from SpansTreeQuery
actually improve on
the score values from the current SpanScorer?
There are no tests for SpanFirstQuery/SpanContainingQuery/SpanWithinQuery.
These tests are not there because these queries provide FilterSpans and that is
already supported for SpanNotQuery.
The explain() method is not implemented for SpansTreeQuery.
This should be doable with an explain() method added to SpansTreeScorer to
provide the explanations.
There is no support for PayloadSpanQuery.
PayloadSpanQuery is not in here because it is not in the core module.
I think it can fit in here because PayloadSpanQuery also scores per matching
term occurrence.
Then Spans.doStartCurrentDoc() and Spans.doCurrentSpans() could be removed.
In case this is acceptable as a good way to score Spans:
Spans.width() and Scorer.freq() and SpansDocScorer.docMatchFreq() might be
removed.
Would it make sense to implement child Scorers in the tree of SpansDocScorer
objects?
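The ordering of slop factors described in this comment can be illustrated with a small sketch. It is one possible reading, not the code from the patch: the saturation function freq / (freq + K1), the constant K1 and all names (SlopWeightedTermScore, saturation, score) are assumptions made for the example. The k-th occurrence of a term contributes saturation(k) - saturation(k - 1), which shrinks as k grows, and the largest slop factor is paired with the largest contribution.

```java
import java.util.Arrays;

/** Sketch only: slop factors weighted onto diminishing term frequency contributions. */
final class SlopWeightedTermScore {

    /** Hypothetical saturation constant, chosen only for this example. */
    static final double K1 = 1.2;

    /** Saturating term frequency function; an assumption, not Lucene's SimScorer. */
    static double saturation(int freq) {
        return freq / (freq + K1);
    }

    /**
     * Applies the slop factors in decreasing order to the diminishing
     * contributions of the 1st, 2nd, ... occurrence of a term.
     */
    static double score(double[] slopFactors) {
        double[] sorted = slopFactors.clone();
        Arrays.sort(sorted);                            // ascending order
        double score = 0.0;
        for (int k = 1; k <= sorted.length; k++) {
            double slopFactor = sorted[sorted.length - k];   // k-th largest
            score += slopFactor * (saturation(k) - saturation(k - 1));
        }
        return score;
    }
}
```

With all slop factors equal to 1 the sum telescopes to saturation(n), the plain term frequency score of this model. The sort is what an actual per-occurrence score would make unnecessary, since each slop factor could then be used directly as a weight on its own occurrence.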
was (Author: paul.elsc...@xs4all.nl):
What SpansTreeQuery does not do, and some rough edges:
The SpansDocScorer objects do the match recording and scoring, and there is one
for each Spans.
These SpansDocScorer objects might be merged into their Spans to reduce the
number of objects.
Related: how to deal with the same term occurring in more than one subquery?
See also LUCENE-7398.
Normally the term frequency score has a diminishing contribution for extra
occurrences.
In the patch the slop factors for a term are applied in decreasing order on
these diminished contributions.
This requires sorting of the slop factors.
Sorting the slop factors could be avoided when an actual score of a single term
occurrence was available.
In that case the given slop factor could be used as a weight on that score.
It might be possible to estimate an actual score for a single term
[
https://issues.apache.org/jira/browse/LUCENE-7682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884727#comment-15884727
]
Paul Elschot commented on LUCENE-7682:
--
For queries requiring t1 near t2 with enough slop, t1 t1 t2 matches twice, but
t1 t2 t2 matches only once. This behaviour was introduced with the lazy
iteration, see:
https://issues.apache.org/jira/browse/LUCENE-6537?focusedCommentId=14579537&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14579537
This is also a problem for LUCENE-7580 where matching term occurrences are
scored: there the second occurrence of t2 will not influence the score because
it is never reported as a match.
LUCENE-7398 is probably also of interest here.
To improve highlighting and scoring, we will probably have to rethink how
matches of span queries are reported.
One way could be to report all occurrences in the matching window, and forward
all the sub-spans to after the matching window.
Would that be feasible?
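The asymmetry described above can be reproduced with a simplified model. This is not Lucene's NearSpansOrdered code: it assumes an ordered near query of two terms, one reported match per occurrence of the first term, and a second term that only moves forward. The names LazyNearModel, countLazyOrderedMatches and secondWithinSlop are hypothetical, and the token positions in the usage note assume an analyzer that keeps every word.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: term occurrences are modelled as sorted position arrays. */
final class LazyNearModel {

    /**
     * Counts matches when each occurrence of the first term is paired with
     * the first later occurrence of the second term, if that is within slop.
     */
    static int countLazyOrderedMatches(int[] first, int[] second, int slop) {
        int count = 0;
        int j = 0;
        for (int p1 : first) {
            while (j < second.length && second[j] <= p1) {
                j++;                        // the second term must come after the first
            }
            if (j == second.length) {
                break;
            }
            if (second[j] - p1 - 1 <= slop) {
                count++;                    // later occurrences of the second term are never visited
            }
        }
        return count;
    }

    /** All occurrences of the second term within slop after some occurrence of the first. */
    static List<Integer> secondWithinSlop(int[] first, int[] second, int slop) {
        List<Integer> result = new ArrayList<>();
        for (int p2 : second) {
            for (int p1 : first) {
                if (p1 < p2 && p2 - p1 - 1 <= slop) {
                    result.add(p2);
                    break;
                }
            }
        }
        return result;
    }
}
```

In this model t1 t1 t2 (positions {0, 1} and {2}) counts two matches, while t1 t2 t2 (positions {0} and {1, 2}) counts one. For the text of this issue, with wildlife at position 3 and feed at positions 4 and 7, the lazy count reports one match, whereas secondWithinSlop returns both 4 and 7, which is what the highlighter would need.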
> UnifiedHighlighter not highlighting all terms relevant in SpanNearQuery
> ---
>
> Key: LUCENE-7682
> URL: https://issues.apache.org/jira/browse/LUCENE-7682
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/highlighter
>Reporter: Michael Braun
>
> Original text: "Something for protecting wildlife feed in a feed thing."
> Query is:
>SpanNearQuery with Slop 9 - in order -
> 1. SpanTermQuery(wildlife)
> 2. SpanTermQuery(feed)
> This should highlight both instances of "feed" since they are both within
> slop of 9 of "wildlife". However, only the first instance is highlighted.
> This occurs with unordered SpanNearQuery as well. Test below replicates.
> Affects both the current 6.x line and master.
> Test that fits within TestUnifiedHighlighterMTQ:
> {code}
> public void testOrderedSpanNearQueryWithDupeTerms() throws Exception {
>   RandomIndexWriter iw = new RandomIndexWriter(random(), dir, indexAnalyzer);
>   Document doc = new Document();
>   doc.add(new Field("body",
>       "Something for protecting wildlife feed in a feed thing.", fieldType));
>   doc.add(newTextField("id", "id", Field.Store.YES));
>   iw.addDocument(doc);
>   IndexReader ir = iw.getReader();
>   iw.close();
>   IndexSearcher searcher = newSearcher(ir);
>   UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, indexAnalyzer);
>   int docID = searcher.search(new TermQuery(new Term("id", "id")), 1).scoreDocs[0].doc;
>   SpanTermQuery termOne = new SpanTermQuery(new Term("body", "wildlife"));
>   SpanTermQuery termTwo = new SpanTermQuery(new Term("body", "feed"));
>   SpanNearQuery topQuery = new SpanNearQuery.Builder("body", true)
>       .setSlop(9)
>       .addClause(termOne)
>       .addClause(termTwo)
>       .build();
>   int[] docIds = new int[] {docID};
>   String snippets[] = highlighter.highlightFields(new String[] {"body"},
>       topQuery, docIds, new int[] {2}).get("body");
>   assertEquals(1, snippets.length);
>   // The highlight markup was lost in this archive; the <b> tags below are
>   // assumed from the description (both instances of "feed" highlighted).
>   assertEquals("Something for protecting <b>wildlife</b> <b>feed</b> in a <b>feed</b> thing.", snippets[0]);
>   ir.close();
> }
> {code}
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15720073#comment-15720073
]
Paul Elschot edited comment on LUCENE-7580 at 2/1/17 1:36 PM:
--
Some related issues, thanks for these discussions:
LUCENE-533
LUCENE-2878
LUCENE-2879
LUCENE-2880
LUCENE-6226
LUCENE-6371
LUCENE-6466
LUCENE-7398
Some related web pages:
http://www.gossamer-threads.com/lists/lucene/java-user/33902 March 2006.
http://www.gossamer-threads.com/lists/lucene/java-user/53027 September 2007,
suggests to:
"recurse the spans tree to compose a score based on the type of subqueries
(near, and, or, not) and what matched."
http://www.gossamer-threads.com/lists/lucene/java-user/60103 April 2008.
http://www.flax.co.uk/blog/2016/04/26/can-make-contribution-apache-solr-core-development/
see point 4.
How to use BM25:
http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
was (Author: paul.elsc...@xs4all.nl):
Some related issues, thanks for these discussions:
LUCENE-533
LUCENE-2878
LUCENE-2879
LUCENE-2880
LUCENE-6371
LUCENE-6466
LUCENE-7398
Some related web pages:
http://www.gossamer-threads.com/lists/lucene/java-user/33902 March 2006.
http://www.gossamer-threads.com/lists/lucene/java-user/53027 September 2007,
suggests to:
"recurse the spans tree to compose a score based on the type of subqueries
(near, and, or, not) and what matched."
http://www.gossamer-threads.com/lists/lucene/java-user/60103 April 2008.
http://www.flax.co.uk/blog/2016/04/26/can-make-contribution-apache-solr-core-development/
see point 4.
How to use BM25:
http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: LUCENE-7580.patch, LUCENE-7580.patch, LUCENE-7580.patch,
> LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot closed LUCENE-7633.
Resolution: Won't Fix
Meanwhile there is a plan for 7.0, and that might be an opportunity here.
I'm still closing this because the disadvantages seem to outweigh advantages.
> Rename Terms to IndexedField (was to FieldTerms)
>
>
> Key: LUCENE-7633
> URL: https://issues.apache.org/jira/browse/LUCENE-7633
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>
[
https://issues.apache.org/jira/browse/LUCENE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15844031#comment-15844031
]
Paul Elschot commented on LUCENE-7633:
--
The problem I have with the name Terms is that it is too general; it should be
a little more verbose.
I think IndexedField is a good name for what it provides: term enumeration and
term statistics.
The other renames just follow this.
In doing this I learned quite a bit about flexible indexing, and I really like
the class structure.
With that behind me, I don't really need this renaming any more...
The renames are straightforward, so adopting them elsewhere should be easy.
In case these names are actually preferred, I'd gladly add backward compatible
class names.
That would probably boil down to inserting classes with the current names as
deprecated superclasses.
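A minimal sketch of that backward compatibility idea, with simplified stand-ins rather than the real Lucene classes: the nested classes Terms and IndexedField, the wrapper RenameSketch and the helpers legacySize and fixedSize are all hypothetical and only mirror the names discussed here. The old name is kept as a deprecated superclass, so code written against it still accepts instances of the new class.

```java
/** Sketch only: simplified stand-ins, not the Lucene classes of the same names. */
final class RenameSketch {

    /** Old name, kept as a deprecated superclass of the new one. */
    @Deprecated
    abstract static class Terms {
        abstract long size();
    }

    /** New name; new code refers to this class. */
    @SuppressWarnings("deprecation")
    abstract static class IndexedField extends Terms {
    }

    /** Existing caller written against the old name; it compiles unchanged. */
    @SuppressWarnings("deprecation")
    static long legacySize(Terms terms) {
        return terms.size();
    }

    /** Helper for the example: an IndexedField reporting a fixed size. */
    static IndexedField fixedSize(long n) {
        return new IndexedField() {
            @Override
            long size() {
                return n;
            }
        };
    }
}
```

Here legacySize(fixedSize(3)) returns 3. One consequence of this layout is that the members used by old code have to be declared on the deprecated superclass, so the old class cannot simply be an empty shell.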
> Rename Terms to IndexedField (was to FieldTerms)
>
>
> Key: LUCENE-7633
> URL: https://issues.apache.org/jira/browse/LUCENE-7633
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>
[
https://issues.apache.org/jira/browse/LUCENE-7637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833005#comment-15833005
]
Paul Elschot commented on LUCENE-7637:
--
There is a pull request from me at LUCENE-7624 that landed there because that
issue was mentioned in the commit message.
> TermInSetQuery should require that all terms come from the same field
> -
>
> Key: LUCENE-7637
> URL: https://issues.apache.org/jira/browse/LUCENE-7637
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: master (7.0), 6.5
>
> Attachments: LUCENE-7637.patch
>
>
> Spin-off from LUCENE-7624. Requiring that all terms are in the same field
> would make things simpler and more consistent with other queries. It might
> also make it easier to improve this query in the future since other similar
> queries like AutomatonQuery also work on the per-field basis. The only
> downside is that querying terms across multiple fields would be less
> efficient, but this does not seem to be a common use-case.
[
https://issues.apache.org/jira/browse/LUCENE-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15828371#comment-15828371
]
Paul Elschot commented on LUCENE-7633:
--
The Terms javadocs:
bq. Access to the terms in a specific field. See also Fields
But Terms does more than accessing the terms of a field, it also provides index
statistics for a field.
So in the o.a.l.index package, Field might actually be a better name, but that
would be confusing with o.a.l.document.Field.
Perhaps this would be ideal:
Rename Fields to IndexedFields
Rename Terms to IndexedField
The current patch to rename Terms to FieldTerms changes 860 occurrences in 290
source code files,
so there is no point in keeping this open for a longer period.
I'll close in two weeks or so, unless there are other opinions.
> Rename Terms to FieldTerms
> --
>
> Key: LUCENE-7633
> URL: https://issues.apache.org/jira/browse/LUCENE-7633
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>
Paul Elschot created LUCENE-7633:
Summary: Rename Terms to FieldTerms
Key: LUCENE-7633
URL: https://issues.apache.org/jira/browse/LUCENE-7633
Project: Lucene - Core
Issue Type: Improvement
Reporter: Paul Elschot
[
https://issues.apache.org/jira/browse/LUCENE-7624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823102#comment-15823102
]
Paul Elschot commented on LUCENE-7624:
--
This one is an interesting target for surround, so I had a look.
Allowing more than one field for the terms also has an advantage in that only
one doc id set will be built for all the terms.
As to the code:
There is a small javadoc mistake in line 54 using both "@{" and "{@".
When constructing a Term a deep copy of the given BytesRef is taken, so the
deep copy in line 154 is superfluous.
(The deep copy in line 222 of the termEnum.term() is needed there.)
> Consider moving TermsQuery to core
> --
>
> Key: LUCENE-7624
> URL: https://issues.apache.org/jira/browse/LUCENE-7624
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7624.patch
>
>
> TermsQuery currently sits in the queries module, but it's used in both
> spatial-extras and in facets, and currently is the only reason that the
> facets module has a dependency on queries. I think it's a generally useful
> query, and would fit in perfectly well in core.
> This would also allow us to explore rewriting BooleanQuery to TermsQuery to
> avoid the max-clauses limit
[
https://issues.apache.org/jira/browse/LUCENE-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822372#comment-15822372
]
Paul Elschot commented on LUCENE-7628:
--
Hopefully this is not getting too far off topic.
In case there is interest in taking DocIdSetIterator out of Spans, please let
me know, I have a version that passes the lucene tests.
I don't know whether it speeds up Spans.
The split makes a lot of the Spans code more readable.
> Add a getMatchingChildren() method to DisjunctionScorer
> ---
>
> Key: LUCENE-7628
> URL: https://issues.apache.org/jira/browse/LUCENE-7628
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Alan Woodward
>Assignee: Alan Woodward
>Priority: Minor
> Attachments: LUCENE-7628.patch
>
>
> This one is a bit convoluted, so bear with me...
> The luwak highlighter works by rewriting queries into their Span-equivalents,
> and then running them with a special Collector. At each matching doc, the
> highlighter gathers all the Spans objects positioned on the current doc and
> collects their positions using the SpanCollection API.
> Some queries can't be translated into Spans. For those queries that generate
> Scorers with ChildScorers, like BooleanQuery, we can call .getChildren() on
> the Scorer and see if any of them are SpanScorers, and for those that aren't
> we can call .getChildren() again and recurse down. For each child scorer, we
> check that it's positioned on the current document, so non-matching
> subscorers can be skipped.
> This all works correctly *except* in the case of a DisjunctionScorer where
> one of the children is a two-phase iterator that has matched its
> approximation, but not its refinement query. A SpanScorer in this situation
> will be correctly positioned on the current document, but its Spans will be
> in an undefined state, meaning the highlighter will either collect incorrect
> hits, or it will throw an Exception and prevent hits being collected from
> other subspans.
> We've tried various ways around this (including forking SpanNearQuery and
> adding a bunch of slow position checks to it that are used only by the
> highlighting code), but it turns out that the simplest fix is to add a new
> method to DisjunctionScorer that only returns the currently matching child
> Scorers. It's a bit of a hack, and it won't be used anywhere else, but it's
> a fairly small and contained hack.
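
The difference between children that are positioned on the current document and children that actually match it can be sketched with a toy model. Child is a hypothetical stand-in for a child scorer with a two-phase check, and positionedOn() and matchingOn() are illustrative names, not the methods of the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model for illustration; Child is a hypothetical stand-in, not a Lucene Scorer.
final class MatchingChildrenSketch {

    static final class Child {
        final String name;
        final int doc;            // doc the approximation is positioned on
        final boolean confirmed;  // whether the second-phase check matched on that doc

        Child(String name, int doc, boolean confirmed) {
            this.name = name;
            this.doc = doc;
            this.confirmed = confirmed;
        }
    }

    /** Children positioned on the doc, including unconfirmed approximations. */
    static List<String> positionedOn(List<Child> children, int doc) {
        List<String> names = new ArrayList<>();
        for (Child c : children) {
            if (c.doc == doc) {
                names.add(c.name);
            }
        }
        return names;
    }

    /** Children positioned on the doc whose second phase also matched. */
    static List<String> matchingOn(List<Child> children, int doc) {
        List<String> names = new ArrayList<>();
        for (Child c : children) {
            if (c.doc == doc && c.confirmed) {
                names.add(c.name);
            }
        }
        return names;
    }

    static List<Child> demoChildren() {
        return Arrays.asList(
            new Child("term", 5, true),
            new Child("near", 5, false), // approximation matched, refinement did not
            new Child("other", 7, true));
    }

    public static void main(String[] args) {
        System.out.println(positionedOn(demoChildren(), 5));
        System.out.println(matchingOn(demoChildren(), 5));
    }
}
```

In this model a check on the doc id alone still returns the "near" child, which is the case the description says leaves its Spans in an undefined state.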
[
https://issues.apache.org/jira/browse/LUCENE-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15822361#comment-15822361
]
Paul Elschot commented on LUCENE-7628:
--
To continue about using Spans directly for this
(earlier posted on github, see
https://github.com/flaxsearch/luwak/commit/36c91e8bdd3ab0d07578b76359d1f2a87eb53797)
Other than AND and OR in the same field, what is also still needed is dealing
with multiple fields.
For this we need a Spans that can share its DocIdSetIterator with another Spans.
IIRC that is what LUCENE-2878 is about, so I'm finally beginning to understand
the real point of that issue, and why it is still open.
Meanwhile we had DocIdSetIterator split off from Scorer (for speed).
How about doing something similar for Spans? I think that would leave Spans
pretty close to the Positions of LUCENE-2878. The only change in semantics for
Spans would be that at least one of the Spans that share a DocIdSetIterator
should provide a real position in a document. Maybe we could have something like
MultiFieldSpans for that.
[
https://issues.apache.org/jira/browse/LUCENE-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819228#comment-15819228
]
Paul Elschot commented on LUCENE-7628:
--
I'll answer myself.
With AND and OR available, i.e. the Spans parallels of ConjunctionScorer and
DisjunctionScorer, what would still be needed is the Spans parallel of
ReqExclScorer, for NOT at document level.
Something like ReqExclSpans for a SpanBooleanNotQuery.
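
What NOT at document level would mean can be sketched on plain doc id lists. This is only a toy under the assumption that both inputs are sorted ascending; reqExclDocs() is a hypothetical name and no ReqExclSpans class exists in Lucene.

```java
import java.util.Arrays;

// Toy sketch of document-level NOT on sorted doc id arrays; not Lucene code.
final class ReqExclSketch {

    /** Returns the required docs that do not occur in excluded; both inputs sorted ascending. */
    static int[] reqExclDocs(int[] required, int[] excluded) {
        int[] out = new int[required.length];
        int n = 0;
        int e = 0;
        for (int doc : required) {
            // Advance the exclusion side to the first doc that is not before the required doc.
            while (e < excluded.length && excluded[e] < doc) {
                e++;
            }
            if (e == excluded.length || excluded[e] != doc) {
                out[n++] = doc; // keep: the doc is not excluded
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(reqExclDocs(new int[] {1, 3, 5, 8}, new int[] {3, 4, 8})));
    }
}
```

A Spans version would additionally pass through the positions of the required side for every doc that is kept.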
[
https://issues.apache.org/jira/browse/LUCENE-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819158#comment-15819158
]
Paul Elschot edited comment on LUCENE-7628 at 1/11/17 9:09 PM:
---
And if there was also something like SpanAndMergeQuery that merges the Spans
positions when all of them are present in a document?
This could have an AndMergeSpans as a subclass of DisjunctionSpans above.
was (Author: paul.elsc...@xs4all.nl):
And if there was also something like SpanAndMergeQuery that merges the Spans
positions when all of them are present in a document?
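
The SpanAndMergeQuery idea could behave roughly like the following sketch for a single document. It works on plain position arrays; mergeIfAllPresent() is a hypothetical name, and how real Spans would expose start and end positions is left out.

```java
import java.util.Arrays;

// Toy sketch: merge positions of all subs for one doc, but only when every sub is present.
final class SpanAndMergeSketch {

    /** Returns all positions in ascending order, or an empty array when any sub has none. */
    static int[] mergeIfAllPresent(int[]... positionsPerSub) {
        int total = 0;
        for (int[] positions : positionsPerSub) {
            if (positions.length == 0) {
                return new int[0]; // AND semantics: one missing sub means no match in this doc
            }
            total += positions.length;
        }
        int[] merged = new int[total];
        int at = 0;
        for (int[] positions : positionsPerSub) {
            System.arraycopy(positions, 0, merged, at, positions.length);
            at += positions.length;
        }
        Arrays.sort(merged);
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mergeIfAllPresent(new int[] {1, 7}, new int[] {4})));
    }
}
```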
[
https://issues.apache.org/jira/browse/LUCENE-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819158#comment-15819158
]
Paul Elschot commented on LUCENE-7628:
--
And if there was also something like SpanAndMergeQuery that merges the Spans
positions when all of them are present in a document?
[
https://issues.apache.org/jira/browse/LUCENE-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15819020#comment-15819020
]
Paul Elschot commented on LUCENE-7628:
--
bq. This all works correctly except in the case of a DisjunctionScorer where
one of the children is a two-phase iterator that has matched its approximation,
but not its refinement query. A SpanScorer in this situation will be correctly
positioned on the current document, but its Spans will be in an undefined
state, meaning the highlighter will either collect incorrect hits, or it will
throw an Exception and prevent hits being collected from other subspans.
Does the highlight code call collect() before nextStartPosition()?
That should be avoided, see the javadocs of Spans.
For LUCENE-7580 I had a very similar issue, that one computes scores per
matching (i.e. to be highlighted) term occurrence.
The solution there is to split off DisjunctionSpans from SpanOrQuery and to add
these methods:
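{code}
// Methods added to DisjunctionSpans (fragment, not a complete class).

/** All sub-spans of this disjunction. */
public List<Spans> subSpans() {
  return subSpans;
}

/** Adds the sub-spans at the current doc to the given list. */
public void extractSubSpansAtCurrentDoc(List<Spans> spansList) {
  byPositionQueue.extractSpansList(spansList);
}

/** The sub-spans at the top of the position queue. */
public Spans getCurrentPositionSpans() {
  return byPositionQueue.top();
}
{code}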
With that in place a highlighter could use SpanOrQuery instead of a
BooleanQuery OR, and then the highlighter should be able to do its work.
(Aside: getCurrentPositionSpans() is wrongly named getFirstPositionSpans() at
LUCENE-7580, I'll fix that later).
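
How the three methods fit together can be shown with a self-contained toy. Sub is a hypothetical stand-in for a Spans positioned on one doc at one start position, and a java.util.PriorityQueue replaces the position queue of the patch; the real queue and its extractSpansList() are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of a disjunction over sub-spans; not the DisjunctionSpans of the patch.
final class DisjunctionSpansSketch {

    /** Hypothetical stand-in for a Spans positioned at one doc and one start position. */
    static final class Sub {
        final String name;
        final int doc;
        final int start;

        Sub(String name, int doc, int start) {
            this.name = name;
            this.doc = doc;
            this.start = start;
        }
    }

    private final List<Sub> subs;
    private final PriorityQueue<Sub> byPosition =
        new PriorityQueue<>(Comparator.comparingInt((Sub s) -> s.start));

    DisjunctionSpansSketch(List<Sub> subs, int currentDoc) {
        this.subs = subs;
        for (Sub s : subs) {
            if (s.doc == currentDoc) {
                byPosition.add(s); // only subs on the current doc take part
            }
        }
    }

    List<Sub> subSpans() {
        return subs;
    }

    void extractSubSpansAtCurrentDoc(List<Sub> out) {
        out.addAll(byPosition);
    }

    Sub getCurrentPositionSpans() {
        return byPosition.peek(); // sub with the smallest start position
    }

    static DisjunctionSpansSketch demo() {
        return new DisjunctionSpansSketch(
            Arrays.asList(new Sub("a", 3, 9), new Sub("b", 3, 4), new Sub("c", 8, 1)), 3);
    }

    public static void main(String[] args) {
        List<Sub> atDoc = new ArrayList<>();
        demo().extractSubSpansAtCurrentDoc(atDoc);
        System.out.println(atDoc.size() + " " + demo().getCurrentPositionSpans().name);
    }
}
```

In the toy, a highlighter would see only the subs on the current doc, and the one at the current position first.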
[
https://issues.apache.org/jira/browse/LUCENE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7613:
-
Attachment: LUCENE-7613-spanstree.patch
Patch of 7 Jan 2017, combine with LUCENE-7580.
This issue and LUCENE-7580 both depend on LUCENE-7615, and this patch is to use
that dependency only via LUCENE-7580.
To use this with SpansTreeQuery, apply the patch at LUCENE-7580 first, and then
apply this patch of 7 Jan 2017.
This contains the changes of this issue to surround/query, updates the surround
tests to use SpansTreeQuery.wrapAfterRewrite(), and changes a few expected
document orders in the surround tests.
> Update Surround query language
> --
>
> Key: LUCENE-7613
> URL: https://issues.apache.org/jira/browse/LUCENE-7613
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/queryparser
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: LUCENE-7613-spanstree.patch, LUCENE-7613.patch,
> LUCENE-7613.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15805767#comment-15805767
]
Paul Elschot commented on LUCENE-7580:
--
SpanSynonymQuery is unusual here because it uses a single SpansDocScorer per
segment, independent of the number of synonym terms.
Since the TermSpans for SynonymSpans are Spans without a SpansDocScorer it
makes some sense not to merge Spans and SpansDocScorer later.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: LUCENE-7580.patch, LUCENE-7580.patch, LUCENE-7580.patch,
> LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7580:
-
Attachment: LUCENE-7580.patch
Patch of 6 Jan 2017.
This contains:
The changes in the patch of 30 Dec 2016.
Support for SpanSynonymQuery, see SynonymSpans and SynonymSpansDocScorer.
Class AsSingleTermSpansDocScorer as common superclass for TermSpansDocScorer
and SynonymSpansDocScorer. This is the place where matching and non matching
term occurrences are scored with a SimScorer from Similarity while taking into
account the slop factors.
Method SpansTreeQuery.wrapAfterRewrite() to use SpansTreeQuery.wrap() at the
right moment.
[
https://issues.apache.org/jira/browse/LUCENE-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15802710#comment-15802710
]
Paul Elschot commented on LUCENE-7621:
--
Starting from the number of indexed terms in a doc, when more than one of any
synonym occurs, such extra occurrences would have to be ignored for counting
the number of present clauses.
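
A small sketch of the counting problem, assuming each clause is a group of synonym terms and the required number of clauses is a percentage of the indexed term count of the doc. The names requiredClauses() and presentClauses() are hypothetical and this is not the MinShouldMatchSumScorer logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy sketch of per-document minShouldMatch with synonym clauses; not Lucene code.
final class PerDocMinShouldMatchSketch {

    /** Clauses required for a doc: percent of its indexed term count, rounded up. */
    static long requiredClauses(long indexedTermCount, int percent) {
        return (indexedTermCount * percent + 99) / 100;
    }

    /** Counts clauses with at least one synonym in the doc; extra synonyms are ignored. */
    static int presentClauses(List<Set<String>> synonymClauses, Set<String> docTerms) {
        int present = 0;
        for (Set<String> synonyms : synonymClauses) {
            for (String term : synonyms) {
                if (docTerms.contains(term)) {
                    present++;
                    break; // count the clause once, however many synonyms occur
                }
            }
        }
        return present;
    }

    static int demoPresentClauses() {
        List<Set<String>> clauses = new ArrayList<>();
        clauses.add(new HashSet<>(Arrays.asList("car", "automobile")));
        clauses.add(new HashSet<>(Arrays.asList("red")));
        clauses.add(new HashSet<>(Arrays.asList("fast")));
        Set<String> docTerms = new HashSet<>(Arrays.asList("car", "automobile", "red"));
        return presentClauses(clauses, docTerms);
    }

    public static void main(String[] args) {
        System.out.println(demoPresentClauses() + " present, " + requiredClauses(3, 80) + " required");
    }
}
```

In the demo the doc has three indexed terms, so 80% requires three clauses, while only two clauses are present because "car" and "automobile" belong to the same clause.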
> Per-document minShouldMatch
> ---
>
> Key: LUCENE-7621
> URL: https://issues.apache.org/jira/browse/LUCENE-7621
> Project: Lucene - Core
> Issue Type: New Feature
>Reporter: Adrien Grand
>Priority: Minor
>
> I have seen similar requirements a couple times but could not find any
> related issue so I am opening one now. The idea would be to allow passing a
> {{LongValuesSource}} rather than an integer as the {{minShouldMatch}}
> parameter of {{BooleanQuery}} so that the number of required clauses can
> depend on the document that is being matched. In terms of implementation, it
> looks like it would be straightforward as we would just have to update the
> value of {{minShouldMatch}} in {{MinShouldMatchSumScorer.setDocAndFreq}} and
> things would still be efficient, ie. we would still use advance on the costly
> clauses.
> This kind of feature would allow to run queries that must match eg. 80% of
> the terms that a document contains (by indexing the number of terms in a
> separate field). It would also make it possible for Luwak or ES' percolator
> to index boolean queries that have a value of {{minShouldMatch}} greater than
> 1 more efficiently.
> I do not have any plans to work on it soon but I am curious how much interest
> this feature would drive.
[
https://issues.apache.org/jira/browse/LUCENE-7621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15802649#comment-15802649
]
Paul Elschot commented on LUCENE-7621:
--
Could this also work when the clauses are SynonymQueries?
[
https://issues.apache.org/jira/browse/LUCENE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7613:
-
Attachment: LUCENE-7613.patch
Patch of 5 Jan 2017
This includes:
- the previous patch for using DisjunctionMaxQuery over fields,
- using (Span)SynonymQuery for truncations and prefixes, i.e. groups of terms.
- the patch of LUCENE-7615 for SpanSynonymQuery.
- Further improvements in the surround query code, mostly:
-- Removal of SimpleTerm implementing Comparable as deprecated in 2011.
-- Move all creation of primitive queries (i.e. rewrite results) into
BasicQueryFactory.
-- Use BytesRef for visiting index terms.
-- A Test for TooManyBasicQueries.
[
https://issues.apache.org/jira/browse/LUCENE-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15801532#comment-15801532
]
Paul Elschot commented on LUCENE-7615:
--
In SpanSynonymQuery.java here, this is not used:
{code}
import org.apache.lucene.search.similarities.Similarity;
{code}
> SpanSynonymQuery
>
>
> Key: LUCENE-7615
> URL: https://issues.apache.org/jira/browse/LUCENE-7615
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: LUCENE-7615.patch, LUCENE-7615.patch
>
>
> A SpanQuery that tries to score as SynonymQuery.
[
https://issues.apache.org/jira/browse/LUCENE-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7615:
-
Attachment: LUCENE-7615.patch
Patch of 3 Jan 2017.
Compared to yesterday, this adds getTermContexts() in SynonymWeight for use in
SpanSynonymQuery.
[
https://issues.apache.org/jira/browse/LUCENE-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15795030#comment-15795030
]
Paul Elschot commented on LUCENE-7615:
--
In the patch of 2 Jan 2017 the term contexts are extracted twice, once in
SynonymWeight and once to create the SpanSynonymWeight.
I'll post a fix later.
[
https://issues.apache.org/jira/browse/LUCENE-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15792842#comment-15792842
]
Paul Elschot commented on LUCENE-7615:
--
Some plans for using this:
In LUCENE-7580 to get real synonym scoring behaviour.
In Surround to score truncations.
[
https://issues.apache.org/jira/browse/LUCENE-7615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7615:
-
Attachment: LUCENE-7615.patch
Patch of 2 Jan 2017.
This can be used as proximity subquery whenever SynonymQuery is used now, i.e.
for synonym terms.
I think this improves span scoring somewhat, see the tests and the test output
when uncommenting showQueryResults for the test cases with two terms.
Implementation:
SynonymQuery exposes new methods getField() and SynonymWeight.getSimScorer()
for use in SpanSynonymQuery.
Improved use of o.a.l.index.Terms and TermsEnum in SynonymQuery, at most a
single TermsEnum will be used.
Aside: how about renaming Terms to FieldTerms?
This takes DisjunctionSpans out of SpanOrQuery.
This adds SynonymSpans as (an almost empty) subclass of DisjunctionSpans, for
later further scoring improvement.
PHRASE_TO_SPAN_TERM_POSITIONS_COST is used from SpanTermQuery and made package
private there.
Paul Elschot created LUCENE-7615:
Summary: SpanSynonymQuery
Key: LUCENE-7615
URL: https://issues.apache.org/jira/browse/LUCENE-7615
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Affects Versions: master (7.0)
Reporter: Paul Elschot
Priority: Minor
A SpanQuery that tries to score as SynonymQuery.
[
https://issues.apache.org/jira/browse/LUCENE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15788338#comment-15788338
]
Paul Elschot commented on LUCENE-7613:
--
I would not mind making a similar update for LUCENE-5205, but I am not
familiar enough with the code there.
> Make Surround use DisjunctionMaxQuery for multiple fields
> -
>
> Key: LUCENE-7613
> URL: https://issues.apache.org/jira/browse/LUCENE-7613
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/queryparser
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: LUCENE-7613.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15788332#comment-15788332
]
Paul Elschot edited comment on LUCENE-7613 at 12/30/16 9:01 PM:
Patch of 30 Dec 2016.
This does not affect the syntax of surround, this only adapts the lucene side
to make better use of lucene facilities that are newer than the initial version
of surround.
This uses DisjunctionMaxQuery when a query specifies multiple fields.
The method to convert to a lucene query also allows multiple default fields.
This adds methods to BasicQueryFactory to create a new SpanNearQuery and to
create a new DisjunctionMaxQuery.
This uses SpanBoostQuery when proximity (sub)queries are boosted. There is no
effect on the scores yet, LUCENE-7580 can change that.
This updates the test code to use CheckHits, and one test case is added.
The changes to the test code form the larger part of the patch.
was (Author: paul.elsc...@xs4all.nl):
Patch of 30 Dec 2016.
This does not affect the syntax of surround, this only adapts the lucene side
to make better use of lucene facilities that are newer than the current version.
This uses DisjunctionMaxQuery when a query specifies multiple fields.
The method to convert to a lucene query also allows multiple default fields.
This adds methods to BasicQueryFactory to create a new SpanNearQuery and to
create a new DisjunctionMaxQuery.
This uses SpanBoostQuery when proximity (sub)queries are boosted. There is no
effect on the scores yet, LUCENE-7580 can change that.
This updates the test code to use CheckHits, and one test case is added.
The changes to the test code form the larger part of the patch.
> Make Surround use DisjunctionMaxQuery for multiple fields
> -
>
> Key: LUCENE-7613
> URL: https://issues.apache.org/jira/browse/LUCENE-7613
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/queryparser
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: LUCENE-7613.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7613:
-
Attachment: LUCENE-7613.patch
Patch of 30 Dec 2016.
This does not affect the syntax of surround; it only adapts the lucene side
to make better use of lucene facilities that are newer than the initial version
of surround.
This uses DisjunctionMaxQuery when a query specifies multiple fields.
The method to convert to a lucene query also allows multiple default fields.
This adds methods to BasicQueryFactory to create a new SpanNearQuery and to
create a new DisjunctionMaxQuery.
This uses SpanBoostQuery when proximity (sub)queries are boosted. There is no
effect on the scores yet, LUCENE-7580 can change that.
This updates the test code to use CheckHits, and one test case is added.
The changes to the test code form the larger part of the patch.
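As an illustration of what DisjunctionMaxQuery does with the per-field scores, here is a minimal sketch in plain Java. It does not use the lucene classes; the class and method names are made up for this example, and it assumes non-negative per-field scores. The combined score is the maximum field score plus a tie breaker times the sum of the other field scores.

```java
// Hypothetical stand-alone sketch, not lucene code.
final class DisMaxSketch {
    // Combine per-field scores the way a disjunction-max query does:
    // best field wins, the other fields contribute only via the tie breaker.
    static float disMaxScore(float tieBreaker, float... fieldScores) {
        float max = 0f;
        float sum = 0f;
        for (float s : fieldScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Same term query on two fields, e.g. title and body.
        System.out.println(disMaxScore(0.0f, 1.5f, 0.5f)); // 1.5
        System.out.println(disMaxScore(0.5f, 1.5f, 0.5f)); // 1.75
    }
}
```

With a tie breaker of 0 a document is scored by its best field only, so a term that occurs in several fields is not counted several times.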
> Make Surround use DisjunctionMaxQuery for multiple fields
> -
>
> Key: LUCENE-7613
> URL: https://issues.apache.org/jira/browse/LUCENE-7613
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/queryparser
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: LUCENE-7613.patch
>
>
Paul Elschot created LUCENE-7613:
Summary: Make Surround use DisjunctionMaxQuery for multiple fields
Key: LUCENE-7613
URL: https://issues.apache.org/jira/browse/LUCENE-7613
Project: Lucene - Core
Issue Type: Improvement
Components: core/queryparser
Reporter: Paul Elschot
Priority: Minor
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7580:
-
Attachment: LUCENE-7580.patch
Patch of 29 Dec 2016.
Compared to the previous patch, this adds:
Limiting the max allowed slop to Integer.MAX_VALUE-1 in the SpanNearQuery
constructor and in TestSpanSearchEquivalence. An actual slop of
Integer.MAX_VALUE causes an overflow in distance+1 that is used in
computeSlopFactor. Since the same limitation is already present for indexed
positions, I would not expect this slop factor miscalculation to actually occur.
The negative slops that occur for overlapping spans are changed to 0 before
passing them to computeSlopFactor in NearSpansDocScorer in the patch here.
The non match distance passed to SpanNearQuery in the patch is verified to be
at least the given slop.
A wrapper method SpansTreeScorer.wrap() is added that will wrap the span
(subqueries of a) given query in a SpansTreeQuery. This works for span
subqueries of BooleanQuery, DisjunctionMaxQuery and BoostQuery.
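The overflow and the negative slop case can be reproduced in plain Java. This is only a sketch: it assumes a slop factor of the form 1/(distance+1), and the class, method and constant names are made up here, they are not taken from the patch.

```java
// Hypothetical stand-alone sketch, not the patch code.
final class SlopFactorSketch {
    // Largest slop for which distance + 1 does not overflow an int.
    static final int MAX_SLOP = Integer.MAX_VALUE - 1;

    // Assumed form of the slop factor.
    static float slopFactor(int distance) {
        return 1.0f / (distance + 1);
    }

    // Clamp the raw slop into [0, MAX_SLOP] before computing the factor.
    static float safeSlopFactor(int rawSlop) {
        int slop = Math.max(0, Math.min(rawSlop, MAX_SLOP));
        return slopFactor(slop);
    }

    public static void main(String[] args) {
        // distance + 1 wraps around to Integer.MIN_VALUE: the factor is negative.
        System.out.println(slopFactor(Integer.MAX_VALUE) < 0f); // true
        // A slop of -1 divides by zero; for floats that gives Infinity.
        System.out.println(slopFactor(-1)); // Infinity
        // Clamped versions stay in (0, 1].
        System.out.println(safeSlopFactor(-1)); // 1.0
        System.out.println(safeSlopFactor(Integer.MAX_VALUE) > 0f); // true
    }
}
```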
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: LUCENE-7580.patch, LUCENE-7580.patch, LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785476#comment-15785476
]
Paul Elschot commented on LUCENE-7602:
--
After taking a closer look at the other issues:
How about renaming the ContextMap here to ValueSourceContext or to VSContext?
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch, LUCENE-7602.patch,
> LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785307#comment-15785307
]
Paul Elschot commented on LUCENE-7602:
--
Looking at that code, the copyOf method should be named asCopyOf..., and the
javadocs should be "as a copy of the given Map".
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch, LUCENE-7602.patch,
> LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15785298#comment-15785298
]
Paul Elschot commented on LUCENE-7602:
--
For easy reference, ContextMap now looks like this:
{code}
/** Modifiable {@link Map} of key objects to value objects for {@link
ValueSource}. */
public class ContextMap extends HashMap
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7602:
-
Attachment: LUCENE-7602.patch
Patch of 29 Dec 2016.
This patch tries to separate interface from implementation by making the
constructors in ContextMap and QueryContext protected and using only a few
factory methods for ContextMap.
It would be nicer to use private inheritance from HashMap and
only expose Map, but java does not have private inheritance. The
alternative of object composition gives a detour to AbstractMap, and that only
complicates the code.
So the question is which is preferable: use of a local class ContextMap as in
the patch, or use of Map?
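For comparison, the object composition alternative could look roughly like the sketch below. The class name and the factory method names are hypothetical, and the type parameters are an assumption; this is not the code in the patch. The HashMap stays hidden behind AbstractMap, at the cost of some delegating methods.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of composition instead of extending HashMap.
final class ComposedContextMap extends AbstractMap<Object, Object> {
    private final Map<Object, Object> delegate; // the hidden HashMap

    private ComposedContextMap(Map<Object, Object> delegate) {
        this.delegate = delegate;
    }

    // Factory methods are the only way to create an instance.
    static ComposedContextMap newEmpty() {
        return new ComposedContextMap(new HashMap<>());
    }

    static ComposedContextMap asCopyOf(Map<?, ?> source) {
        return new ComposedContextMap(new HashMap<>(source));
    }

    @Override
    public Object put(Object key, Object value) {
        return delegate.put(key, value); // AbstractMap.put throws by default
    }

    @Override
    public Object get(Object key) {
        return delegate.get(key); // AbstractMap.get would scan entrySet() linearly
    }

    @Override
    public Set<Map.Entry<Object, Object>> entrySet() {
        return delegate.entrySet();
    }
}
```

Every method that should stay as fast as in HashMap needs such a delegating override, which is the detour mentioned above.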
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch, LUCENE-7602.patch,
> LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779059#comment-15779059
]
Paul Elschot edited comment on LUCENE-7602 at 12/26/16 10:06 PM:
-
bq. can't we just use Map ?
ContextMap implements that interface.
Since this is widely used, I prefer to use a lucene class (ContextMap) over an
interface that is defined in the java language (Map), because it
allows a change in a single place.
We could still separate the implementation from the interface, but that would
be more than fixing the compiler warnings here.
was (Author: paul.elsc...@xs4all.nl):
bq. can't we just use Map ?
ContextMap implements that interface.
Since this is widely used, I prefer use a lucene class class (ContextMap) over
an interface that is defined in the java language (Map), because
it allows a change in a single place.
We could still separate the implementation from the interface, but that would
be more than fixing the compiler warnings here.
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779059#comment-15779059
]
Paul Elschot commented on LUCENE-7602:
--
bq. can't we just use Map ?
ContextMap implements that interface.
Since this is widely used, I prefer to use a lucene class (ContextMap)
over an interface that is defined in the java language (Map),
because it allows a change in a single place.
We could still separate the implementation from the interface, but that would
be more than fixing the compiler warnings here.
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779059#comment-15779059
]
Paul Elschot edited comment on LUCENE-7602 at 12/26/16 10:02 PM:
-
bq. can't we just use Map ?
ContextMap implements that interface.
Since this is widely used, I prefer to use a lucene class (ContextMap) over
an interface that is defined in the java language (Map), because
it allows a change in a single place.
We could still separate the implementation from the interface, but that would
be more than fixing the compiler warnings here.
was (Author: paul.elsc...@xs4all.nl):
bq. can't we just use Map ?
ContextMap implements that interface.
Since this is widely used, I prefer not use a lucene class class (ContextMap)
over an interface that is defined in the java language (Map),
because it allows a change in a single place.
We could still separate the implementation from the interface, but that would
be more than fixing the compiler warnings here.
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7602:
-
Attachment: LUCENE-7602.patch
Patch of 26 Dec 2016.
Mostly as discussed above.
ContextMap extends HashMap. I tried implementing AbstractMap, but that ends up
in a detour to a HashMap anyway, so I left it at direct extension.
Is there a way to quickly check for unused imports at top level?
I used ant precommit for that, but it is quite slow because it stops after the
first module with an error, and quite a few modules are involved here.
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778692#comment-15778692
]
Paul Elschot commented on LUCENE-7602:
--
Meanwhile I tried implementing solr's QueryContext by extending ContextMap, so
that fcontext.qcontext no longer needs to be wrapped in a SolrContextMap, see
FuncSlotAcc.setNextReader above.
The solr tests passed, so I think there is no more need for IdentityHashMap, in
both lucene and solr.
Shall I post a complete patch against master, or just the changes since
yesterday?
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15778681#comment-15778681
]
Paul Elschot commented on LUCENE-7602:
--
Do you mean like this:
{code}
public class ContextMap extends AbstractMap
{ ... }
{code}
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15776698#comment-15776698
]
Paul Elschot edited comment on LUCENE-7602 at 12/25/16 4:26 PM:
With both patches applied, ant clean compile-test gives a single warning.
Perhaps ContextMap should be a separate issue.
For lucene I think it is worthwhile to use a lucene class without generics as
the context map instead of the Map variations that are present in the
current code.
For solr I have no idea. I would hope that using IdentityHashMap is no longer
needed because queries are immutable.
was (Author: paul.elsc...@xs4all.nl):
With both patches applied, ant clean compile-test gives a single warning.
Perhaps ContextMap should be separate issue.
For lucene I think it is worthwhile to use a lucene class without generics as
the context map instead of the Map variations that are present in the
current code.
For solr I have no idea. I would hope that by using IdentityHashMap is no more
needed because queries are immutable.
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15776698#comment-15776698
]
Paul Elschot commented on LUCENE-7602:
--
With both patches applied, ant clean compile-test gives a single warning.
Perhaps ContextMap should be a separate issue.
For lucene I think it is worthwhile to use a lucene class without generics as
the context map instead of the Map variations that are present in the
current code.
For solr I have no idea. I would hope that using IdentityHashMap is no longer
needed because queries are immutable.
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15776689#comment-15776689
]
Paul Elschot commented on LUCENE-7602:
--
The lucene patch replaces the use of IdentityHashMap in ValueSource.java by a
ContextMap, because I think it is wrong to use an IdentityHashMap there.
Strings are used as keys, and Strings are not unique objects in java.
However, this conflicts somewhat with the way ContextMap is used in solr. There
QueryContext needs an IdentityHashMap, but the solr tests pass even with the
patch applied.
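The difference is easy to show in plain Java. In this sketch the class and method names are made up for the example and the key "searcher" is only used as a sample. Two equal Strings that are different objects find each other in a HashMap, but not in an IdentityHashMap.

```java
import java.util.HashMap;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch.
final class IdentityKeySketch {
    // Store a value under one String and look it up with an equal,
    // but distinct, String object.
    static Object lookupWithEqualKey(Map<Object, Object> context) {
        String stored = new String("searcher"); // distinct object
        String probe = new String("searcher");  // equal content, other identity
        context.put(stored, "value");
        return context.get(probe);
    }

    public static void main(String[] args) {
        System.out.println(lookupWithEqualKey(new HashMap<>()));         // value
        System.out.println(lookupWithEqualKey(new IdentityHashMap<>())); // null
    }
}
```

With String literals the lookup in the IdentityHashMap often works by accident, because literals are interned; that may be why such a problem stays hidden in tests.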
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15776683#comment-15776683
]
Paul Elschot commented on LUCENE-7602:
--
The interesting pieces of code for ContextMap:
The class itself:
{code}
public class ContextMap extends HashMap {
  public ContextMap() { // empty
  }
  public ContextMap(ContextMap source) { // new copy
    super(source);
  }
  public ContextMap(IdentityHashMap source) { // for solr
    super(source);
  }
}
{code}
the way it is used in solr, in SlotAcc.java:
{code}
class SolrContextMap extends ContextMap {
  SolrContextMap(org.apache.solr.search.QueryContext context) {
    super((java.util.IdentityHashMap)context); // CHECKME: copy ok?
  }
}

// TODO: we should really have a decoupled value provider...
// This would enhance reuse and also prevent multiple lookups of same value across diff stats
abstract class FuncSlotAcc extends SlotAcc {
  protected final ValueSource valueSource;
  protected FunctionValues values;

  public FuncSlotAcc(ValueSource values, FacetContext fcontext, int numSlots) {
    super(fcontext);
    this.valueSource = values;
  }

  @Override
  public void setNextReader(LeafReaderContext readerContext) throws IOException {
    values = valueSource.getValues(new SolrContextMap(fcontext.qcontext), readerContext);
  }
}
{code}
and some unchanged code from solr's QueryContext.java:
{code}
/*
 * Bridge between old style context and a real class.
 * This is currently slightly more heavy weight than necessary because of the
 * need to inherit from IdentityHashMap rather than
 * instantiate it on demand (and the need to put "searcher" in the map)
 * @lucene.experimental
 */
public class QueryContext extends IdentityHashMap implements Closeable {
  ...
}
{code}
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch,
> LUCENE-7602-ContextMap-solr.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7602:
-
Attachment: LUCENE-7602-ContextMap-lucene.patch
This ContextMap lucene patch includes the previous patch of 24 December 2016.
This also changes all use of Map for function queries to ContextMap, which
changes the public API, although not much.
All lucene tests pass.
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602-ContextMap-lucene.patch, LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15775468#comment-15775468
]
Paul Elschot commented on LUCENE-7602:
--
The patch also contains a few similar fixes for test code, but it is too much
work to be complete for the test code, so I left it at that.
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602.patch
>
>
[
https://issues.apache.org/jira/browse/LUCENE-7602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7602:
-
Attachment: LUCENE-7602.patch
Patch of 24 Dec 2016.
This consists of:
- replacing raw Map by a parameterized Map in a few submodules,
- 7 @SuppressWarnings for unchecked casts (see also below),
- a split off of Const*DocValues classes into their own source files,
- one removal of close() on an AutoCloseable,
- a few minor generics improvements.
I have a question on these cases; there are 7 of them:
{code}
@SuppressWarnings("unchecked")
public void createWeight(Map context, IndexSearcher searcher) throws IOException {
  // FIXME: how to use a helper method here to avoid the unchecked cast? See
  // https://docs.oracle.com/javase/tutorial/java/generics/capture.html
  ((Map)context).put("searcher",searcher);
{code}
I tried to implement such a helper function, but I could not get it to compile
cleanly.
Any suggestions for this?
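To make the question concrete, here is a small stand-alone sketch. It assumes that the context parameter is declared with wildcards, as Map<?,?>, and that the cast goes to Map<Object,Object>; the class and method names are made up, and Object stands in for IndexSearcher.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch, not the patch code.
final class ContextPutSketch {
    // Assumed shape of the 7 cases: a wildcard typed context needs an
    // unchecked cast, because the compiler cannot prove that the map
    // accepts a String key.
    @SuppressWarnings("unchecked")
    static void putWithCast(Map<?, ?> context, Object searcher) {
        ((Map<Object, Object>) context).put("searcher", searcher);
    }

    // With Object as key and value type there is no cast and no warning.
    static void putTyped(Map<Object, Object> context, Object searcher) {
        context.put("searcher", searcher);
    }

    public static void main(String[] args) {
        Map<Object, Object> context = new HashMap<>();
        putWithCast(context, "s1");
        putTyped(context, "s2");
        System.out.println(context.get("searcher")); // s2
    }
}
```

A capture helper of the form <K,V> void helper(Map<K,V> map) does not seem to help here: it gives the unknown types a name, but a String is still not known to be a K, so the put() needs the same cast inside the helper.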
> Fix compiler warnings for ant clean compile
> ---
>
> Key: LUCENE-7602
> URL: https://issues.apache.org/jira/browse/LUCENE-7602
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Paul Elschot
>Priority: Minor
> Labels: build
> Fix For: trunk
>
> Attachments: LUCENE-7602.patch
>
>
Paul Elschot created LUCENE-7602:
Summary: Fix compiler warnings for ant clean compile
Key: LUCENE-7602
URL: https://issues.apache.org/jira/browse/LUCENE-7602
Project: Lucene - Core
Issue Type: Improvement
Reporter: Paul Elschot
Priority: Minor
Fix For: trunk
[
https://issues.apache.org/jira/browse/LUCENE-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15774874#comment-15774874
]
Paul Elschot commented on LUCENE-7055:
--
bq. Intersecting such queries with a selective query is very inefficient since
these queries build a doc id set of matching documents for the entire index.
Just thinking out loud: how about also using a lazy doc id set builder that
works on the go?
This would use one extra bit per document to indicate whether the document is
already evaluated.
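A rough sketch of that idea in plain Java follows. It assumes that the costly query can answer a membership test for a single document, which may not hold for all of the queries mentioned in the issue. The class name and the IntPredicate stand-in for the costly check are made up for this example.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;

// Hypothetical stand-alone sketch, not lucene code.
final class LazyDocIdSetSketch {
    private final BitSet evaluated; // the extra bit per document
    private final BitSet matching;  // result bit, valid only when evaluated
    private final IntPredicate costlyMatch;
    private int evaluations = 0;

    LazyDocIdSetSketch(int maxDoc, IntPredicate costlyMatch) {
        this.evaluated = new BitSet(maxDoc);
        this.matching = new BitSet(maxDoc);
        this.costlyMatch = costlyMatch;
    }

    // Evaluate the costly check at most once per document, on demand.
    boolean matches(int doc) {
        if (!evaluated.get(doc)) {
            evaluated.set(doc);
            evaluations++;
            if (costlyMatch.test(doc)) {
                matching.set(doc);
            }
        }
        return matching.get(doc);
    }

    int evaluations() {
        return evaluations;
    }

    public static void main(String[] args) {
        LazyDocIdSetSketch lazy = new LazyDocIdSetSketch(100, doc -> doc % 3 == 0);
        // Only the documents proposed by the selective query are checked.
        System.out.println(lazy.matches(6)); // true
        System.out.println(lazy.matches(6)); // true, from the cached bit
        System.out.println(lazy.matches(7)); // false
        System.out.println(lazy.evaluations()); // 2
    }
}
```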
> Better execution path for costly queries
>
>
> Key: LUCENE-7055
> URL: https://issues.apache.org/jira/browse/LUCENE-7055
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
> Attachments: LUCENE-7055.patch
>
>
> In Lucene 5.0, we improved the execution path for queries that run costly
> operations on a per-document basis, like phrase queries or doc values
> queries. But we have another class of costly queries, that return fine
> iterators, but these iterators are very expensive to build. This is typically
> the case for queries that leverage DocIdSetBuilder, like TermsQuery,
> multi-term queries or the new point queries. Intersecting such queries with a
> selective query is very inefficient since these queries build a doc id set of
> matching documents for the entire index.
> Is there something we could do to improve the execution path for these
> queries?
> One idea that comes to mind is that most of these queries could also run on
> doc values, so maybe we could come up with something that would help decide
> how to run a query based on other parts of the query? (Just thinking out
> loud, other ideas are very welcome)
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764653#comment-15764653
]
Paul Elschot edited comment on LUCENE-7580 at 12/20/16 4:53 PM:
I have started using this in the tests for the surround query language, I'll
open an issue for that later.
This found a bug in ConjunctionNearSpansDocScorer.recordMatch().
A -1 slop can occur when the same term is used twice in a SpanNearQuery, and
this causes a division by zero when computing the slop factor from
recordMatch().
This can be easily avoided by using 0 slop in such cases.
was (Author: paul.elsc...@xs4all.nl):
I have started with on using this in the tests for the surround query language,
I'll open an issue for that later.
This found a bug in ConjunctionNearSpansDocScorer.recordMatch().
A -1 slop can occur when the same term is used twice in a SpanNearQuery, and
this causes a division by zero in computing the slop factor in from
recordMatch().
This can be easily avoided by using 0 slop in such cases.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: LUCENE-7580.patch, LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15764653#comment-15764653
]
Paul Elschot commented on LUCENE-7580:
--
I have started using this in the tests for the surround query language,
I'll open an issue for that later.
This found a bug in ConjunctionNearSpansDocScorer.recordMatch().
A -1 slop can occur when the same term is used twice in a SpanNearQuery, and
this causes a division by zero when computing the slop factor from
recordMatch().
This can be easily avoided by using 0 slop in such cases.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: LUCENE-7580.patch, LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15740406#comment-15740406
]
Paul Elschot commented on LUCENE-7580:
--
Some scientific articles on this subject:
Metzler, Donald, and W. Bruce Croft.
"A Markov random field model for term dependencies."
Proceedings of the 28th annual international ACM SIGIR conference
on Research and development in information retrieval. ACM, 2005.
In section 2.3 they use terms, ordered phrases and unordered phrases.
The ranking function is a weighted linear combination of these.
The optimal weights are about 80/10/10 for simple terms, unordered, and ordered.
Here this led to the use of a weighting factor for non matching occurrences.
They also found that the minimum distance is the best indicator of relevance.
Bendersky, Michael, and W. Bruce Croft.
"Modeling Higher-Order Term Dependencies in Information Retrieval using Query
Hypergraphs"
SIGIR'12.
The concepts there can be nested, like span queries.
The approach there is much more general. For example:
- Table 2 shows the use of the frequency of a concept in various collections
to determine its weight.
- In section 2.4.2 there is an indication that the slop factor needs attention:
"... the existing term proximity measures usually capture close, sentence-level,
co-occurrences of the query terms ... The dependency range is much longer for
concept dependencies."
Blanco, Roi, and Paolo Boldi.
"Extending BM25 with multiple query operators."
Proceedings of the 35th international ACM SIGIR conference
on Research and development in information retrieval. ACM, 2012.
This scores regions with BM25F.
> Spans tree scoring
> --
>
> Key: LUCENE-7580
> URL: https://issues.apache.org/jira/browse/LUCENE-7580
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Affects Versions: master (7.0)
>Reporter: Paul Elschot
>Priority: Minor
> Fix For: 6.x
>
> Attachments: LUCENE-7580.patch, LUCENE-7580.patch
>
>
> Recurse the spans tree to compose a score based on the type of subqueries and
> what matched
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15740402#comment-15740402
]
Paul Elschot commented on LUCENE-7580:
--
This adds a nonMatchSlop attribute to SpanNearQuery,
and drops the nonMatchSlopFactor argument from SpansTreeQuery.
nonMatchSlop is the distance that determines the slop factor to be used
for non matching occurrences of a SpanNearQuery.
Smaller values for this distance will increase the score contribution of non
matching occurrences via SimScorer.computeSlopFactor().
But smaller values for this distance, i.e. a higher score contribution of non
matching occurrences,
may lead to a scoring inconsistency between two span near queries that differ
only in the allowed slop.
For example, consider query A with a smaller allowed slop and query B with a
larger one.
For query B there can be more matches, and these should increase the score of B
when compared to the score of A.
So for each extra match at B, the non matching score for query A should be
lower than the matching score for query B.
This may not be the case when the non matching score contribution is too high.
To have consistent scoring between two such queries,
choose a non matching slop that is larger than the largest allowed match slop,
and provide that non matching slop to both queries.
In case this consistency is not needed, nonMatchSlop can be chosen to be
somewhat larger than the maximum allowed match slop.
This nonMatchSlop is used in SpansTreeWeight to compute a minimal nested slop
factor from the maximum possible slops that can occur
in a SpanQuery for the nested SpanNearQueries and for nested SpanOrQueries with
distance.
Finally, this minimal nested slop factor is used as the weight for scoring non
matching terms.
The default nonMatchSlop for SpanNearQuery is large, Integer.MAX_VALUE/2.
Therefore by default non matching occurrences have no real score contribution.
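The effect of nonMatchSlop on the non matching weight can be sketched as
follows. This is only an illustration, not code from the patch: it assumes a
slop factor of the form 1/(distance+1), which is what sloppyFreq() computes in
Lucene's ClassicSimilarity and BM25Similarity, and the class and method names
are made up for the example.

```java
// Sketch only: SlopFactorSketch and its methods are hypothetical names,
// they are not part of the patch.
class SlopFactorSketch {

  // Assumed slop factor, the same form as sloppyFreq() in
  // ClassicSimilarity and BM25Similarity: 1 / (distance + 1).
  static float slopFactor(int distance) {
    return 1.0f / (distance + 1);
  }

  // Consistency condition described above: the weight for non matching
  // occurrences must stay below the weight of any match that the largest
  // allowed slop can still produce.
  static boolean consistent(int nonMatchSlop, int largestAllowedSlop) {
    return slopFactor(nonMatchSlop) < slopFactor(largestAllowedSlop);
  }

  public static void main(String[] args) {
    int slopA = 2; // query A, smaller allowed slop
    int slopB = 5; // query B, larger allowed slop
    int largest = Math.max(slopA, slopB);
    System.out.println("nonMatchSlop 6 consistent: " + consistent(6, largest));
    System.out.println("nonMatchSlop 3 consistent: " + consistent(3, largest));
    // With the default the non matching weight is close to zero.
    System.out.println("default weight: " + slopFactor(Integer.MAX_VALUE / 2));
  }
}
```

Under this assumption a nonMatchSlop of 6 keeps both queries consistent, while
a nonMatchSlop of 3 would let a non matching occurrence for query A weigh more
than a match at distance 5 for query B.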
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15740402#comment-15740402
]
Paul Elschot edited comment on LUCENE-7580 at 12/11/16 9:44 PM:
Compared to the previous patch, this adds a nonMatchSlop attribute to
SpanNearQuery,
and drops the nonMatchSlopFactor argument from SpansTreeQuery.
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7580:
-
Attachment: LUCENE-7580.patch
Patch of 11 Dec 2016.
Adds automatic determination of a weight for non matching terms.
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720077#comment-15720077
]
Paul Elschot commented on LUCENE-7580:
--
What SpansTreeQuery does not do, and some rough edges:
The SpansDocScorer objects do the match recording and scoring, and there is one
for each Spans.
These SpansDocScorer objects might be merged into their Spans to reduce the
number of objects.
Related: how to deal with the same term occurring in more than one subquery?
See also LUCENE-7398.
Normally the term frequency score has a diminishing contribution for extra
occurrences.
In the patch the slop factors for a term are applied in decreasing order on
these diminished contributions.
This requires sorting of the slop factors.
Sorting the slop factors could be avoided if an actual score of a single term
occurrence were available.
In that case the given slop factor could be used as a weight on that score.
It might be possible to estimate an actual score for a single term occurrence
from the distances to other occurrences of the same term.
Similarly, the decreasing term frequency contributions can be seen as a
proximity weighting for the same term (or subquery):
the closer a term occurs to itself, the smaller its contribution.
This might be refined by using the actual distances to the other term
occurrences (or subquery occurrences)
to provide a weight for each term occurrence. This is unusual because the
weight decreases for smaller distances.
The slop factor from the Similarity may need to be adapted because of the way
it is combined here
with diminishing term contributions.
Another use of a score of each term occurrence could be to use the absolute
term position
to influence the score, possibly in combination with the field length.
There is an assert in TermSpansDocScorer.docScore() that verifies that
the smallest occurring slop factor is at least as large as the non matching
slop factor.
This condition is necessary for consistency.
Instead of using this assert, this condition might be enforced by somehow
automatically determining the non matching slop factor.
This is a prototype. No profiling has been done; it will take more CPU, but I
have no idea how much.
The sorting of the slop factors per matching term occurrence has roughly the
same
time complexity as the position priority queues used for SpanOr and SpanNear.
Garbage collection might be affected by the reference cycles between the
SpansDocScorers
and their Spans.
Since this allows weighting of subqueries, it might be possible to implement
synonym scoring
in SpanOrQuery by providing good subweights, and wrapping the whole thing in
SpansTreeQuery.
The only thing that might still be needed then is a SpansDocScorer that applies
the SimScorer.score()
over the total term frequency of the synonyms in a document.
SpansTreeScorer multiplies the slop factor for nested near queries at each
level.
Alternatively a minimum distance could be passed down.
This would need to change recordMatch(float slopFactor) to recordMatch(int
minDistance).
Would minDistance make sense, or is there a better distance?
What is a good way to test whether the score values from SpansTreeQuery
actually improve on
the score values from the current SpanScorer?
There are no tests for SpanFirstQuery/SpanContainingQuery/SpanWithinQuery.
These tests are not there because these queries provide FilterSpans and that is
already supported for SpanNotQuery.
The explain() method is not implemented for SpansTreeQuery.
This should be doable with an explain() method added to SpansTreeScorer to
provide the explanations.
There is no support for PayloadSpanQuery.
PayloadSpanQuery is not in here because it is not in the core module.
I think it can fit in here because PayloadSpanQuery also scores per matching
term occurrence.
Then Spans.doStartCurrentDoc() and Spans.doCurrentSpans() could be removed.
In case this is acceptable as a good way to score Spans:
Spans.width() and Scorer.freq() and SpansDocScorer.docMatchFreq() might be
removed.
Would it make sense to implement child Scorers in the tree of SpansDocScorer
objects?
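The two ways of passing proximity down a nested span near query that are
mentioned above can be compared in a small sketch. This is one possible reading
of the description, not the patch itself: the slop factor is assumed to be
1/(distance+1), and the class and method names are hypothetical.

```java
// Sketch only: NestedSlopSketch is a hypothetical name, and the slop factor
// form 1 / (distance + 1) is an assumption.
class NestedSlopSketch {

  static double slopFactor(int distance) {
    return 1.0 / (distance + 1);
  }

  // Current approach in the description: start with 1 at the top level and
  // multiply in the slop factor of the match at each span near nesting level.
  // slopsAtMatch holds the actual slop at each level, outermost first.
  static double multipliedFactor(int[] slopsAtMatch) {
    double factor = 1.0;
    for (int slop : slopsAtMatch) {
      factor *= slopFactor(slop);
    }
    return factor;
  }

  // Alternative raised as a question: pass down a minimum distance and
  // convert it to a slop factor only once, at the term.
  static double minDistanceFactor(int[] slopsAtMatch) {
    if (slopsAtMatch.length == 0) {
      return 1.0; // term matched at top level
    }
    int minDistance = Integer.MAX_VALUE;
    for (int slop : slopsAtMatch) {
      minDistance = Math.min(minDistance, slop);
    }
    return slopFactor(minDistance);
  }

  public static void main(String[] args) {
    int[] slops = {1, 3}; // outer match with slop 1, inner match with slop 3
    System.out.println("multiplied:  " + multipliedFactor(slops));
    System.out.println("minDistance: " + minDistanceFactor(slops));
  }
}
```

With multiplication a deeply nested term is discounted at every level; with a
minimum distance only the tightest enclosing match counts, so the two choices
can rank documents differently.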
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720073#comment-15720073
]
Paul Elschot commented on LUCENE-7580:
--
Some related issues, thanks for these discussions:
LUCENE-533
LUCENE-2878
LUCENE-2879
LUCENE-2880
LUCENE-6371
LUCENE-6466
LUCENE-7398
Some related web pages:
http://www.gossamer-threads.com/lists/lucene/java-user/33902 March 2006.
http://www.gossamer-threads.com/lists/lucene/java-user/53027 September 2007,
suggests to:
"recurse the spans tree to compose a score based on the type of subqueries
(near, and, or, not) and what matched."
http://www.gossamer-threads.com/lists/lucene/java-user/60103 April 2008.
http://www.flax.co.uk/blog/2016/04/26/can-make-contribution-apache-solr-core-development/
see point 4.
How to use BM25:
http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720070#comment-15720070
]
Paul Elschot commented on LUCENE-7580:
--
SpansTreeQuery is implemented as a wrapper in order to change the existing code
as little as possible.
But it was necessary to take DisjunctionSpans out of SpanOrQuery.
In DisjunctionSpans there are only additions for inspection at a match,
otherwise it is the same as in the current SpanOrQuery.
Changes to the current code are mostly additions to allow inspection of matches:
- For the ordered/unordered nearspans a common superclass ConjunctionNearSpans
is added that provides the SimScorer and a currentSlop() method.
- DisjunctionSpans allows inspection of all subspans, of the subspans at the
current doc, and of the subspans with the first and second positions.
SpanPositionQueue also has some additions for this.
- In the TermSpans constructor the currently unused SimScorer argument is saved
so it can be used to score() the various term frequencies.
- In Spans a reference to a SpansDocScorer object is added to allow direct
access by disjunctions.
The only existing state that is changed is the use of needsScores (instead of
the current false)
for weights of subqueries of SpanOrQuery and SpanNearQuery and for the weight
of the included subquery of SpanNotQuery.
All core tests pass with the patch applied on the master branch. Ant precommit
also passes.
There is a correction to the javadocs of Similarity.SimScorer on the use of
float for term frequencies.
The patch also adds a constructor for SpanOrQuery with an extra parameter
maxDistance.
When wrapped in a SpansTreeQuery, this SpanOrQuery will provide a slop factor
at each match
that is determined by the minimum distance between any two subspans where
possible,
and this distance is capped at the given maxDistance.
The class DisjunctionNearSpans and its SpansDocScorer implement this.
All score calculations are done with doubles.
Most of the additions have public/protected visibility in order to allow easy
extension.
In case there is interest in back porting this, a patch for branch_6x can be
made available.
The tests on branch_6x disable the coordination in BooleanQuery and they only
use the BM25 similarity.
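The distance used for a SpanOrQuery with maxDistance can be illustrated with a
small sketch. It is a simplification under stated assumptions, not the
DisjunctionNearSpans code: the distance between two subspans is taken here as
the difference of their start positions, the slop factor is assumed to be
1/(distance+1), and all names are hypothetical.

```java
import java.util.Arrays;

// Sketch only: DisjunctionNearSketch is a hypothetical name. The distance
// between subspans is simplified to the difference of start positions.
class DisjunctionNearSketch {

  // Minimum distance between any two subspans at the current match,
  // capped at maxDistance. With fewer than two subspans there is no pair,
  // so maxDistance is returned.
  static int cappedMinDistance(int[] subSpanStarts, int maxDistance) {
    int[] sorted = subSpanStarts.clone();
    Arrays.sort(sorted);
    int min = maxDistance;
    for (int i = 1; i < sorted.length; i++) {
      // After sorting, the closest pair is always a pair of neighbours.
      min = Math.min(min, sorted[i] - sorted[i - 1]);
    }
    return min;
  }

  // Assumed slop factor form.
  static double slopFactor(int distance) {
    return 1.0 / (distance + 1);
  }

  public static void main(String[] args) {
    int d = cappedMinDistance(new int[] {10, 3, 12}, 5);
    System.out.println("distance " + d + ", slop factor " + slopFactor(d));
  }
}
```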
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720066#comment-15720066
]
Paul Elschot commented on LUCENE-7580:
--
In the patch, SpansTreeQuery is a wrapper for SpanQuery that uses basically the
same scoring as the scoring for other queries.
When all term occurrences match at top level or at 0 distance the score is the
same as
the score for a boolean OR over the terms, independently of the Similarity that
is used.
SpansTreeScorer scores each query term matching occurrence, and it applies
discounts for non matching terms
and for distance matches. It also uses weights of subqueries.
The matching occurrences are recorded per document in the spans tree at each
top level match of a document.
For each match SpansTreeScorer descends the tree down to the leaf level of the
terms of each match.
SpansDocScorer objects are used as the tree nodes; there is one for each
supported Spans.
Each matching term occurrence is recorded with a slop factor.
At the top level this slop factor is normally 1, and for each span near nesting
level
the slop factor at the match is multiplied into this.
The term frequency scoring from the Similarity is used per matching term
occurrence,
and these term occurrence scores are weighted by the slop factors sorted in
decreasing order.
The purpose of using the given slop factors in decreasing order is to provide
scoring consistency
between span near queries that only differ in the maximum allowed slop.
This consistency requires that an extra match with a lower slop increases the
score of the document.
I would expect scoring to be consistent this way, but I'm not 100% sure.
The non matching term occurrences get a score that is the difference of
the normal document term frequency score and the term frequency score for the
matching terms.
This non matching score is weighted by the slop factor of a non matching
distance.
The non matching distance is a parameter that must be provided.
This non matching distance can for example be chosen as a little larger
than the largest distance used in the span near queries that are wrapped.
SpansTreeQuery is implemented for any combination of
SpanNearQuery, SpanOrQuery, SpanTermQuery, SpanBoostQuery,
SpanNotQuery, SpanFirstQuery, SpanContainingQuery and SpanWithinQuery.
See the javadocs and the test code on how to use SpansTreeQuery.
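The scoring of a single term as described above can be sketched as follows.
This is an illustration of the idea only, not the code of the patch: tfScore()
is a stand-in for the term frequency score of the Similarity (a saturating,
BM25-like curve without length normalization), and the class and method names
are hypothetical.

```java
import java.util.Arrays;

// Sketch only: SpansTreeScoreSketch is a hypothetical name, and tfScore() is
// a stand-in for the Similarity, not Lucene code.
class SpansTreeScoreSketch {

  // Saturating term frequency score, BM25-like with k1 = 1.2.
  static double tfScore(double freq) {
    return freq * 2.2 / (freq + 1.2);
  }

  // matchSlopFactors: one slop factor per matching occurrence of the term.
  // docTermFreq: total occurrences of the term in the document, at least
  // as large as the number of matching occurrences.
  static double termScore(double[] matchSlopFactors, int docTermFreq,
                          double nonMatchSlopFactor) {
    double[] sorted = matchSlopFactors.clone();
    Arrays.sort(sorted); // ascending, so read from the end for decreasing order
    int matched = sorted.length;
    double score = 0.0;
    for (int i = 0; i < matched; i++) {
      double slopFactor = sorted[matched - 1 - i];
      // Diminishing contribution of the (i+1)-th occurrence.
      double increment = tfScore(i + 1) - tfScore(i);
      score += slopFactor * increment;
    }
    // Non matching occurrences: the rest of the normal document score,
    // weighted by the slop factor of the non matching distance.
    double nonMatching = tfScore(docTermFreq) - tfScore(matched);
    return score + nonMatchSlopFactor * nonMatching;
  }

  public static void main(String[] args) {
    double a = termScore(new double[] {1.0}, 3, 0.1);      // one match
    double b = termScore(new double[] {0.5, 1.0}, 3, 0.1); // extra match, larger slop
    System.out.println("A: " + a + ", B: " + b);
  }
}
```

When every occurrence matches with slop factor 1, the increments add up to the
normal term frequency score, as in a boolean OR over the terms. An extra match
raises the score as long as its slop factor is larger than the non matching
slop factor.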
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7580:
-
Attachment: LUCENE-7580.patch
[
https://issues.apache.org/jira/browse/LUCENE-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15720059#comment-15720059
]
Paul Elschot commented on LUCENE-7580:
--
"Recurse the spans tree to compose a score based on the type of subqueries ...
and what matched"
was suggested in September 2007 on the java-user list
http://www.gossamer-threads.com/lists/lucene/java-user/53027 .
Currently SpanScorer provides score values that have no real meaning when more
than one SpanTermQuery is used.
Patch follows.
Paul Elschot created LUCENE-7580:
Summary: Spans tree scoring
Key: LUCENE-7580
URL: https://issues.apache.org/jira/browse/LUCENE-7580
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Affects Versions: master (7.0)
Reporter: Paul Elschot
Priority: Minor
Fix For: 6.x
Recurse the spans tree to compose a score based on the type of subqueries and
what matched
[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15674083#comment-15674083
]
Paul Elschot commented on LUCENE-7398:
--
[~mikemccand], the patch is as you stated, and having MatchNear as an enum to
choose the matching method is easy to extend.
I would not mind having some more opinions on whether the progress is enough
to actually add the code.
I know this MG4J paper and it could well be that theorem 11 in there proves
that no lazy algorithm is possible for the general case with more than 2
subqueries, but for now I cannot really follow their terminology. In
particular I'd like to know whether or not these efficient algorithms
correspond to the current lazy implementations in Lucene. I'm hoping that they
do not, because then there might be some room for improvement in Lucene without
losing speed.
As [~gol...@detego-software.de] stated above:
bq. I want a higher score if the user-query matches for more than one variant
I don't think the ORDERED_LOOKAHEAD of the patch does that, because it only
matches one variant.
I hope that there is a non backtracking implementation that can do this, but
I'm not sure.
> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch,
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch,
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping],
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as
> "coordinate gene research". It does not match "coordinate gene mapping
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It
> probably stopped working with the changes on SpanQueries in 5.3. I will
> attach a unit test that shows the problem.
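Why a lazy ordered matcher can miss the match in this example, while a
backtracking one finds it, can be shown with a simplified model. This is not
Lucene's NearSpansOrdered code: spans are plain {start, end} pairs, the token
positions are assumed, and the class and method names are hypothetical.

```java
// Sketch only: OrderedNearSketch is a hypothetical, simplified model of
// ordered near matching. Each clause is a list of {start, end} spans,
// sorted by start and then by end.
class OrderedNearSketch {

  // Lazy: for each span of the first clause, take only the first span of
  // every following clause that starts at or after the previous end,
  // then check the total gap against the slop.
  static boolean lazyMatch(int[][][] clauses, int slop) {
    for (int[] first : clauses[0]) {
      int prevEnd = first[1];
      int gaps = 0;
      boolean complete = true;
      for (int c = 1; c < clauses.length && complete; c++) {
        int[] next = null;
        for (int[] span : clauses[c]) {
          if (span[0] >= prevEnd) {
            next = span;
            break;
          }
        }
        if (next == null) {
          complete = false;
        } else {
          gaps += next[0] - prevEnd;
          prevEnd = next[1];
        }
      }
      if (complete && gaps <= slop) {
        return true;
      }
    }
    return false;
  }

  // Backtracking: try every span of every clause.
  static boolean backtrackingMatch(int[][][] clauses, int clause,
                                   int prevEnd, int gaps, int slop) {
    if (clause == clauses.length) {
      return gaps <= slop;
    }
    for (int[] span : clauses[clause]) {
      if (clause == 0) {
        if (backtrackingMatch(clauses, 1, span[1], 0, slop)) {
          return true;
        }
      } else if (span[0] >= prevEnd
          && backtrackingMatch(clauses, clause + 1, span[1],
                               gaps + span[0] - prevEnd, slop)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Assumed positions: coordinate=7, gene=8, mapping=9, research=10.
    int[][][] clauses = {
      {{7, 8}},          // body:coordinate
      {{8, 9}, {8, 10}}, // spanOr: "gene" and "gene mapping"
      {{10, 11}}         // body:research
    };
    System.out.println("lazy:         " + lazyMatch(clauses, 0));
    System.out.println("backtracking: " + backtrackingMatch(clauses, 0, 0, 0, 0));
  }
}
```

In this model the lazy matcher commits to the shorter "gene" span, finds a gap
of one position before "research", and moves on without trying the longer
"gene mapping" span.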
[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15671305#comment-15671305
]
Paul Elschot commented on LUCENE-7398:
--
bq. is this latest patch ready to be committed, or are there still known
problems?
Both actually, assuming that master has not had a conflicting update since.
To completely solve this, backtracking is needed, and the patch does not provide
that.
To allow collecting/payloads easily, I'd rather accept the limitations/bugs of
the current lazy implementation.
As a minimum a reference to this issue could be added to the javadocs of the
(un)ordered near spans.
AFAIK:
- a complete solution that can be made with lazy iteration is a span near query
that has two subqueries
and that only checks the span starting positions,
- for subqueries that are terms or that do not vary in length, completeness for
two subqueries is already there.
In case there is interest in span near queries that only use starting
positions, well, that should be easy.
[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7398:
-
Attachment: LUCENE-7398.patch
Patch of 4 Oct 2016.
This is the patch of 25 Sep 2016, but without the UNORDERED_STARTPOS case.
In a nutshell this:
- adds ORDERED_LOOKAHEAD,
- is backward compatible,
- tries to document the limitations of the matching methods for SpanNearQuery.
[
https://issues.apache.org/jira/browse/LUCENE-7471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7471:
-
Attachment: LUCENE-7471.patch
Patch of 29 Sep 2016.
Builds on the two phase approach.
This has 89 fewer lines of code than master.
All tests pass.
It uses the same approach as UNORDERED_STARTPOS at LUCENE-7398.
> Simplify NearSpansOrdered
> -
>
> Key: LUCENE-7471
> URL: https://issues.apache.org/jira/browse/LUCENE-7471
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
>Reporter: Paul Elschot
>Priority: Minor
> Attachments: LUCENE-7471.patch
>
>
> Extend the span positions priority queue, remove SpansCell.
Paul Elschot created LUCENE-7471:
Summary: Simplify NearSpansOrdered
Key: LUCENE-7471
URL: https://issues.apache.org/jira/browse/LUCENE-7471
Project: Lucene - Core
Issue Type: Improvement
Components: core/search
Reporter: Paul Elschot
Priority: Minor
Extend the span positions priority queue, remove SpansCell.
[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519320#comment-15519320
]
Paul Elschot edited comment on LUCENE-7398 at 9/25/16 10:00 PM:
Patch of 24 Sep 2016, work in progress. Edit: superseded on 25 Sep, this can be
ignored.
This introduces SpanNearQuery.MatchNear to choose the matching method.
The ORDERED_LAZY case is still the patch of 14 August; this should be changed
back to the current implementation, and be used to implement ORDERED_LOOKAHEAD.
This implements MatchNear.UNORDERED_STARTPOS and uses that as the default
implementation for the unordered case.
The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos,
which is simpler than the current NearSpansUnordered: there is no SpansCell.
I'd expect this StartPos implementation to be a little faster, so I also
implemented it as default for the unordered case. In only one test case the
UNORDERED_LAZY method is needed to pass the test.
The question is whether it is ok to change the default unordered implementation
to only use the span start positions.
The collect() method is moved to the superclass ConjunctionSpans; this
simplification might be done in another issue.
was (Author: paul.elsc...@xs4all.nl):
Patch of 24 Sep 2016, work in progress.
This introduces SpanNearQuery.MatchNear to choose the matching method.
The ORDERED_LAZY case is still the patch of 14 August. It should be changed
back to the current implementation, and the August patch should instead be
used to implement ORDERED_LOOKAHEAD.
This implements MatchNear.UNORDERED_STARTPOS and uses that as the default
implementation for the unordered case.
The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos,
which is simpler than the current NearSpansUnordered: there is no SpansCell.
I'd expect this StartPos implementation to be a little faster, so I also
implemented it as default for the unordered case. Only one test case needs the
UNORDERED_LAZY method to pass.
The question is whether it is ok to change the default unordered implementation
to only use the span start positions.
The collect() method is moved to the superclass ConjunctionSpans; this
simplification could be done in a separate issue.
> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398-20160924.patch,
> LUCENE-7398-20160925.patch, LUCENE-7398.patch, LUCENE-7398.patch,
> TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping],
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as
> "coordinate gene research". It does not match "coordinate gene mapping
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It
> probably stopped working with the changes on SpanQueries in 5.3. I will
> attach a unit test that shows the problem.
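To make the reported example concrete, here is a toy model in Python. It is
not Lucene code: the helper names are hypothetical, the document is reduced to
the four relevant tokens, and the "lazy" matcher is only an assumption about
how an ordered near match without backtracking can miss the longer spanOr
alternative. Spans are (start, end) pairs with an exclusive end, and slop is 0.

```python
# Toy model only; positions are simplified to the four-token phrase.
tokens = "coordinate gene mapping research".split()


def term_spans(term):
    """All spans of a single term, as (start, end) with exclusive end."""
    return [(i, i + 1) for i, t in enumerate(tokens) if t == term]


def phrase_spans(first, second):
    """Spans of `first` immediately followed by `second` (ordered, slop 0)."""
    return [(s1, e2)
            for (s1, e1) in term_spans(first)
            for (s2, e2) in term_spans(second)
            if s2 == e1]


def or_spans(*alternatives):
    """Union of the alternatives, sorted by start and then by end."""
    return sorted(span for alt in alternatives for span in alt)


def near_ordered_lazy(clauses):
    """For each later clause, take only the first span at or after the
    previous end. Other spans of that clause are never tried."""
    for first in clauses[0]:
        prev_end, matched = first[1], True
        for spans in clauses[1:]:
            nxt = next((s for s in spans if s[0] >= prev_end), None)
            if nxt is None or nxt[0] != prev_end:  # slop 0: must be adjacent
                matched = False
                break
            prev_end = nxt[1]
        if matched:
            return True
    return False


def near_ordered_exhaustive(clauses, prev_end=None):
    """Try every span of every clause, so no alternative is skipped."""
    if not clauses:
        return True
    return any(near_ordered_exhaustive(clauses[1:], s[1])
               for s in clauses[0]
               if prev_end is None or s[0] == prev_end)


# spanNear([coordinate, spanOr([spanNear([gene, mapping]), gene]), research])
clauses = [
    term_spans("coordinate"),
    or_spans(phrase_spans("gene", "mapping"), term_spans("gene")),
    term_spans("research"),
]
```

In this model the spanOr clause yields "gene" before "gene mapping" because
both start at the same position and the shorter span sorts first. The lazy
matcher commits to "gene", finds that "research" is not adjacent, and gives
up, while the exhaustive matcher also tries "gene mapping" and matches.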
[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7398:
-
Attachment: LUCENE-7398-20160925.patch
Patch of 25 Sep 2016.
Compared to the previous patch, this removes the ORDERED_STARTPOS case, because
I don't know whether that is needed.
Also this restores backward compatibility.
Compared to master, this has:
Four MatchNear methods. Two are the current ones: they are called ORDERED_LAZY
and UNORDERED_LAZY, and they are used when the current builder and
constructors take a boolean ordered argument.
The third case is ORDERED_LOOKAHEAD, which is from the patch of 18 August.
The last case is UNORDERED_STARTPOS, which is simpler than UNORDERED_LAZY,
hopefully a little faster, and with better completeness of the result.
Javadocs for all four cases have been added.
All test cases from this issue have been added, and where necessary they have
been modified to use ORDERED_LOOKAHEAD and to not do span collection. These
tests pass.
For the last case, UNORDERED_STARTPOS, no test cases have been added yet; this
is still to be done. Does anyone have more difficult cases?
Minor point: the collect() method was moved to the superclass ConjunctionSpans.
Feedback welcome, especially on the javadocs of SpanNearQuery.MatchNear.
Instead of adding backtracking methods, it might be better to do counting of
input spans in a matching window. I'm hoping that the UNORDERED_STARTPOS case
can be extended for that. Any ideas there?
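As a rough illustration of the shape described above, the four methods and the
backward compatible mapping from the boolean ordered argument could be modelled
as follows. This is a Python sketch, not the Java code of the patch: only the
four constant names come from the description, while the `ordered` property
and the `match_near_for` helper are hypothetical.

```python
from enum import Enum


class MatchNear(Enum):
    """Hypothetical mirror of the four matching methods named in the patch."""
    ORDERED_LAZY = "ordered_lazy"              # current ordered behaviour
    ORDERED_LOOKAHEAD = "ordered_lookahead"    # from the August patch
    UNORDERED_LAZY = "unordered_lazy"          # current unordered behaviour
    UNORDERED_STARTPOS = "unordered_startpos"  # uses span start positions only

    @property
    def ordered(self):
        # UNORDERED_* names do not start with "ORDERED", so this is False there.
        return self.name.startswith("ORDERED")


def match_near_for(ordered):
    """Backward compatibility: a boolean argument selects a current method."""
    return MatchNear.ORDERED_LAZY if ordered else MatchNear.UNORDERED_LAZY
```

With this mapping, existing callers that pass a boolean keep their current
behaviour, and the two new methods have to be requested explicitly.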
[
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Elschot updated LUCENE-7398:
-
Attachment: LUCENE-7398-20160924.patch
Patch of 24 Sep 2016, work in progress.
This introduces SpanNearQuery.MatchNear to choose the matching method.
The ORDERED_LAZY case is still the patch of 14 August. It should be changed
back to the current implementation, and the August patch should instead be
used to implement ORDERED_LOOKAHEAD.
This implements MatchNear.UNORDERED_STARTPOS and uses that as the default
implementation for the unordered case.
The implementation of UNORDERED_STARTPOS is in NearSpansUnorderedStartPos,
which is simpler than the current NearSpansUnordered: there is no SpansCell.
I'd expect this StartPos implementation to be a little faster, so I also
implemented it as default for the unordered case. Only one test case needs the
UNORDERED_LAZY method to pass.
The question is whether it is ok to change the default unordered implementation
to only use the span start positions.
The collect() method is moved to the superclass ConjunctionSpans; this
simplification could be done in a separate issue.
[
https://issues.apache.org/jira/browse/LUCENE-7453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511145#comment-15511145
]
Paul Elschot edited comment on LUCENE-7453 at 9/21/16 8:56 PM:
---
bq. But the seg examples you have still have docid, just with seg prepended. It
still has the problem that it uses "id", when id means identifier,
This is meant as an identifier for a document within a segment; in a segment
this identifier is permanent. There may be another identifier in a document
field, but that is irrelevant here.
For compound readers there are multiple segments, and also in that case adding
seg to the name is correct.
was (Author: paul.elsc...@xs4all.nl):
bq. But the seg examples you have still have docid, just with seg prepended. It
still has the problem that it uses "id", when id means identifier,
This is meant as an identifier for a document within a segment; in a segment
this identifier is permanent, and the only one.
For compound readers there are multiple segments, and also in that case adding
seg to the name is correct.
> Change naming of variables/apis from docid to docnum
>
>
> Key: LUCENE-7453
> URL: https://issues.apache.org/jira/browse/LUCENE-7453
> Project: Lucene - Core
> Issue Type: Improvement
>Reporter: Ryan Ernst
>
> In SOLR-9528 a suggestion was made to change {{docid}} to {{docnum}}. The
> reasoning for this is most notably that {{docid}} has a connotation about a
> persistent unique identifier (eg like {{_id}} in elasticsearch or {{id}} in
> solr), while {{docid}} in lucene is currently something local to a segment, and
> not comparable directly across segments.
> When I first started working on Lucene, I had this same confusion. {{docnum}}
> is a much better name for this transient, segment local identifier for a doc.
> Regardless of what solr wants to do in their api (eg keeping _docid_), I
> think we should switch the lucene apis and variable names to use docnum.
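The point that a segment-local number is not directly comparable across
segments can be shown with a small Python sketch. The segment sizes and helper
names here are made up for illustration; the only Lucene concept relied on is
that each segment has a doc base (LeafReaderContext.docBase) that is added to
the segment-local number to get the number used by the composite reader.

```python
from bisect import bisect_right

# Hypothetical index with three segments holding 5, 3 and 4 documents.
segment_sizes = [5, 3, 4]

# doc_bases[i] is the number of documents in all segments before segment i.
doc_bases = [sum(segment_sizes[:i]) for i in range(len(segment_sizes))]


def to_index_wide(segment, seg_doc_id):
    """Segment-local number -> number as seen through the composite reader."""
    return doc_bases[segment] + seg_doc_id


def to_segment_local(index_doc_id):
    """Index-wide number -> (segment, segment-local number)."""
    segment = bisect_right(doc_bases, index_doc_id) - 1
    return segment, index_doc_id - doc_bases[segment]
```

Here the segment-local number 2 refers to a different document in every
segment, which is why the bare number is only meaningful together with its
segment.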
[
https://issues.apache.org/jira/browse/LUCENE-7453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511145#comment-15511145
]
Paul Elschot commented on LUCENE-7453:
--
bq. But the seg examples you have still have docid, just with seg prepended. It
still has the problem that it uses "id", when id means identifier,
This is meant as an identifier for a document within a segment; in a segment
this identifier is permanent, and the only one.
For compound readers there are multiple segments, and also in that case adding
seg to the name is correct.
[
https://issues.apache.org/jira/browse/LUCENE-7453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510965#comment-15510965
]
Paul Elschot commented on LUCENE-7453:
--
I tried an alternative that adds a variation of segment wherever docID is used
in some form.
Here is an overview of renaming possibilities for core/src/java, in
three-column Python strings.
The first column contains the current name, the second column a segment
variant, the third column an index variant.
Please assume an appropriate amount of question marks (??) in the second and
third columns.
{code}
classFileRenames = """
DocIdSet                      SegDocIdSet                      DocIndexSet
DocIdSetIterator              SegDocIdSetIterator              DocIndexIterator
ConjunctionDISI               ConjunctionSegDisi               ConjunctionDixi
DisjunctionDISIApproximation  DisjunctionSegDisiApproximation  DisjunctionDixiApproximation
DisiPriorityQueue             SegDisiPriorityQueue             DixiPriorityQueue
DisiWrapper                   SegDisiWrapper                   DixiWrapper
FilteredDocIdSetIterator      FilteredSegDisi                  FilteredDixi
DocIdSetBuilder               SegDocIdSetBuilder               DocIndexSetBuilder
RoaringDocIdSet               RoaringSegDocIdSet               RoaringDocIndexSet
IntArrayDocIdSet              IntArraySegDocIdSet              IntArrayDocIndexSet
NotDocIdSet                   NotSegDocIdSet                   NotDocIndexSet
BitDocIdSet                   BitSegDocIdSet                   BitDocIndexSet
DocIdsWriter                  SegDocIdsWriter                  DocIndexesWriter
DocIdMerger                   SegDocIdMerger                   DocIndexMerger
"""
identifierRenames = classFileRenames + """
TwoPhaseIteratorAsDocIdSetIterator  TwoPhaseIteratorAsSegDocIdSetIterator  TwoPhaseIteratorAsDocIndexIterator
BitSetConjunctionDISI               BitSetConjunctionDisi                  BitSetConjunctionDisi
IntArrayDocIdSetIterator            IntArraySegDocIdSetIterator            IntArrayDocIndexIterator
asDocIdSetIterator                  asSegDocIdSetIterator                  asDocIndexIterator
getDocId                            getSegDocId                            getDocIndex
docID                               sdocID                                 docIndex
docID                               sdocID                                 docIdx
docId                               sdocId                                 docIdx
docIDs                              sdocIDs                                docIdxs
docIds                              sdocIds                                docIdxs
disi                                sdisi                                  dixi
docIdSet                            sDocIdSet                              docIndexSet
"""
{code}
(The identifiers here are for local classes, methods and variables.)
I don't like overloading index for this, especially in the class names, so for
now I'd prefer the segment variants in the second column.
Anyway, we could use the opportunity to shorten some of the longer names.
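For what it's worth, here is a minimal sketch of how such three-column strings
could be turned into a rename mapping and applied to source text. It is only an
illustration, not the script I used: sample_renames is a small excerpt, and
parse_renames and apply_renames are hypothetical helper names.

```python
import re

# Small excerpt of the three-column table: current, segment, index variant.
sample_renames = """
DocIdSet            SegDocIdSet            DocIndexSet
DocIdSetIterator    SegDocIdSetIterator    DocIndexIterator
disi                sdisi                  dixi
"""


def parse_renames(table, column):
    """Map current name -> name in `column` (1 = segment, 2 = index)."""
    mapping = {}
    for line in table.splitlines():
        parts = line.split()
        if len(parts) == 3:  # skip blank or incomplete lines
            mapping[parts[0]] = parts[column]
    return mapping


def apply_renames(source, mapping):
    """Rename whole identifiers only, in a single pass over the source."""
    # Longest names first, so DocIdSetIterator is tried before DocIdSet.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source)


segment_variant = parse_renames(sample_renames, 1)
renamed = apply_renames("DocIdSetIterator disi = set.iterator();",
                        segment_variant)
```

Matching on word boundaries in one pass avoids renaming an identifier twice,
for example turning DocIdSet into SegDocIdSet and then matching it again.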