[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-12-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283172#comment-16283172
 ] 

ASF subversion and git services commented on LUCENE-4100:
-

Commit 0e1d6682d6ca66590e279ee0c4ccce745f2accd6 in lucene-solr's branch 
refs/heads/master from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0e1d668 ]

LUCENE-4100: Fix more queries to implement the new updated createWeight API.


> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: master (8.0)
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, LUCENE-4100.patch, 
> LUCENE-4100.patch, contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-12-07 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282738#comment-16282738
 ] 

Steve Rowe commented on LUCENE-4100:


More Solr tests than usual are failing after the {{4fc5a872d}} commit on this 
issue, e.g. {{TestHashQParserPlugin.testHashPartition}}, which fails for me 
with any seed.  From 
[https://builds.apache.org/job/Lucene-Solr-Tests-master/2210/]:

{noformat}
Checking out Revision 5448274f26191a9882aa5c3020e3cbdcbf93551c 
(refs/remotes/origin/master)
[...]
   [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestHashQParserPlugin -Dtests.method=testHashPartition 
-Dtests.seed=B372ECC8951DB18F -Dtests.multiplier=2 -Dtests.slow=true 
-Dtests.locale=cs -Dtests.timezone=Asia/Jayapura -Dtests.asserts=true 
-Dtests.file.encoding=ISO-8859-1
   [junit4] ERROR   3.01s J0 | TestHashQParserPlugin.testHashPartition <<<
   [junit4]> Throwable #1: java.lang.UnsupportedOperationException: Query  
does not implement createWeight
   [junit4]>at 
__randomizedtesting.SeedInfo.seed([B372ECC8951DB18F:62CFCD2DAE7708E4]:0)
   [junit4]>at 
org.apache.lucene.search.Query.createWeight(Query.java:66)
   [junit4]>at 
org.apache.lucene.search.IndexSearcher.createWeight(IndexSearcher.java:734)
   [junit4]>at 
org.apache.lucene.search.IndexSearcher.createNormalizedWeight(IndexSearcher.java:724)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.getProcessedFilter(SolrIndexSearcher.java:1062)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1540)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1416)
   [junit4]>at 
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:583)
   [junit4]>at 
org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
   [junit4]>at 
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:375)
   [junit4]>at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:295)
   [junit4]>at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
   [junit4]>at 
org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
   [junit4]>at 
org.apache.solr.util.TestHarness.query(TestHarness.java:337)
   [junit4]>at 
org.apache.solr.util.TestHarness.query(TestHarness.java:319)
   [junit4]>at 
org.apache.solr.search.TestHashQParserPlugin.testHashPartition(TestHashQParserPlugin.java:89)
   [junit4]>at java.lang.Thread.run(Thread.java:748)
[...]
   [junit4]   2> NOTE: test params are: codec=Lucene70, 
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@68e611ad),
 locale=cs, timezone=Asia/Jayapura
   [junit4]   2> NOTE: Linux 3.13.0-88-generic amd64/Oracle Corporation 
1.8.0_144 (64-bit)/cpus=4,threads=1,free=284221600,total=403177472
{noformat} 

My browser says that {{does not implement createWeight}} occurs 48 times in the 
[https://builds.apache.org/job/Lucene-Solr-Tests-master/2210/consoleText], so 
this is a problem for several tests.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: master (8.0)
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, LUCENE-4100.patch, 
> LUCENE-4100.patch, contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216925#comment-16216925
 ] 

Robert Muir commented on LUCENE-4100:
-

I changed the unreasonable fix version to set proper expectations, this isn't 
something we can introduce in a minor release, the underlying changes (e.g. to 
similarity) are heavy and we need to simplify stuff / do it correctly to 
prevent a mess.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: master (8.0)
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216916#comment-16216916
 ] 

Robert Muir commented on LUCENE-4100:
-

Well that's why i wrote the "note: " :)

Technically the better code would be:
{code}
if (indexOptions = DOCS_ONLY) {
  return 1; // tf values are omitted
}
...
{code}

And yes when frequencies are present, according to the docs its allowed to 
return -1 here, but that was solely motivated by PreFlexCodec: (Lucene 3.x 
indexes). What actually returns -1 here for this case? 

We will never have maxScore working as long as we have such complexity: lets 
remove the problem. I faced this same issue in LUCENE-7997 and i think its 
easier to just enforce that docCount and sumDocFreq are always present, and 
that sumTotalTermFreq, and totalTermFreq are always present when term freqs are 
stored.


> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-24 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216902#comment-16216902
 ] 

Adrien Grand commented on LUCENE-4100:
--

bq. // omitTFAP field, tf values are implicitly 1. 

I thought codecs could also deliberately not store ttf even if term frequencies 
are enabled? At least this is my understanding from the following sentence of 
{{TermsEnum.totalTermFreq}}'s javadocs: {{This will be -1 if the codec doesn't 
support this measure.}}.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216744#comment-16216744
 ] 

Robert Muir commented on LUCENE-4100:
-

Can we add this comment to DisjunctionMaxScorer (and not just leave at 
infinity):
{noformat}
// TODO: implement but be careful about floating-point errors.
{noformat}

Can we fix {{maxFreq}} calculation. Currently it is: 
{noformat}
// TODO: store the max term freq?
if (ttf == -1) {
  return Integer.MAX_VALUE;
} else {
  return ttf - df + 1;
}
{noformat}
It should be:
{noformat}
// TODO: store the max term freq?
if (ttf == -1) {
 // omitTFAP field, tf values are implicitly 1. 
 // note: it seems cleaner to check IndexOptions instead of looking at -1? 
  return 1;
} else {
  return Math.min(Integer.MAX_VALUE, ttf - df + 1);
}
{noformat}

I only partially reviewed the patch... will look at it again tonight.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215244#comment-16215244
 ] 

Robert Muir commented on LUCENE-4100:
-

I think we can also refine the upperbound to {{Math.min(Integer.MAX_VALUE, 1 + 
totalTermFreq - docFreq)}} today as well.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-23 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215238#comment-16215238
 ] 

Adrien Grand commented on LUCENE-4100:
--

I'll fix it at the same time as I make maxScore abstract on Scorer.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215224#comment-16215224
 ] 

Robert Muir commented on LUCENE-4100:
-

Yes but can we change the API on SimScorer to maxScore(freq)? We can improve 
already just by looking if the field omits TF (just pass 1), otherwise we pass 
Integer.MAX_VALUE.

I am also happy to factor this part of the patch (just the similarity api) out 
to LUCENE-7997 and add tests for it and discover any potential problems.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-23 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215213#comment-16215213
 ] 

Adrien Grand commented on LUCENE-4100:
--

Given how large the patch already is, would you be ok to delay exploring with 
maxFreq and minNorm, and looking into it in follow-up issues?

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214433#comment-16214433
 ] 

Adrien Grand commented on LUCENE-4100:
--

Agreed, we should make sure that eg. SynonymQuey exposes a max score.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213972#comment-16213972
 ] 

Robert Muir commented on LUCENE-4100:
-

Also what does the patch look like if Scorer.maxScore() is abstract instead of 
defaulting to +Infinity? I think it should be the same for Similarity. It is a 
little trappy, for example in the current patch you can't see that it is 
missing for things such as phrase scorers (but should work fine if done the 
same way as TermScorer). We know there are queryparsers forming disjunctions 
with these kind of subs and it might be able to help, presuming we can refine 
the algorithm a bit more.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-21 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213939#comment-16213939
 ] 

Robert Muir commented on LUCENE-4100:
-

{quote}
Store more metadata in order to be able to compute better higher bounds for the 
scores, like the maximum term frequency. Maybe even allow similarity-specific 
stats?
{quote}

maximum term frequency may be a reasonable idea. Its imperfect but its simple, 
can be re-computed on merge, validated by CheckIndex, etc. The downside is that 
per-term stats are still fairly costly i think for blocktree, so would the 
performance benefits be worth it? I do like that it'd require no additional 
storage for an {{omitTF}} field since it is 1 there by definition.

for phrases it could still work too (just take min or max across all the terms 
in the phrase and pass to sim as a bad approximation). So instead of:
{code}
/** Return the maximum score that this scorer may produce.
  * {@code Float.POSITIVE_INFINITY} is a fine return value if scores are not 
bounded. */
public abstract float maxScore();
{code}
we'd have
{code}
/** Return the maximum score that this scorer may produce.
  * {@code Float.POSITIVE_INFINITY} is a fine return value if scores are not 
bounded.
  * @param maxFreq maximum possible frequency value
  */
public abstract float maxScore(float maxFreq);
{code}

I wonder how much it would improve your performance for BM25 though? I think we 
shouldn't add such a stat unless it really helps our defaults due to the costly 
nature of such statistics (not just cpu/storage, but api complexity, codec 
requirements, etc). On the side, by my math (just playing around with numbers), 
its still not enough to make the optimization potent for ClassicSimilarity, 
which may be the worst-case here :) 

Theoretically it'd be simple to independently expose the min value for field's 
norm to get even better (similarity could pull that itself from its 
LeafReader), but it'd be so fragile: messy data such as a single very short doc 
for the field will effectively negate that opto for the entire field. 
Alternatively storing min norm's value per-term might do it, once written it 
does not change so can be easily merged, cross-verified with norm's value in 
CheckIndex, etc, but its another thing to benchmark and compare the cost.

Allowing the similarity to store its own stuff seems impractical for a number 
of reasons: for example we couldn't get any better for BM25 without breaking 
incremental and distributed search (avgFieldLength unknown until runtime).

Another way to put it: with the current patch the Similarity has all the raw 
statistics it needs to compute its maxScore, except that norm and tf values are 
left unbounded, so i'd rather see those raw statistics stored if we want to 
improve their maxScore functions. As attractive as it sounds to combine them 
up-front for the best possible performance, I think its not practical given the 
features we support.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204134#comment-16204134
 ] 

Robert Muir commented on LUCENE-4100:
-

Thanks for the benchmarking! It is unfortunate we have to make the api more 
complicated / specialize disjunctions even more, but seems like the right 
tradeoff i suppose.

{quote}
Can you elaborate on what you find confusing? This looks similar to how you 
should not call Score.score() if you passed needsScores=false to me?
{quote}

That's exactly it, i think we should try to avoid situations like that. its 
basically the opposite of type-safety, and the more of these conditionals / 
"methods you should not call" that we add, the more confusing it should get. 
That's why i'm still mulling what we can do to keep scorers simpler...

but for now, to move along, I think we have some basic idea of what to do to 
fix indexsearcher (a boolean about whether exact total hits are needed, for 
various purposes), but yeah lets keep it separate from what we do about 
createWeight. For the latter maybe an explicit boolean for maxScore is the 
simplest for now, and we can see where it goes.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-13 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203829#comment-16203829
 ] 

Adrien Grand commented on LUCENE-4100:
--

bq. indexsearcher already knows if scores are needed (e.g. from the Sort), but 
there is no way to tell it that approximate total hit count is acceptable. If 
we can do that, then I think we can make the early termination case really easy 
for the sorted case, index order case, and also this maxscore case.

+1 We need to add a new {{boolean needsTotalHits}} to the {{search(Query, 
int)}} and {{search(Query, int, Sort)}} methods.

bq. we do? [...] I'm just asking because the new scorer here looks a hell of a 
lot like a disjunction scorer

Well it is a disjunction. :) Our regular disjunction scorer maintains a single 
heap and only looks at the 'top' element and callso {{updateTop}} after calls 
to {{nextDoc}} or {{advance}}. If you want to be able to call {{advance()}} 
sometimes on low-scoring clauses even if only {{nextDoc()}} was called on the 
disjunction, you need to give it the ability to leave some scorers behind, as 
long as the sum of the max scores of scorers that are behind and scorers that 
are positioned on the current candidate is not greater than the minimum 
competitive score (otherwise you might be missing matches). This means you need 
to move scorers between at least 2 data-structures. In practice, we actually 
use 3 of them so that we can also easily differenciate scorers that are on the 
current candidate from scorers that are too advanced. Moving scorers between 
those data-structures has some overhead. For instance here is what I get when I 
benchmark the MaxScoreScorer against master's DisjunctionSumScorer (BS1 is 
disabled in both cases):

{noformat}
TaskQPS baseline  StdDev   QPS patch  StdDev
Pct diff
   HighTermDayOfYearSort   33.34  (7.3%)   16.55  (2.6%)  
-50.4% ( -56% -  -43%)
  OrHighHigh   21.91  (4.1%)   12.36  (1.7%)  
-43.6% ( -47% -  -39%)
   OrHighLow   48.08  (4.0%)   29.27  (2.4%)  
-39.1% ( -43% -  -34%)
   OrHighMed   60.66  (4.0%)   39.32  (2.5%)  
-35.2% ( -40% -  -29%)
  Fuzzy2  117.30  (7.1%)  101.36  (7.5%)  
-13.6% ( -26% -1%)
  Fuzzy1  238.83 (10.7%)  212.80  (7.3%)  
-10.9% ( -26% -7%)
OrHighNotLow   70.64  (3.4%)   66.82  (2.0%)   
-5.4% ( -10% -0%)
OrHighNotMed   65.59  (3.0%)   62.63  (1.6%)   
-4.5% (  -8% -0%)
   OrNotHighHigh   34.55  (2.3%)   33.24  (1.1%)   
-3.8% (  -6% -0%)
   OrHighNotHigh   48.98  (2.4%)   47.17  (1.4%)   
-3.7% (  -7% -0%)
OrNotHighMed  107.24  (2.1%)  103.40  (1.1%)   
-3.6% (  -6% -0%)
  IntNRQ   26.26 (11.5%)   25.64 (12.3%)   
-2.3% ( -23% -   24%)
  AndHighLow  879.33  (3.7%)  860.16  (3.3%)   
-2.2% (  -8% -4%)
  AndHighMed  168.90  (1.7%)  165.48  (1.3%)   
-2.0% (  -4% -1%)
  HighPhrase   20.08  (2.8%)   19.69  (2.7%)   
-2.0% (  -7% -3%)
 MedSloppyPhrase   15.44  (1.7%)   15.15  (1.8%)   
-1.9% (  -5% -1%)
 LowSpanNear   44.70  (2.1%)   43.88  (1.9%)   
-1.8% (  -5% -2%)
   MedPhrase   52.07  (3.2%)   51.16  (3.1%)   
-1.7% (  -7% -4%)
 LowSloppyPhrase  150.90  (1.5%)  148.62  (1.5%)   
-1.5% (  -4% -1%)
OrNotHighLow 1174.47  (3.6%) 1157.52  (4.1%)   
-1.4% (  -8% -6%)
HighSpanNear   38.46  (3.0%)   37.92  (2.4%)   
-1.4% (  -6% -4%)
HighSloppyPhrase   28.13  (2.4%)   27.76  (2.5%)   
-1.3% (  -6% -3%)
   LowPhrase  131.67  (1.6%)  130.37  (2.0%)   
-1.0% (  -4% -2%)
 MedSpanNear   54.75  (3.4%)   54.22  (3.0%)   
-1.0% (  -7% -5%)
Wildcard   41.58  (5.1%)   41.23  (5.2%)   
-0.8% ( -10% -9%)
 AndHighHigh   71.08  (1.5%)   70.61  (1.3%)   
-0.7% (  -3% -2%)
 Prefix3   88.11  (6.4%)   87.57  (7.5%)   
-0.6% ( -13% -   14%)
 MedTerm  235.73  (1.4%)  235.35  (3.7%)   
-0.2% (  -5% -4%)
HighTerm  144.06  (1.3%)  143.93  (4.2%)   
-0.1% (  -5% -5%)
 LowTerm  711.49  (4.2%)  718.15  (4.7%)
0.9% (  -7% -   10%)
 Respell  203.69  (8.6%)  206.72  (7.0%)
1.5% ( -12% -   18%)
  

[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203693#comment-16203693
 ] 

Robert Muir commented on LUCENE-4100:
-

yeah, I think i am looking at it from the top-down (indexsearcher) vs bottom up 
(queries).

indexsearcher already knows if scores are needed (e.g. from the Sort), but 
there is no way to tell it that approximate total hit count is acceptable. If 
we can do that, then I think we can make the early termination case really easy 
for the sorted case, index order case, and also this maxscore case.

{quote}
Ideally we would not need another parameter on Query.createWeight for MAXSCORE 
either, but the issue is that depending on whether you want to collect all hits 
or only the top-scoring ones, then we need different Scorer impls.
{quote}

we do? (genuine question). We added needsScores because previously scorers had 
to always be ready for you to "lazily" call score(), and this prevented scoring 
from doing much more interesting things up-front like caching whole bitsets, 
but is it really the case for maxScore? I'm just asking because the new scorer 
here looks a hell of a lot like a disjunction scorer :) If we truly need a 
different impl, we should maybe still think it thru because of stuff like 
setMinCompetitiveScore() method, which would make no sense except for that 
case. I do like that in your patch AssertingScorer checks all that stuff, but 
there it is a bit confusing.


> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-13 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203676#comment-16203676
 ] 

Adrien Grand commented on LUCENE-4100:
--

Yes, but this 3rd case is fine since it can be handled solely from the 
collector side, so we don't need to make weights aware of it?

Ideally we would not need another parameter on {{Query.createWeight}} for 
{{MAXSCORE}} either, but the issue is that depending on whether you want to 
collect all hits or only the top-scoring ones, then we need different Scorer 
impls.

bq. There is the current sorted-index case today, but you have to be a rocket 
scientist (custom collector) to use it

+1 to fix it and bake support for early termination in TopFieldDocCollector.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203603#comment-16203603
 ] 

Robert Muir commented on LUCENE-4100:
-

{quote}
Sorry I wrote needsTotalHits because this is the option name I used on the 
TopScoreDocCollector factory method, but on the weight we'd need something 
different. Because there are two situations in which you would need scores but 
not visit all matches:
{quote}

I'm not sure thats true. There is another case right? There is the current 
sorted-index case today, but you have to be a rocket scientist (custom 
collector) to use it. Surely the API should handle that case? (it should "just 
work" from indexsearcher). 

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-13 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203596#comment-16203596
 ] 

Adrien Grand commented on LUCENE-4100:
--

bq. Why doesn't it make sense? If i do a query, sorting by reverse time 
(recency), and retrieve the top 20, then i don't need scores, why do i need an 
exact hit count too? I think an approximation would suffice.

Sorry I wrote {{needsTotalHits}} because this is the option name I used on the 
TopScoreDocCollector factory method, but on the weight we'd need something 
different. Because there are two situations in which you would need scores but 
not visit all matches:
 - sorting by score
 - sorting by a field (potentially with early-termination) then score

Yet these cases are different since we can only apply MAXSCORE in the first 
case. So what we need is more something like {{canSkipNonCompetitiveScores}} 
but then this implies that {{needsScores}} is true, so there would be illegal 
combinations. Which is why I went with the enum.

bq. I think naively we want to base it on where we early terminate (as oppose 
to maxdoc) but i get the idea with many clauses. still, i think this estimate 
may be "good enough" because as you paginate, the estimate would get better?

OK I'll give this approach a go.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-13 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203384#comment-16203384
 ] 

Robert Muir commented on LUCENE-4100:
-

{quote}
This is what I wanted to do first but I didn'l like the fact that it would 
allow passing needsScores=false and needsTotalHits=false, which doesn't make 
sense. If you still prefer the two booleans approach despite this, I'm happy to 
make the change.
{quote}

Why doesn't it make sense? If i do a query, sorting by reverse time (recency), 
and retrieve the top 20, then i don't need scores, why do i need an exact hit 
count too? I think an approximation would suffice.

{quote}
I agree with that statement, but how do we compute a good estimate? It sounds 
challenging as the number of collected documents might be much less than the 
actual number of hits while the cost of the scorer can be highly overestimated, 
eg. for phrase queries. Should I return the number of collected documents and 
add documentation that this is a lower bound of the total number of hits?
{quote}

I think naively we want to base it on where we early terminate (as oppose to 
maxdoc) but i get the idea with many clauses. still, i think this estimate may 
be "good enough" because as you paginate, the estimate would get better?

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-13 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203135#comment-16203135
 ] 

Adrien Grand commented on LUCENE-4100:
--

Thanks for looking!

bq. Can we avoid the ScoreMode.merge? This seems really, really confusing. In 
general I don't think we should support such merging in MultiCollector or 
anywhere else, we should simply throw exception if things are different.

We have been allowing it until now (by ORing the needsScores booleans) so I 
didn't want to break this behaviour. I can remove this method and make the 
logic contained in MultiCollector by essentially returning COMPLETE if any of 
the collectors' score modes are different but I think failing would be 
surprising to many users and would probably need to be a 8.0 change if we 
decide to do it?

bq. Perhaps instead of the enum two booleans would be easier for now.

This is what I wanted to do first but I didn'l like the fact that it would 
allow passing needsScores=false and needsTotalHits=false, which doesn't make 
sense. If you still prefer the two booleans approach despite this, I'm happy to 
make the change.

bq. I don't understand why we should set the totalHitCount to -1, vs setting to 
a useful approximation, like google. The user said they didn't need the exact 
total hit count, so it should be no surprise, and its a hell of a lot more 
useful than a negative number.

I agree with that statement, but how do we compute a good estimate? It sounds 
challenging as the number of collected documents might be much less than the 
actual number of hits while the cost of the scorer can be highly overestimated, 
eg. for phrase queries. Should I return the number of collected documents and 
add documentation that this is a lower bound of the total number of hits?

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-10-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203051#comment-16203051
 ] 

Robert Muir commented on LUCENE-4100:
-

Can we avoid the ScoreMode.merge? This seems really, really confusing. In 
general I don't think we should support such merging in MultiCollector or 
anywhere else, we should simply throw exception if things are different.

I think the enum should be further revisited/simplified: essentially at the 
minimum it must capture 2 booleans from the user: whether scores are needed, 
and whether exact total hit count is needed. Perhaps instead of the enum two 
booleans would be easier for now.

I don't understand why we should set the totalHitCount to -1, vs setting to a 
useful approximation, like google. The user said they didn't need the exact 
total hit count, so it should be no surprise, and its a hell of a lot more 
useful than a negative number.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: LUCENE-4100.patch, LUCENE-4100.patch, 
> contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2017-08-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136601#comment-16136601
 ] 

Adrien Grand commented on LUCENE-4100:
--

bq. how to safely implement the optimization for dynamic indexes: See Stefan's 
note above. I do agree there are use cases for static indexes, the problem is: 
its lucene, people will try to use it with dynamic indexes regardless

Maybe BM25 is making our lives easier with this? Scores are bounded by {{IDF * 
(k1 + 1)}} if I'm not mistaken so we could use it to feed the MAXSCORE 
algorithm? It would not be as good as the initial patch since we are totally 
ignoring the range of values that the term frequency and document length may 
take but this would not require any index-time changes, does not require the 
index to be static and might already allow for good speedups if query terms 
have very different IDF values?



> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, 6.0
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2016-03-13 Thread Furkan KAMACI (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192554#comment-15192554
 ] 

Furkan KAMACI commented on LUCENE-4100:
---

I would like to apply this issue as aGSoC project if someone is volunteer for 
being a mentor.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.9, master
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2014-03-15 Thread Rowanto Rowanto (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936440#comment-13936440
 ] 

Rowanto Rowanto commented on LUCENE-4100:
-

Hi,

thank you for your email. Unfortunately, I'm not in the office until 4. April 
2014.
Please contact my team for important issue.
zx-ber-dg-devteam-g...@zanox.com

I apologize for the inconvenient it might have caused.
Thank you for your understanding.

Rowanto


> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, gsoc2014, patch, performance
> Fix For: 4.8
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2014-03-13 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934226#comment-13934226
 ] 

Otis Gospodnetic commented on LUCENE-4100:
--

This patch seems almost 2 years old.
Would it make sense to label this for GSoC?


> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.7
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2013-12-03 Thread Stefan Pohl (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837654#comment-13837654
 ] 

Stefan Pohl commented on LUCENE-4100:
-

Hi Aaron,
I can't say much about any possible complications Solr might add here, but if 
you can get your index building and submitting some test queries code compile 
with the SVN-revision mentioned above (some almost 4.0 revision), and you're 
fine with the current limitations mentioned above (DefaultSimilarity, flat 
queries = BooleanQuery of TermQueries) you can simply try it out. The patch 
should cleanly apply and be functional.
It also shouldn't be too hard to upgrade the code to newer Lucene versions that 
use BlockPostingsFormat (4.1) and the new pull-based codec API (LUCENE-5123).

Either way, it would be very interesting to hear of speedups with respect to 
MLT-like-queries.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.7
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2013-11-27 Thread Aaron Daubman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833812#comment-13833812
 ] 

Aaron Daubman commented on LUCENE-4100:
---

Thanks for the update [~spo] - My particular use-case seems tailor made for 
this. I have several decently large (10-30G indices) solr instances, all of 
which run in read-only mode and are created ~2x a day via a snapshot process 
that rolls the index out to load-balanced servers. Several of these instances 
routinely match 30-80% (custom MLT-like queries) of the 2-25M docs in the index 
per-query, so efficient scoring would be a huge win here.

I already have to patch and custom-build solr for our use (until I get around 
to creating required tests to haver SOLR-2052 accepted) and am wondering if you 
have any thoughts/guidance on trying out your patch?

The main use-case is from a custom extension of QueryComponent that overrides 
perpare() and essentially builds up a custom boosted boolean query and uses 
rb.setQueryString and rb.setFilters...

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.7
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2013-11-27 Thread Stefan Pohl (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833573#comment-13833573
 ] 

Stefan Pohl commented on LUCENE-4100:
-

Thanks for your interest, [~daubman].

[~rcmuir] accurately describes the current status and challenges here. There 
have to be made quite a few major and even more minor steps to eventually 
arrive at anything general-purpose and user-ready.
However, given the feedback that I got so far (there seem to be many people 
that don't have extremely high NRT-requirements and who rebuild their whole 
indexes every few hours/days anyways), I still think that it would be 
worthwhile to have this as-is (with the limitation to static indexes) in a 
separate contrib-package that users are discouraged to use, except they really 
know what they are doing. Adding a few exceptions to guard users from wrong 
behaviour/misusage might also be useful.

As long as I, as the reporter, don't close the issue, it won't 'bite rot', 
don't worry ;) I also got to hear of quite some interest when I mentioned it at 
this year's Lucene/Solr-Revolutions.EU conference.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.7
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2013-11-26 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13832677#comment-13832677
 ] 

Robert Muir commented on LUCENE-4100:
-

There is at least one step we made for the codec API (LUCENE-5123) that might 
make it easier: the codec has the freedom to "pull" the data rather than push 
it now, so it can more easily calculate interesting things that it wants.

Still, there remains some fairly major challenges in my eyes:
* how to safely implement the optimization for dynamic indexes: See Stefan's 
note above. I do agree there are use cases for static indexes, the problem is: 
its lucene, people will try to use it with dynamic indexes regardless. 
* how to connect such optimizations in a fairly clean way to the scoring api. 
This is perhaps the hardest part. There is a lot of separation between the 
scoring API (e.g. pluggable similarity, Scorer, etc) and the codec API. 

But i don't want to discourage anyone, just saying its probably even more 
difficult from an API perspective than the stuff to support something 
"different" like BooleanScorer.java.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.7
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2013-11-25 Thread Aaron Daubman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13831784#comment-13831784
 ] 

Aaron Daubman commented on LUCENE-4100:
---

[~spo] Stefan, I am curious if there has been any continued work on this? I 
know early this year you mentioned there hasn't been much time to work on it. 
It sounds very promising to me, so I hope it does not bit-rot! =)

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.7
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2013-01-02 Thread Stefan Pohl (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542058#comment-13542058
 ] 

Stefan Pohl commented on LUCENE-4100:
-

Otis, thank you for your interest and I wish everyone a happy New Year!

Regarding speedups, have a look at http://vimeo.com/44300228 from 12 minutes 
onwards. It obviously depends on your mix of queries and collection size, but 
on average more than 100% should be achievable for large collections and 
typical query sets, and much higher speedups for problem queries (many frequent 
terms).

The contribution as-is (after adaptation to latest Lucene API/code-base) could 
already be included as a separate Lucene module for people to use who can live 
with its current limitations (for static indexes only, smaller totalHitCount).
In my spare time (unfortunately not much of that recently), I continue working 
on different approaches to make this more general, but this is pure 
experimentation and any production-ready offspring of that will take longer 
than 4.x and might require API changes (such as the ones suggested above by 
Robert), so perhaps 5.0 is a good aim.

I suggest to continue rolling this ticket forward, or only attach version 5.0 
to it.

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.2, 5.0
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2012-12-28 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13540741#comment-13540741
 ] 

Otis Gospodnetic commented on LUCENE-4100:
--

Stefan, this sounds exciting.  What sort of speedups did you get?  Are you 
still looking to include this into Lucene (4.1)?

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.1
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2012-07-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413335#comment-13413335
 ] 

Robert Muir commented on LUCENE-4100:
-

{quote}
As a side note: I noticed a drop in the PKLookup benchmark, suggesting that it 
might be better not to extend the size of dictionary items, but to store 
maxscores in the beginning of inverted lists, or next to skip data. This effect 
should be smaller or disappear though when maxscores are not stored for many 
terms.
{quote}

I wouldn't worry about this, I noticed a few things that might speed that up:

1. Currently it does writeVInt(Float.floatToIntBits(term.maxscoreScore)) . But 
I think this should be writeInt, not writeVInt?
So I think currently we often write 5 bytes here, with all the vint checks for 
each byte, and as an Int it would always be 4 and faster.
2. Yes, with low freq terms (e.g. docFreq < skipMinimum), its probably best to 
just omit this at both read and write time. Then PK lookup would be fine.
3. As far as < 4 bytes ceiling, my motivation there was not to save in the term 
dictionary, but instead to make these smaller and allow us to add these at 
regular intervals. We can take advantage of a few things, e.g. it should never 
be a negative number for a 
well-formed Similarity (i think that would screw up the algorithm looking at 
your tests anyway).

{quote}
DefaultSimilarity is simple, but BM25 and LMDirichlet can't as easily be 
factored out, as you correctly point out, but we could come up with bounds for 
collection statistics (those that go into the score) within which it is safe to 
use maxscore, otherwise we fallback to score-all until a merge occurs, or we 
notify the user to better do a merge/optimize, or Lucene does a segment-rewrite 
with new maxscore and bound computations on basis of more current collection 
stats. I got first ideas for an algorithm to compute these bounds.
{quote}

Ok, I'm not sure I totally see how the bounds computation can work, but if it 
can we might be ok in general. If the different segments are somewhat 
homogeneous then these stats should pretty much be very close anyway.

The other idea i had was more intrusive, adding a computeImpact() etc to 
Similarity or whatever.

{quote}
If the score is anyway factored out, it might be better to simply store all 
document-dependent stats (TF, doclen) of the document with the maximum score 
contribution (as ints) instead of one aggregate intermediate float score 
contribution.
{quote}

That might be a good idea. with TF as a vint and doclen as a byte, we would 
typically only have two bytes but not actually lose any information (by 
default, all these sims encode doclen as a byte anyway).

{quote}
[implementation inside codec]
Please be aware that while terms are at some point excluded from merging, they 
still are advanced to the docs in other lists to gain complete document 
knowledge and compute exact scores. Maxscores can also be used to minimize how 
often this happens, but the gains are often compensated by the more complex 
scoring. Still having to skip inside of excluded terms complicates your 
suggested implementation. But we definitely should consider architecture 
alternatives. The MaxscoreCollector, for instance, does currently only have a 
user interface function, keeping track of the top-k and their entry threshold 
could well be done inside the Maxscorer.
I was thinking though to extend the MaxscoreCollector to provide different 
scoring information, e.g. an approximation of the number of hits next to the 
actual number of scored documents (currently totalHits).
{quote}

My current line of thinking is even crazier, but I don't yet have anything 
close to a plan.

As a start, I do think that IndexSearcher.search() methods should take a "Score 
Mode" of sorts from the user (some enum), which would allow Lucene to do less 
work if its not necessary. We would pass this down via Weight.scorer() as a 
parameter... solely
looking at the search side I think this would open up opportunities in general 
for us to optimize things: e.g. instantiate the appropriate Collector impl, and 
for Weights to create the most optimal Scorers. Not yet sure how it would tie 
into the code API.

I started hacking up on a prototype that looks like this (I might have tried to 
refactor too hard also shoving the Sort options in here...)
{noformat}
/**
 * Different modes of search.
 */
public enum ScoreMode {
  /** 
   * No guarantees that the ranking is correct,
   * the results may come back in a different order than if all
   * documents were actually scored. Total hit count may be 
   * unavailable or approximate.
   */
  APPROXIMATE,
  /** 
   * Ranking is the same as {@link COMPLETE}, but total hit 
   * count may be unavailable or approximate.
   */
  SAFE,
  /**
   * Guarantees complete iteration over all documents, but scores
   * may be unavailable.
   *

[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2012-07-12 Thread Stefan Pohl (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413246#comment-13413246
 ] 

Stefan Pohl commented on LUCENE-4100:
-

Thanks for the feedback! You're on spot with everything you're saying.

Yes, the methods as suggested in the different papers have (semi-)static 
indexes in mind, that is, such that batch-index many new documents, then 
recompute maxscores (hence, IndexRewriter) and roll out the new version of the 
indexes. This is a Lucene use-case common to many large installations (or part 
thereof) and as such important. Moreover, this approach can easily be 
generalized to the other Similarities, without that they necessarily have to 
know about maxscore, and can be simplified by some minor API changes within 
Lucene. The PoC code as-is might be of help to showcase dependencies in 
general, and such that currently are not well supported within Lucene (because 
there was no need for it yet).

If you really want to go the full distance: I already thought about doing 
maxscore live and got some ideas in this regard, see below.

Comments to your thoughts:

[PostingsWriter]
You're right. For simplicity, I was computing each term's overall contribution 
(as explained in the talk), including all but query-dependent factors. You can 
consider this as un-quantized impacts (in the sense of Anh et al.) which 
necessitates a second pass over a static index, hence IndexRewriter.

As a side note: I noticed a drop in the PKLookup benchmark, suggesting that it 
might be better not to extend the size of dictionary items, but to store 
maxscores in the beginning of inverted lists, or next to skip data. This effect 
should be smaller or disappear though when maxscores are not stored for many 
terms.

[Length normalization]
Yes, this might be a necessary dependency. It should be a general 
design-principle though to have as many as possible statistics at hand 
everywhere, as long as it doesn't hurt performance in terms of efficiency.

[splitting impacts / incremental indexing]
Yes, this would be more intrusive, requiring Similarity-dependent maxscore 
computations. Here is how it could work:
Very exotic scoring functions simply don't have to support maxscore and will 
thus fall back to the current score-all behaviour.
DefaultSimilarity is simple, but BM25 and LMDirichlet can't as easily be 
factored out, as you correctly point out, but we could come up with bounds for 
collection statistics (those that go into the score) within which it is safe to 
use maxscore, otherwise we fallback to score-all until a merge occurs, or we 
notify the user to better do a merge/optimize, or Lucene does a segment-rewrite 
with new maxscore and bound computations on basis of more current collection 
stats. I got first ideas for an algorithm to compute these bounds.

[docfreq=1 treatment]
Definitely agree. Possibly, terms with docfreq < x=10? could not store a 
maxscore. x configurable and default to be evaluated; x should be stored in 
index so that it can be determined which terms don't contain maxscores.
Having a special treatment for these terms (not considering them for exclusion 
within the algorithm) allows for easier exchange of the core of the algorithm 
to get the WAND algorithm, or also to ignore a maxscore for a term for which 
collection stats went out of bounds.

[maxscores per posting ranges]
+1. As indicated in the description, having multiple maxscores per term can be 
more efficient, possibly leading to tighter bounds and more skipping. 
Chakrabarti'11 opted for one extreme, computing a maxscore for each compressed 
posting block, whereas the optimal choice might have been a multiple of blocks, 
or a postings range not well aligned with block size.
Optimal choice will be very dependent on skip list implementation and its 
parameters, but also posting de-compression overhead.
The question is how to get access to this codec-dependent information inside of 
the scoring algorithm, tunneled through the TermQuery?

[store <4 bytes per maxscore]
Possible. As long as the next higher representable real number is stored (ceil, 
not floor), no docs will be missed and the algorithm remains correct. But 
because of more loose bounds the efficiency gain will be affected at some point 
with too few bits.

If the score is anyway factored out, it might be better to simply store all 
document-dependent stats (TF, doclen) of the document with the maximum score 
contribution (as ints) instead of one aggregate intermediate float score 
contribution.

[implementation inside codec]
Please be aware that while terms are at some point excluded from merging, they 
still are advanced to the docs in other lists to gain complete document 
knowledge and compute exact scores. Maxscores can also be used to minimize how 
often this happens, but the gains are often compensated by the more complex 
scoring. Still having to ski

[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2012-07-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413055#comment-13413055
 ] 

Robert Muir commented on LUCENE-4100:
-

{quote}
Your index at 1) does not have to be 'optimized' (it does not have to consist 
of one index segment only). In fact, maxscore can be more efficient with 
multiple segments because multiple maxscores are computed for many frequent 
terms for subsets of documents, resulting in tighter bounds and more effective 
pruning.
{quote}

I've been thinking about this a lot lately: while what you say is true, thats 
because you reprocess all segments with IndexRewriter (which is fine for a 
static collection).

But this algorithm in general is not rank safe with incremental indexing: the 
problem is that when doing actual scoring,
scores consist of per-segment/within document stats (term frequency, document 
length), but also are affected by collection-wide
statistics from many other segments (IDF, average document length, ...) or even 
machines in a distributed collection.

So I think for this to work and remain rank-safe, we cannot write the entire 
score into the segment, because the score
at actual search time is dependent on all the other segments being searched. 
Instead I think this can only work when
we can easily factor out an impact (e.g. in the case of DefaultSimilarity the 
indexed maxscore excludes the IDF component,
this is instead multiplied in at search time).

I don't see how it can be rank-safe with algorithms like BM25 and incremental 
indexing, where parameters like average document
length are not simple multiplicative factors into the formula: and determine 
exactly how important tf versus document length play
a role in the score, but I'll think about it some more.


> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0-ALPHA
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.0
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2012-07-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13407472#comment-13407472
 ] 

Robert Muir commented on LUCENE-4100:
-

I spun off a sub-issue (LUCENE-4198) to see how we can first fix this Codec API 
so that
you don't need an IndexRewriter and this patch could work "live".

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.0
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4100) Maxscore - Efficient Scoring

2012-07-05 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406992#comment-13406992
 ] 

Robert Muir commented on LUCENE-4100:
-

Hello, thank you for working on this! 

I have just taken a rough glance at the code, and think we should probably look 
at what API changes would make 
this sort of thing fit better into Lucene and it easier to implement.

Random thoughts:

Specifically: what you are doing in the PostingsWriter is similar to computing 
impacts (I don't have a copy of
the paper so admittedly don't know the exact algorithm you are using). But it 
seems to me that you are putting 
a maxScore in the term dictionary metadata for all of the terms postings (as a 
float).

With the tool you provide, this works because you have access to e.g. the 
segment's length normalization information
etc (your postingswriter takes a reader). But we would have to think about how 
to give postingswriters access to this 
on flush... it seems possible to me though.

Giving the postingswriter full statistics (e.g. docfreq) for Similarity 
computation seems difficult: while I think
we could accum this stuff in FreqProxTermsWriter before we flush to the codec, 
it wouldn't solve the problem at merge time,
so you would have to do a 2-pass merge in the codec somehow...

But the alternative of splitting the "impact" (tf/norm) from the 
document-independent weight (e.g. IDF) isn't that pretty
either, because it limits the scoring systems (Similarity implementations) that 
could use the optimization.

as many terms will be low frequency (e.g. docfreq=1), i think its not
worth it to encode the maxscore for these low freq terms: we could save space 
by omitting maxscore for low freq terms 
and just treat it as infinitely large?

the opposite problem: is it really optimal to encode maxscore for the entire 
term? or would it be better for high-freq
terms to encode maxScore for a range of postings (e.g. block). This way, you 
could skip over ranges of postings that cannot
compete (rather than limiting the optimization to an entire term). A codec 
could put this information into a block header,
or at certain intervals, into the skip data, etc.

do we really need a full 4-byte float? How well would the algorithm work with 
degraded precision: e.g. something like
SmallFloat. (I think this SmallFloat currently computes a lower bound, we would 
have to bump to the next byte to make an upper bound).

another idea: it might be nice if this optimization could sit underneath the 
codec, such that you dont need a special
Scorer. One idea here would be for your collector to set an attribute on the 
DocsEnum (maxScore): of course a normal
codec would totally ignore this and proceed as today. But codecs like this one 
could return NO_MORE_DOCS when postings
for that term can no longer compete. I'm just not positive if this algorithm 
can be refactored in this way, and this
would also require some clean way of getting these attributes from Collector -> 
Scorer -> DocsEnum. Currently Scorer
is in the way here :)

Just some random thoughts, I'll try to get a copy of this paper so I have a 
better idea whats going on with this particular
optimization...

> Maxscore - Efficient Scoring
> 
>
> Key: LUCENE-4100
> URL: https://issues.apache.org/jira/browse/LUCENE-4100
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: core/codecs, core/query/scoring, core/search
>Affects Versions: 4.0
>Reporter: Stefan Pohl
>  Labels: api-change, patch, performance
> Fix For: 4.0
>
> Attachments: contrib_maxscore.tgz, maxscore.patch
>
>
> At Berlin Buzzwords 2012, I will be presenting 'maxscore', an efficient 
> algorithm first published in the IR domain in 1995 by H. Turtle & J. Flood, 
> that I find deserves more attention among Lucene users (and developers).
> I implemented a proof of concept and did some performance measurements with 
> example queries and lucenebench, the package of Mike McCandless, resulting in 
> very significant speedups.
> This ticket is to get started the discussion on including the implementation 
> into Lucene's codebase. Because the technique requires awareness about it 
> from the Lucene user/developer, it seems best to become a contrib/module 
> package so that it consciously can be chosen to be used.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h.