[jira] [Created] (SOLR-6613) TextField.analyzeMultiTerm should not throw exception when analyzer returns no term

2014-10-09 Thread Bruno Roustant (JIRA)
Bruno Roustant created SOLR-6613:


         Summary: TextField.analyzeMultiTerm should not throw exception when analyzer returns no term
             Key: SOLR-6613
             URL: https://issues.apache.org/jira/browse/SOLR-6613
         Project: Solr
      Issue Type: Bug
      Components: Schema and Analysis
Affects Versions: 4.3.1, 4.10.2, Trunk
        Reporter: Bruno Roustant


In TextField.analyzeMultiTerm(), at these lines:
  try {
    if (!source.incrementToken())
      throw new SolrException();

The method should not throw an exception when there is no token. Having no 
token is legitimate, because all tokens may be filtered out (e.g. by a 
blocking filter such as StopFilter).

In this case it should simply return null (as it already returns null in some 
cases, see first line of method).
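
The proposed behavior can be sketched in plain Java without any Solr or Lucene dependency. This is only an illustration under stated assumptions: the class MultiTermSketch, the helper analyze() and the stop-word set are hypothetical stand-ins for the real analysis chain, and the "too many terms" check is an assumption of the sketch, not something taken from TextField.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class MultiTermSketch {

    // Hypothetical stand-in for an analysis chain: lowercases, splits on
    // whitespace and drops stop words (the role a StopFilter would play).
    static Iterator<String> analyze(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept.iterator();
    }

    // Returns the single analyzed term, or null when analysis yields no token.
    static String analyzeMultiTerm(String part, Set<String> stopWords) {
        if (part == null) {
            return null; // null is already a possible return value
        }
        Iterator<String> source = analyze(part, stopWords);
        if (!source.hasNext()) {
            return null; // proposed behavior: no token is legitimate, so no exception
        }
        String term = source.next();
        if (source.hasNext()) {
            // Assumption of this sketch: more than one token is still an error.
            throw new IllegalArgumentException("too many terms for multi-term part: " + part);
        }
        return term;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a"));
        System.out.println(analyzeMultiTerm("Foo", stopWords)); // a term survives analysis
        System.out.println(analyzeMultiTerm("the", stopWords)); // every token is filtered out
    }
}
```

With this shape, a caller can distinguish "no term" from a real failure instead of catching an exception.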



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-6613) TextField.analyzeMultiTerm should not throw exception when analyzer returns no term

2014-10-09 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-6613:
-
Attachment: TestTextField.java




[jira] [Updated] (SOLR-6613) TextField.analyzeMultiTerm should not throw exception when analyzer returns no term

2014-10-09 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-6613:
-
Description: 
In TextField.analyzeMultiTerm(), at these lines:
  try {
    if (!source.incrementToken())
      throw new SolrException();

The method should not throw an exception when there is no token. Having no 
token is legitimate, because all tokens may be filtered out (e.g. by a 
blocking filter such as StopFilter).

In this case it should simply return null (as it already returns null in some 
cases, see first line of method). However, SolrQueryParserBase also needs to be 
fixed to correctly handle a null returned by TextField.analyzeMultiTerm().

See attached TestTextField for the corresponding new test class.
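
On the caller side, handling a null term could look roughly like the sketch below. It is not SolrQueryParserBase code: the class NullTermHandlingSketch, the method buildPrefixClause() and the MATCH_NOTHING marker are hypothetical, and treating a null term as "matches no documents" is an assumption made for illustration only.

```java
final class NullTermHandlingSketch {

    // Hypothetical marker for "this clause matches no documents".
    static final String MATCH_NOTHING = "<no-match>";

    // analyzedTerm is what a multi-term analysis step returned; null means
    // that every token was filtered out.
    static String buildPrefixClause(String field, String analyzedTerm) {
        if (analyzedTerm == null) {
            return MATCH_NOTHING; // assumption of this sketch: an empty term matches nothing
        }
        return field + ":" + analyzedTerm + "*";
    }

    public static void main(String[] args) {
        System.out.println(buildPrefixClause("title", "foo"));
        System.out.println(buildPrefixClause("title", null));
    }
}
```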

  was:
In TextField.analyzeMultiTerm()
at line
try {
  if (!source.incrementToken())
throw new SolrException();

The method should not throw an exception if there is no token because having no 
token is legitimate because all tokens may be filtered out (e.g. with a 
blocking Filter such as StopFilter).

In this case it should simply return null (as it already returns null in some 
cases, see first line of method).





[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-07 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465767#comment-16465767
 ] 

Bruno Roustant commented on LUCENE-8292:


I just realized that the current no-default-override behavior is actually 
enforced by a test TestFilterLeafReader.testOverrideMethods.

I still think all methods should be overridden, but I understand that this may 
not be the expected behavior currently.

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> --
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2.1
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: trunk
>
> Attachments: 
> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, 
> LUCENE-8292.patch
>
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
> methods.
> It misses some seekExact() methods, thus it is not possible for the delegate 
> to override these methods to have specific behavior (unlike the TermsEnum API 
> which allows that).
> The fix is straightforward: simply override these seekExact() methods and 
> delegate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-07 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465887#comment-16465887
 ] 

Bruno Roustant commented on LUCENE-8292:


[~dsmiley], if I create a subclass of FilterTermsEnum to override seekExact, 
how can I make other classes in Lucene create this subclass instead of 
FilterTermsEnum? Would I have to also override other classes or other factories?




[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-04-27 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456470#comment-16456470
 ] 

Bruno Roustant commented on SOLR-11865:
---

Actually the TrieSubsetMatcher introduced by the next patch does not support 
keepElevationPriority. If keepElevationPriority=true, this matcher is replaced 
by another, which keeps the order but which is less efficient. And this is done 
at component initialization time, in the inform() method (in 
loadElevationProvider()).

So I think it cannot be a query param because it is fixed in the data structure 
at initialization time.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-15 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475579#comment-16475579
 ] 

Bruno Roustant commented on LUCENE-8292:


Actually there is also another related issue with this 
FilterLeafReader#FilterTermsEnum delegate pattern.

It delegates neither termState() nor seekExact(BytesRef, TermState). As a 
result the termState is never used, so term queries perform the same seek 
(seekCeil) twice instead of reusing the termState to improve performance 
(normally the termState is kept by TermContext#build()).

Practical example: when one configures a timeout for queries, an 
ExitableDirectoryReader is created internally. Its ExitableTermsEnum, which 
extends FilterTermsEnum, makes all term queries repeat the same seekCeil() 
twice.
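
The doubled seek can be illustrated with a small self-contained model. Everything here is hypothetical and simplified (SketchTermsEnum, DictEnum and both filters are not Lucene classes, and the state is just a saved position); the sketch only shows how a wrapper that does not forward the state-related methods forces a second dictionary seek.

```java
import java.util.Arrays;

final class TermStateSketch {

    // Hypothetical, simplified stand-in for a terms enum; not the Lucene API.
    abstract static class SketchTermsEnum {
        // Positions on the smallest term >= the target; true on an exact match.
        abstract boolean seekCeil(String term);

        // Default: no reusable state is exposed.
        Integer termState() {
            return null;
        }

        // Default: without a usable state, fall back to a full seek.
        void seekExact(String term, Integer state) {
            seekCeil(term);
        }
    }

    // Sorted in-memory dictionary that counts the real seeks it performs.
    static final class DictEnum extends SketchTermsEnum {
        private final String[] sortedTerms;
        private int position = -1;
        int seeks = 0;

        DictEnum(String... sortedTerms) {
            this.sortedTerms = sortedTerms;
        }

        @Override
        boolean seekCeil(String term) {
            seeks++;
            int idx = Arrays.binarySearch(sortedTerms, term);
            position = idx >= 0 ? idx : -idx - 1;
            return idx >= 0;
        }

        @Override
        Integer termState() {
            return position;
        }

        @Override
        void seekExact(String term, Integer state) {
            if (state == null) {
                seekCeil(term);   // nothing to restore
            } else {
                position = state; // restore the saved position, no seek
            }
        }
    }

    // Filter that forwards seekCeil only, so state handling uses the defaults.
    static class SeekOnlyFilter extends SketchTermsEnum {
        final SketchTermsEnum in;
        SeekOnlyFilter(SketchTermsEnum in) { this.in = in; }
        @Override boolean seekCeil(String term) { return in.seekCeil(term); }
    }

    // Filter that also forwards the state-related methods.
    static final class StateAwareFilter extends SeekOnlyFilter {
        StateAwareFilter(SketchTermsEnum in) { super(in); }
        @Override Integer termState() { return in.termState(); }
        @Override void seekExact(String term, Integer state) { in.seekExact(term, state); }
    }

    // Mimics a term query: look the term up, save the state, then reposition with it.
    static int seeksForTermQuery(SketchTermsEnum termsEnum, DictEnum dict, String term) {
        termsEnum.seekCeil(term);
        Integer state = termsEnum.termState();
        termsEnum.seekExact(term, state);
        return dict.seeks;
    }

    public static void main(String[] args) {
        DictEnum first = new DictEnum("apple", "banana", "cherry");
        System.out.println(seeksForTermQuery(new SeekOnlyFilter(first), first, "banana"));
        DictEnum second = new DictEnum("apple", "banana", "cherry");
        System.out.println(seeksForTermQuery(new StateAwareFilter(second), second, "banana"));
    }
}
```

In this model the seek-only filter costs two dictionary seeks per lookup, while the state-aware filter costs one.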




[jira] [Comment Edited] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-15 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475579#comment-16475579
 ] 

Bruno Roustant edited comment on LUCENE-8292 at 5/15/18 9:57 AM:
-

Actually there is also another related issue with this 
FilterLeafReader#FilterTermsEnum delegate pattern.

It delegates neither termState() nor seekExact(BytesRef, TermState). As a 
result the termState is never used, so term queries perform the same seek 
(seekCeil) twice instead of reusing the termState to improve performance 
(normally the termState is kept by TermContext#build()).

Practical example: when one configures a timeout for queries, an 
ExitableDirectoryReader is created internally. Its ExitableTermsEnum, which 
extends FilterTermsEnum, makes all term queries repeat the same seekCeil() 
twice.


was (Author: bruno.roustant):
Actually there is also another related issue with this 
FilterLeafReader#FilterTermsEnum delegate pattern.

It does not delegate termState() nor seekExact(ByteRef, TermState) methods. 
Which means the termState is never used, so the term queries repeat twice the 
same seek (seekCeil) instead of using the termState to improve performance 
(normally the termState is kept by TermContext#build()).

Practical example: When one configures a timeout for queries, internally a 
ExitableDirectoryReader is created. And its ExitableTermsEnum, which extends 
FilterTermsEnum, makes all term queries repeat twice the same seekCeil().




[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-05-15 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476032#comment-16476032
 ] 

Bruno Roustant commented on SOLR-11865:
---

Great! I agree with all your points [~dsmiley].

Indeed, the String IDs in Elevation would be clearer as BytesRefs. I also vote 
to convert the key String to its indexed form as early as possible, provided 
the code remains small.




[jira] [Closed] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-06-19 Thread Bruno Roustant (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant closed SOLR-11865.
-

Work done

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Minor
>  Labels: QueryComponent
> Fix For: 7.5
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-06-19 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517247#comment-16517247
 ] 

Bruno Roustant commented on SOLR-11865:
---

Thanks for your incredible help [~dsmiley]!

Closing this PR.




[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-05-31 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496788#comment-16496788
 ] 

Bruno Roustant commented on SOLR-11865:
---

You're right, MapElevationProvider.buildElevationMap should merge in this case 
(which indeed should not happen, since they have been merged earlier).

I have created the GitHub PR 
(https://github.com/apache/lucene-solr/pull/390), to be enhanced with all 
your improvements.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Comment Edited] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-03 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462394#comment-16462394
 ] 

Bruno Roustant edited comment on LUCENE-8292 at 5/3/18 1:03 PM:


Looking at the TermsEnum API, my understanding is that seekExact() defaults 
to calling seekCeil(), but if needed (not for correctness but for performance 
reasons) we can override it with a specialized seek that searches only for the 
exact term and does not have to position on the next term if it is not found.

This may have an impact on some TermsEnum extensions (a really noticeable 
impact in my case, which is why I noticed this issue). To me the current 
behavior of FilterTermsEnum is not correct with regard to the TermsEnum API. 
(And I noticed that AssertingLeafReader overrides seekExact()).

Adding these two methods to FilterTermsEnum fixes correctness, even if I agree 
it leaves more room for bugs.


was (Author: bruno.roustant):
When looking at TermsEnum API, what I understand is that seekExact() defaults 
to calling seekCeil(), but if needed (not for correctness but for performance 
consideration) we can override it to have a specialized seek that searches only 
the exact term and does not have to position to the next term if not found.

This may have an impact for some TermsEnum extensions (a really noticeable 
impact in my case, that's why I noticed this issue). To me the current behavior 
of FilterTermsEnum is not correct with regard to TermsEnum API. (And I noticed 
that AssertingLeafReader overrides seekExact()).

Adding this two methods in FilterTermsEnum fixes correctness, even if I agree 
it makes more room for bugs.




[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-03 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462368#comment-16462368
 ] 

Bruno Roustant commented on LUCENE-8292:


1- "Not possible to override": I was not clear. It is still possible for a 
delegate TermsEnum to override the seekExact() method. But it will never be 
called since the FilterTermsEnum above always calls seekCeil().

2- "Two more methods to override": You're right. Normally the same code 
should be reusable, so it should not be tedious, but I see the trappy point.
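
Point 1 can be reproduced with a small stand-alone model. The types below (SketchTermsEnum, CountingDelegate, SeekCeilOnlyFilter) are hypothetical and much simpler than the Lucene classes; the sketch only shows that a delegate's seekExact() override is bypassed when the wrapper forwards seekCeil() alone.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class SeekExactBypassSketch {

    // Hypothetical, simplified stand-in for the TermsEnum contract.
    abstract static class SketchTermsEnum {
        abstract boolean seekCeil(String term);

        // Default implementation: an exact seek falls back to seekCeil.
        boolean seekExact(String term) {
            return seekCeil(term);
        }
    }

    // Delegate with a specialized exact seek; counts which method is hit.
    static final class CountingDelegate extends SketchTermsEnum {
        private final Set<String> terms;
        int ceilCalls = 0;
        int exactCalls = 0;

        CountingDelegate(String... terms) {
            this.terms = new HashSet<>(Arrays.asList(terms));
        }

        @Override
        boolean seekCeil(String term) {
            ceilCalls++;
            return terms.contains(term); // simplified: reports exact matches only
        }

        @Override
        boolean seekExact(String term) {
            exactCalls++;
            return terms.contains(term);
        }
    }

    // Wrapper that forwards seekCeil but inherits the default seekExact.
    static final class SeekCeilOnlyFilter extends SketchTermsEnum {
        private final SketchTermsEnum in;

        SeekCeilOnlyFilter(SketchTermsEnum in) {
            this.in = in;
        }

        @Override
        boolean seekCeil(String term) {
            return in.seekCeil(term);
        }
    }

    public static void main(String[] args) {
        CountingDelegate delegate = new CountingDelegate("apple", "banana");
        SketchTermsEnum filtered = new SeekCeilOnlyFilter(delegate);
        boolean found = filtered.seekExact("banana");
        System.out.println(found + " ceilCalls=" + delegate.ceilCalls
                + " exactCalls=" + delegate.exactCalls);
    }
}
```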




[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-03 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462394#comment-16462394
 ] 

Bruno Roustant commented on LUCENE-8292:


Looking at the TermsEnum API, my understanding is that seekExact() defaults 
to calling seekCeil(), but if needed (not for correctness but for performance 
reasons) we can override it with a specialized seek that searches only for the 
exact term and does not have to position on the next term if it is not found.

This may have an impact on some TermsEnum extensions (a really noticeable 
impact in my case, which is why I noticed this issue). To me the current 
behavior of FilterTermsEnum is not correct with regard to the TermsEnum API. 
(And I noticed that AssertingLeafReader overrides seekExact()).

Adding these two methods to FilterTermsEnum fixes correctness, even if I agree 
it leaves more room for bugs.




[jira] [Created] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-03 Thread Bruno Roustant (JIRA)
Bruno Roustant created LUCENE-8292:
--

         Summary: Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
             Key: LUCENE-8292
             URL: https://issues.apache.org/jira/browse/LUCENE-8292
         Project: Lucene - Core
      Issue Type: Bug
      Components: core/index
Affects Versions: 7.2.1
        Reporter: Bruno Roustant
         Fix For: trunk


FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
methods.

It misses some seekExact() methods, thus it is not possible for the delegate to 
override these methods to have specific behavior (unlike the TermsEnum API 
which allows that).

The fix is straightforward: simply override these seekExact() methods and 
delegate.
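
A minimal sketch of the "override and delegate" idea follows, using simplified hypothetical types rather than the real FilterLeafReader.FilterTermsEnum. The two seekExact() overloads and the Object-typed state are assumptions chosen to keep the example self-contained; this is not the attached patch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class DelegatingFilterSketch {

    // Hypothetical, simplified stand-in for the TermsEnum contract.
    abstract static class SketchTermsEnum {
        abstract boolean seekCeil(String term);

        // Default: an exact seek falls back to seekCeil.
        boolean seekExact(String term) {
            return seekCeil(term);
        }

        // Default: the state-based variant ignores the state.
        void seekExact(String term, Object state) {
            seekExact(term);
        }
    }

    // Delegate with specialized seeks; counts which method is hit.
    static final class CountingDelegate extends SketchTermsEnum {
        private final Set<String> terms;
        int ceilCalls = 0;
        int exactCalls = 0;
        int stateCalls = 0;

        CountingDelegate(String... terms) {
            this.terms = new HashSet<>(Arrays.asList(terms));
        }

        @Override
        boolean seekCeil(String term) {
            ceilCalls++;
            return terms.contains(term); // simplified: reports exact matches only
        }

        @Override
        boolean seekExact(String term) {
            exactCalls++;
            return terms.contains(term);
        }

        @Override
        void seekExact(String term, Object state) {
            stateCalls++;
        }
    }

    // Wrapper that forwards every seek variant to the wrapped enum.
    static final class DelegatingFilter extends SketchTermsEnum {
        private final SketchTermsEnum in;

        DelegatingFilter(SketchTermsEnum in) {
            this.in = in;
        }

        @Override
        boolean seekCeil(String term) {
            return in.seekCeil(term);
        }

        @Override
        boolean seekExact(String term) {
            return in.seekExact(term);
        }

        @Override
        void seekExact(String term, Object state) {
            in.seekExact(term, state);
        }
    }

    public static void main(String[] args) {
        CountingDelegate delegate = new CountingDelegate("apple", "banana");
        SketchTermsEnum filtered = new DelegatingFilter(delegate);
        filtered.seekExact("banana");
        filtered.seekExact("banana", new Object());
        System.out.println("ceilCalls=" + delegate.ceilCalls
                + " exactCalls=" + delegate.exactCalls
                + " stateCalls=" + delegate.stateCalls);
    }
}
```

Here the delegate's specialized methods are the ones that run, which is the behavior the issue asks for.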






[jira] [Updated] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-03 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-8292:
---
Attachment: LUCENE-8292.patch
0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch




[jira] [Comment Edited] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-03 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462409#comment-16462409
 ] 

Bruno Roustant edited comment on LUCENE-8292 at 5/3/18 1:08 PM:


Another option would be to modify the TermsEnum.seekExact() method and make it 
final, or have the javadoc be explicit that it should not be overridden 
(though I don't like this option).


was (Author: bruno.roustant):
Another option would be to modify the TermsEnum.seekExact() method and make it 
final, or have the javadoc be explicit that it should not be overridden.

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> --
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2.1
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: trunk
>
> Attachments: 
> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, 
> LUCENE-8292.patch
>
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
> methods.
> It misses some seekExact() methods, so it is not possible for the delegate 
> to override these methods to have specific behavior (unlike the TermsEnum API, 
> which allows that).
> The fix is straightforward: simply override these seekExact() methods and 
> delegate.






[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2018-05-03 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462409#comment-16462409
 ] 

Bruno Roustant commented on LUCENE-8292:


Another option would be to modify the TermsEnum.seekExact() method and make it 
final, or have the javadoc be explicit that it should not be overridden.

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> --
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2.1
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: trunk
>
> Attachments: 
> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, 
> LUCENE-8292.patch
>
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
> methods.
> It misses some seekExact() methods, so it is not possible for the delegate 
> to override these methods to have specific behavior (unlike the TermsEnum API, 
> which allows that).
> The fix is straightforward: simply override these seekExact() methods and 
> delegate.






[jira] [Created] (SOLR-11866) Support efficient subset matching in query elevation rules

2018-01-17 Thread Bruno Roustant (JIRA)
Bruno Roustant created SOLR-11866:
-

 Summary: Support efficient subset matching in query elevation rules
 Key: SOLR-11866
 URL: https://issues.apache.org/jira/browse/SOLR-11866
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SearchComponents - other
Affects Versions: master (8.0)
Reporter: Bruno Roustant


Leverages the SOLR-11865 refactoring by introducing a 
SubsetMatchElevationProvider in QueryElevationComponent. This provider calls a 
new util class TrieSubsetMatcher to efficiently match all query elevation rules 
whose subset of terms is contained in the current query's list of terms.
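
To illustrate the idea, here is a minimal sketch of trie-based subset matching. It is not the TrieSubsetMatcher from the patch: the class SubsetTrieSketch, its method names and the choice of String terms are assumptions for illustration only. Each rule's terms are sorted and stored as a path in a trie; at query time the sorted query terms are walked so that every stored path made only of query terms is found.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Illustrative sketch only; not the TrieSubsetMatcher from the patch. */
final class SubsetTrieSketch<V> {

  private static final class Node<V> {
    final Map<String, Node<V>> children = new HashMap<>();
    final List<V> values = new ArrayList<>();
  }

  private final Node<V> root = new Node<>();

  /** Registers a rule: the value matches any query that contains all the given terms. */
  void addSubset(Collection<String> terms, V value) {
    Node<V> node = root;
    // Sort and de-duplicate so that each set of terms maps to exactly one path.
    for (String term : new TreeSet<>(terms)) {
      node = node.children.computeIfAbsent(term, t -> new Node<>());
    }
    node.values.add(value);
  }

  /** Returns the values of all registered subsets contained in the query terms. */
  List<V> findSubsetsMatching(Collection<String> queryTerms) {
    List<String> sorted = new ArrayList<>(new TreeSet<>(queryTerms));
    List<V> matches = new ArrayList<>();
    collect(root, sorted, 0, matches);
    return matches;
  }

  private void collect(Node<V> node, List<String> sorted, int from, List<V> matches) {
    matches.addAll(node.values);
    // Stored paths are sorted, so only query terms after the current one can extend a path.
    for (int i = from; i < sorted.size(); i++) {
      Node<V> child = node.children.get(sorted.get(i));
      if (child != null) {
        collect(child, sorted, i + 1, matches);
      }
    }
  }

  public static void main(String[] args) {
    SubsetTrieSketch<String> matcher = new SubsetTrieSketch<>();
    matcher.addSubset(Arrays.asList("solr", "elevation"), "rule1");
    matcher.addSubset(Arrays.asList("lucene"), "rule2");
    matcher.addSubset(Arrays.asList("solr", "trie"), "rule3");
    System.out.println(matcher.findSubsetsMatching(Arrays.asList("elevation", "rules", "in", "solr")));
  }
}
```

Note that the matches come back in trie traversal order, not in rule declaration order, which is consistent with the matcher being described as unsorted.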






[jira] [Created] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-01-17 Thread Bruno Roustant (JIRA)
Bruno Roustant created SOLR-11865:
-

 Summary: Refactor QueryElevationComponent to prepare query subset 
matching
 Key: SOLR-11865
 URL: https://issues.apache.org/jira/browse/SOLR-11865
 Project: Solr
  Issue Type: Improvement
  Security Level: Public (Default Security Level. Issues are Public)
  Components: SearchComponents - other
Affects Versions: master (8.0)
Reporter: Bruno Roustant
 Fix For: master (8.0)


The goal is to prepare a second improvement to support query terms subset 
matching for query elevation rules.

Before that, we need to refactor the QueryElevationComponent. We make it 
extensible. We introduce the ElevationProvider interface, which will be 
implemented later in a second patch to support subset matching. The current 
full-query match policy becomes a default simple MapElevationProvider.

- Add overridable methods to handle exceptions during the component 
initialization.
- Add overridable methods to provide the default values for config properties.
- No functional change beyond refactoring.
- Adapt unit test.
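
As a rough sketch of the shape this refactoring aims at (not the code from the patch): the names ElevationProvider, MapElevationProvider, getElevationForQuery and size come from this issue, but the signatures, the use of String document ids and the null-on-no-match convention below are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch; signatures and types are assumed, not taken from the patch. */
final class ElevationProviderSketch {

  /** Looks up the elevation that applies to a query. */
  interface ElevationProvider {
    /** Returns the elevated document ids for the query, or null when no rule matches (assumed convention). */
    List<String> getElevationForQuery(String queryString);

    /** Number of elevation rules held by this provider. */
    int size();
  }

  /** Full-query match policy: a rule applies only when the whole query string equals its key. */
  static final class MapElevationProvider implements ElevationProvider {
    private final Map<String, List<String>> elevatedIdsByQuery;

    MapElevationProvider(Map<String, List<String>> rules) {
      // Defensive copy so the provider cannot change after construction.
      Map<String, List<String>> copy = new HashMap<>();
      for (Map.Entry<String, List<String>> rule : rules.entrySet()) {
        copy.put(rule.getKey(), Collections.unmodifiableList(new ArrayList<>(rule.getValue())));
      }
      this.elevatedIdsByQuery = Collections.unmodifiableMap(copy);
    }

    @Override
    public List<String> getElevationForQuery(String queryString) {
      return elevatedIdsByQuery.get(queryString);
    }

    @Override
    public int size() {
      return elevatedIdsByQuery.size();
    }
  }

  public static void main(String[] args) {
    Map<String, List<String>> rules = new HashMap<>();
    rules.put("ipod", Arrays.asList("doc1", "doc2"));
    ElevationProvider provider = new MapElevationProvider(rules);
    System.out.println(provider.getElevationForQuery("ipod"));
    System.out.println(provider.getElevationForQuery("ipod nano"));
  }
}
```

With this split, a subset-matching provider can later implement the same interface without touching the component's query handling.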






[jira] [Updated] (SOLR-11866) Support efficient subset matching in query elevation rules

2018-01-18 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11866:
--
Attachment: SOLR-11866.patch
0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch

> Support efficient subset matching in query elevation rules
> --
>
> Key: SOLR-11866
> URL: https://issues.apache.org/jira/browse/SOLR-11866
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: 
> 0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch, 
> SOLR-11866.patch
>
>
> Leverages the SOLR-11865 refactoring by introducing a 
> SubsetMatchElevationProvider in QueryElevationComponent. This provider calls 
> a new util class TrieSubsetMatcher to efficiently match all query elevation 
> rules whose subset of terms is contained in the current query's list of terms.






[jira] [Created] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-02-05 Thread Bruno Roustant (JIRA)
Bruno Roustant created LUCENE-8159:
--

 Summary: Add a copy constructor in AutomatonQuery to copy directly 
the compiled automaton
 Key: LUCENE-8159
 URL: https://issues.apache.org/jira/browse/LUCENE-8159
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Affects Versions: trunk
Reporter: Bruno Roustant


When a query is composed of multiple AutomatonQuery instances that share the 
same automaton but target different fields, it is much more efficient to reuse 
the already compiled automaton by copying it directly and just changing the 
target field.
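
A minimal sketch of the copy-constructor idea, under stated assumptions: it does not use AutomatonQuery or Lucene's automaton classes. java.util.regex.Pattern stands in for the compiled automaton, and FieldPatternQuery is a hypothetical class name. The point is only that the copy shares the already compiled object and changes the field.

```java
import java.util.regex.Pattern;

/** Illustrative sketch; Pattern is a stand-in for a compiled automaton. */
final class CompiledFieldQuerySketch {

  static final class FieldPatternQuery {
    final String field;
    final Pattern compiled;

    /** Regular constructor: pays the compilation cost. */
    FieldPatternQuery(String field, String regex) {
      this.field = field;
      this.compiled = Pattern.compile(regex);
    }

    /** Copy constructor: reuses the compiled form and only changes the target field. */
    FieldPatternQuery(FieldPatternQuery other, String newField) {
      this.field = newField;
      // Sharing is safe here because Pattern instances are immutable.
      this.compiled = other.compiled;
    }

    boolean matches(String fieldName, String value) {
      return field.equals(fieldName) && compiled.matcher(value).matches();
    }
  }

  public static void main(String[] args) {
    FieldPatternQuery onTitle = new FieldPatternQuery("title", "luc.*");
    FieldPatternQuery onBody = new FieldPatternQuery(onTitle, "body");
    System.out.println(onBody.matches("body", "lucene"));
    System.out.println(onBody.compiled == onTitle.compiled);
  }
}
```

The same reasoning only holds if the compiled form is immutable or otherwise safe to share between query instances.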






[jira] [Updated] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-02-05 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-8159:
---
Attachment: LUCENE-8159.patch
0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch

> Add a copy constructor in AutomatonQuery to copy directly the compiled 
> automaton
> 
>
> Key: LUCENE-8159
> URL: https://issues.apache.org/jira/browse/LUCENE-8159
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: trunk
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: 
> 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, 
> LUCENE-8159.patch
>
>
> When a query is composed of multiple AutomatonQuery instances that share the 
> same automaton but target different fields, it is much more efficient to 
> reuse the already compiled automaton by copying it directly and just changing 
> the target field.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-01-17 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: SOLR-11865.patch
0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-04-05 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: (was: SOLR-11865.patch)

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-04-05 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: SOLR-11865.patch
0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-04-05 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426589#comment-16426589
 ] 

Bruno Roustant commented on SOLR-11865:
---

New delta patch with the modification mentioned.

Eventually I'll squash the commits to produce a single patch that should be 
supported by "Yetus" (currently I simply use git format-patch and it produces 
three separate patch files for three commits).

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Comment Edited] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-04-05 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426589#comment-16426589
 ] 

Bruno Roustant edited comment on SOLR-11865 at 4/5/18 7:44 AM:
---

New delta patch with the modifications mentioned.

Eventually I'll squash the commits to produce a single patch that should be 
supported by "Yetus" (currently I simply use git format-patch and it produces 
three separate patch files for three commits).


was (Author: bruno.roustant):
New delta patch with the modification mentioned.

Eventually I'll squash the commits to produce a single patch that should be 
supported by "Yetus" (currently I simply use git format-patch and it produces 
three separate patch files for three commits).

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-04-24 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450065#comment-16450065
 ] 

Bruno Roustant commented on SOLR-11865:
---

Sorry for the delay.

Yes, if you can take it from here, that would be awesome!
 * Getters for defaults: you're right, there is no need. Please remove them.
 * keepElevationPriority as a constant in QEC: good point.
 * keepElevationPriority meaning:
Actually the comment is not right; maybe the sorting has changed since the time 
I wrote this comment. I don't think it is linked anymore to forceElevation, 
since the ElevationComparatorSource can be added as a SortField even if 
forceElevation=false when one sorts by score.
The point is:
- with keepElevationPriority=true, the behavior is unchanged: the elevated 
documents (on top) are sorted by the order of the elevation rules and elevated 
ids in the config file.
- with keepElevationPriority=false, the behavior changes: the elevated 
documents (still on top) are in any order, and they may be re-ordered by other 
sort fields (this will allow the use of the efficient but unsorted 
TrieSubsetMatcher in the other patch).
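
A small sketch of the two behaviors, assuming a simplified model: this is not the ElevationComparatorSource, and the Doc class, the sortedIds helper and the use of score as the "other sort field" are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch of the ordering only; not Solr's ElevationComparatorSource. */
final class ElevationOrderSketch {

  static final class Doc {
    final String id;
    final float score;

    Doc(String id, float score) {
      this.id = id;
      this.score = score;
    }
  }

  /**
   * Sorts documents with elevated ones on top.
   * elevatedIds is assumed to be in the order of the config file.
   */
  static List<String> sortedIds(List<Doc> docs, List<String> elevatedIds, boolean keepElevationPriority) {
    Comparator<Doc> order = (a, b) -> {
      int rankA = elevatedIds.indexOf(a.id);
      int rankB = elevatedIds.indexOf(b.id);
      boolean elevatedA = rankA >= 0;
      boolean elevatedB = rankB >= 0;
      if (elevatedA != elevatedB) {
        return elevatedA ? -1 : 1; // elevated documents always stay on top
      }
      if (elevatedA && keepElevationPriority && rankA != rankB) {
        return Integer.compare(rankA, rankB); // config file order wins
      }
      return Float.compare(b.score, a.score); // other sort field: score, descending
    };
    List<Doc> sorted = new ArrayList<>(docs);
    sorted.sort(order);
    List<String> ids = new ArrayList<>();
    for (Doc doc : sorted) {
      ids.add(doc.id);
    }
    return ids;
  }

  public static void main(String[] args) {
    List<Doc> docs = Arrays.asList(new Doc("A", 1f), new Doc("B", 5f), new Doc("C", 3f), new Doc("D", 9f));
    List<String> elevated = Arrays.asList("A", "C");
    System.out.println(sortedIds(docs, elevated, true));
    System.out.println(sortedIds(docs, elevated, false));
  }
}
```

In this model, keepElevationPriority=true yields A, C first (config order), while keepElevationPriority=false lets the score reorder them to C, A; in both cases the non-elevated documents follow.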

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Comment Edited] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-04-24 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450065#comment-16450065
 ] 

Bruno Roustant edited comment on SOLR-11865 at 4/24/18 3:37 PM:


Sorry for the delay.

Yes, if you can take it from here, that would be awesome!
 * Getters for defaults: you're right, there is no need. Please remove them.
 * keepElevationPriority as a constant in QEC: good point.
 * keepElevationPriority meaning:
 Actually the comment is not right; maybe the sorting has changed since the 
time I wrote this comment. I don't think it is linked anymore to forceElevation, 
since the ElevationComparatorSource can be added as a SortField even if 
forceElevation=false when one sorts by score.
 The point is:

 - with keepElevationPriority=true, the behavior is unchanged: the elevated 
documents (on top) are sorted by the order of the elevation rules and elevated 
ids in the config file.
 - with keepElevationPriority=false, the behavior changes: the elevated 
documents (still on top) are in any order (this will allow the use of the 
efficient but unsorted TrieSubsetMatcher in the other patch), and they may be 
re-ordered by other sort fields.


was (Author: bruno.roustant):
Sorry for the delay.

Yes, if you can take it from here, that would be awesome!
 * Getters for defaults: you're right, there is no need. Please remove them.
 * keepElevationPriority as a constant in QEC: good point.
 * keepElevationPriority meaning:
Actually the comment is not right, maybe the sorting has changed since the time 
I wrote this comment. I don't think it is linked anymore to forceElevation 
since the ElevationComparatorSource can be added as a SortField even if 
forceElevation=false when one sort by score.
The point is
- with keepElevationPriority=true, the behavior is unchanged, the elevated 
documents (on top) are sorted by the order of the elevation rules and elevated 
ids in the config file.
- with keepElevationPriority=false, the behavior changes, the elevated 
documents (still on top) are in any order, and they may be re-ordered by other 
sort fields (this will allow the use of the efficient but unsorted 
TrieSubsetMatcher in the other patch).

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, 
> 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420302#comment-16420302
 ] 

Bruno Roustant commented on SOLR-11865:
---

4- No "Can be overridden by extending this class".

Sure. Removed.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420301#comment-16420301
 ] 

Bruno Roustant commented on SOLR-11865:
---

3- The indentation around line ~671 (contents of the for loop) is messed up.

I didn't change that part. I'll try to fix the indentation.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: SOLR-11865.patch

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: (was: SOLR-11865.patch)

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching for query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extensible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420300#comment-16420300
 ] 

Bruno Roustant commented on SOLR-11865:
---

2- ElevationProvider should be immutable and simplified:

Good point. createElevationProvider() now accepts the elevationBuilderMap, and 
getElevationForQuery() no longer throws IOException.

ElevationProvider.size is used by tests to verify the number of parsed rules. I 
added the @VisibleForTesting annotation.
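
To illustrate the shape described in point 2, here is a minimal sketch of an immutable provider built once from a builder map. ElevationProviderSketch and MapElevationProviderSketch are hypothetical, simplified stand-ins (plain strings instead of the real query and document types); they are not the classes from the patch.

```java
import java.util.*;

// Hypothetical, simplified stand-in for the interface discussed above.
interface ElevationProviderSketch {
  // Returns the elevated document ids for a query, or an empty list when no rule matches.
  // No checked exception is declared.
  List<String> getElevationForQuery(String query);

  // Number of rules held; per the comment above, the real method is kept for tests only.
  int size();
}

// Immutable full-query-match provider, built once from a builder map.
final class MapElevationProviderSketch implements ElevationProviderSketch {
  private final Map<String, List<String>> rules;

  MapElevationProviderSketch(Map<String, List<String>> elevationBuilderMap) {
    Map<String, List<String>> copy = new HashMap<>();
    for (Map.Entry<String, List<String>> e : elevationBuilderMap.entrySet()) {
      // Defensive copies, so later changes to the builder map are not visible here.
      copy.put(e.getKey(), Collections.unmodifiableList(new ArrayList<>(e.getValue())));
    }
    this.rules = Collections.unmodifiableMap(copy);
  }

  @Override
  public List<String> getElevationForQuery(String query) {
    return rules.getOrDefault(query, Collections.emptyList());
  }

  @Override
  public int size() {
    return rules.size();
  }
}
```

Because the constructor copies the map, the provider can be shared across requests without synchronization, which is presumably the point of making it immutable.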

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420322#comment-16420322
 ] 

Bruno Roustant commented on SOLR-11865:
---

10- seen.contains(id) == false.

I didn't know this Lucene practice. It explains why I see this strange 
construct.

"I recommend against modifying existing lines" - that's what I tried (see 
points 3,5,6 above) and I thought this "!seen.contains(id)" was tiny and 
harmless.
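
For readers unfamiliar with the construct in point 10, a small sketch of the two equivalent forms. SeenCheckSketch and dedupe are hypothetical names used only for illustration; the rationale given in the comment (a leading "!" is easy to overlook) is my understanding of the convention, not something stated in this thread.

```java
import java.util.*;

final class SeenCheckSketch {
  // Returns the ids in first-seen order with duplicates dropped.
  static List<String> dedupe(List<String> ids) {
    Set<String> seen = new HashSet<>();
    List<String> out = new ArrayList<>();
    for (String id : ids) {
      // Behaves exactly like: if (!seen.contains(id)) { ... }
      // The "== false" form is presumably preferred because a leading "!" is easy to miss.
      if (seen.contains(id) == false) {
        seen.add(id);
        out.add(id);
      }
    }
    return out;
  }
}
```

Both forms compile to the same check, so this is purely a readability convention.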

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420317#comment-16420317
 ] 

Bruno Roustant commented on SOLR-11865:
---

8- Use a UnaryOperator instead of IndexedValueProvider.

Good point. It is still clear with less code.
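
A rough sketch of what point 8 amounts to: a custom single-method interface that maps one value to another of the same type can be replaced by java.util.function.UnaryOperator. The class name, the toIndexed method and the lowercase conversion are assumptions for illustration only; the real conversion in the component is not shown in this thread.

```java
import java.util.*;
import java.util.function.UnaryOperator;

final class IndexedValueSketch {
  // Before (hypothetical shape of the removed interface):
  //   interface IndexedValueProvider { String getIndexedValue(String readableValue); }
  // After: the standard functional interface does the same job with less code.
  static List<String> toIndexed(List<String> readableIds, UnaryOperator<String> indexedValueProvider) {
    List<String> out = new ArrayList<>(readableIds.size());
    for (String id : readableIds) {
      out.add(indexedValueProvider.apply(id));
    }
    return out;
  }
}
```

A caller can then pass a lambda or a method reference, for example toIndexed(ids, s -> s.toLowerCase(Locale.ROOT)).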

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420329#comment-16420329
 ] 

Bruno Roustant commented on SOLR-11865:
---

11- subsetMatch flag in ElevatingQuery.

Yes, the idea is to support some queries with subset match, and others without. 
This will be supported by the next ElevationProvider in the next patch.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420311#comment-16420311
 ] 

Bruno Roustant commented on SOLR-11865:
---

6- Use {{localBoosts.addAll(boosted.keySet());}} at line ~661 instead of manual 
looping.

Again, I didn't change that (and I didn't want to touch existing code without 
reason).

I fixed it by directly removing localBoosts, which was an exact copy of 
boosts.keySet() (the boosts parameter is a map).
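
To make point 6 concrete, a sketch of the three variants: the manual loop, the addAll() form the reviewer suggested, and using the key set directly with no copy. BoostKeysSketch and the String/Integer types are assumptions for illustration; only the localBoosts and boosted names come from the review comment.

```java
import java.util.*;

final class BoostKeysSketch {
  // Original style: copy the keys one by one.
  static Set<String> manualCopy(Map<String, Integer> boosted) {
    Set<String> localBoosts = new HashSet<>();
    for (String key : boosted.keySet()) {
      localBoosts.add(key);
    }
    return localBoosts;
  }

  // Reviewer's suggestion: a single bulk call.
  static Set<String> bulkCopy(Map<String, Integer> boosted) {
    Set<String> localBoosts = new HashSet<>();
    localBoosts.addAll(boosted.keySet());
    return localBoosts;
  }

  // No copy at all: keySet() is a live view backed by the map.
  static Set<String> noCopy(Map<String, Integer> boosted) {
    return boosted.keySet();
  }
}
```

The no-copy variant is only safe when the map is not modified while the set is in use, since the view reflects later changes.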

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420308#comment-16420308
 ] 

Bruno Roustant commented on SOLR-11865:
---

5- Change comparator docVal (~line 1318) to use getOrDefault.

I didn't change that. Fixed.
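
As an illustration of point 5, a sketch of a docVal-style lookup before and after switching to Map.getOrDefault. DocValSketch, the priority map and the default of 0 are assumptions; the actual comparator code is not shown in this thread.

```java
import java.util.*;

final class DocValSketch {
  // Explicit null check, as older code typically did.
  static int docValOld(Map<String, Integer> priority, String id) {
    Integer prio = priority.get(id);
    return prio == null ? 0 : prio;
  }

  // Same lookup with getOrDefault (Java 8+).
  // Note: the two differ only if the map stores a null value for the key.
  static int docVal(Map<String, Integer> priority, String id) {
    return priority.getOrDefault(id, 0);
  }
}
```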

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420319#comment-16420319
 ] 

Bruno Roustant commented on SOLR-11865:
---

9- Make the constructor of ElevatingQuery protected.

Done.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420272#comment-16420272
 ] 

Bruno Roustant commented on SOLR-11865:
---

1- InitializationExceptionHandler & LoadingExceptionHandler:

At Salesforce (i.e. in a multi-tenant context) we allow each organization admin 
to update the list of elevation rules dynamically. When some rules are updated, 
the core corresponding to the organization is updated to reload the elevation 
rules XML. It is important to note that the organization admin - the person who 
defines the elevation rules - is not a Solr admin expert. He needs to get clear 
feedback on any error that may prevent the rules from being loaded. The XML 
rules are considered dynamic config rather than static config.

In its original version, the QueryElevationComponent simply throws an exception.

In this new version, it differentiates the error cause and lets an extending 
class (e.g. a specific Salesforce extension) override the handling of the 
loading exception and take appropriate actions (logging, warning, etc.) instead 
of simply throwing the Solr exception.
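
A minimal sketch of the pattern described in point 1, assuming a hook method that receives the error cause. All names here (ElevationLoaderSketch, LoadingCause, handleLoadingException, TenantElevationLoaderSketch) are hypothetical, and the XML check is a toy stand-in for real parsing; the patch's actual handler signatures are not shown in this thread.

```java
import java.util.*;

// Hypothetical base component: names and causes are illustrative, not the patch's API.
class ElevationLoaderSketch {
  enum LoadingCause { MISSING_FILE, INVALID_RULES }

  // Default behavior mirrors the original component: simply throw.
  // A subclass may instead return the number of rules to continue with.
  protected int handleLoadingException(LoadingCause cause, RuntimeException e) {
    throw e;
  }

  // Loads the rules and returns how many <query> entries were found.
  final int load(String rulesXml) {
    if (rulesXml == null) {
      return handleLoadingException(LoadingCause.MISSING_FILE,
          new IllegalStateException("elevation file not found"));
    }
    if (!rulesXml.contains("<elevate>")) {
      return handleLoadingException(LoadingCause.INVALID_RULES,
          new IllegalArgumentException("missing <elevate> root element"));
    }
    int count = 0;
    for (int i = rulesXml.indexOf("<query "); i >= 0; i = rulesXml.indexOf("<query ", i + 1)) {
      count++;
    }
    return count;
  }
}

// Multi-tenant style extension: record a readable warning and keep the core alive.
class TenantElevationLoaderSketch extends ElevationLoaderSketch {
  final List<String> warnings = new ArrayList<>();

  @Override
  protected int handleLoadingException(LoadingCause cause, RuntimeException e) {
    warnings.add(cause + ": " + e.getMessage());
    return 0; // continue with no elevation rules
  }
}
```

The base class keeps the original fail-fast behavior, so existing deployments see no functional change unless they override the hook.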

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420313#comment-16420313
 ] 

Bruno Roustant commented on SOLR-11865:
---

7- In parseExcludedMarkerFieldName and parseEditorialMarkerFieldName.

Removed the if () block.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Comment Edited] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420322#comment-16420322
 ] 

Bruno Roustant edited comment on SOLR-11865 at 3/30/18 9:30 AM:


10- seen.contains(id) == false.

I didn't know this Lucene practice. It explains why I see this strange 
construct.

"I recommend against modifying existing lines" - that's what I tried (see 
points 3,5,6 above) and I thought this "!seen.contains(id)" was tiny and 
harmless. And that's a warning highlighted by IntelliJ by the way :)


was (Author: bruno.roustant):
10- seen.contains(id) == false.

I didn't know this Lucene practice. It explains why I see this strange 
construct.

"I recommend against modifying existing lines" - that's what I tried (see 
points 3,5,6 above) and I thought this "!seen.contains(id)" was tiny and 
harmless.

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: 0002-Refactor-QueryElevationComponent-after-review.patch

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: (was: SOLR-11865.patch)

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: (was: 
0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch)

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: (was: 
0002-Refactor-QueryElevationComponent-after-review.patch)

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11865:
--
Attachment: SOLR-11865.patch
0002-Refactor-QueryElevationComponent-after-review.patch
0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching

2018-03-30 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420620#comment-16420620
 ] 

Bruno Roustant commented on SOLR-11865:
---

[~dsmiley] I uploaded a new patch. Is it better now?

> Refactor QueryElevationComponent to prepare query subset matching
> -
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: SearchComponents - other
>Affects Versions: master (8.0)
>Reporter: Bruno Roustant
>Priority: Minor
>  Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 
> 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 
> 0002-Refactor-QueryElevationComponent-after-review.patch, SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset 
> matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it 
> extendible. We introduce the ElevationProvider interface which will be 
> implemented later in a second patch to support subset matching. The current 
> full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component 
> initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.






[jira] [Commented] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-03-05 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385824#comment-16385824
 ] 

Bruno Roustant commented on LUCENE-8159:


Ok. I'll let you guys decide whether to discard this patch.

[~jpountz] I'm curious about searching a lot of fields.
{quote}searching over lots of fields is a bad practice
{quote}
Could you tell me why it is a bad practice? Is it due to the performance 
impact, or are there other reasons by design?

Generally customer organizations love to have lots of fields. While I agree 
that sometimes they should revisit their data partitioning, there are cases 
where searching many fields helps (e.g. CRM, field level security, ML ranking 
model based on field matches).

> Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
>
> Key: LUCENE-8159
> URL: https://issues.apache.org/jira/browse/LUCENE-8159
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Affects Versions: trunk
> Reporter: Bruno Roustant
> Assignee: David Smiley
> Priority: Major
> Attachments: 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, 
> LUCENE-8159.patch
>
>
> When the query is composed of multiple AutomatonQuery with the same automaton 
> and which target different fields, it is much more efficient to reuse the 
> already compiled automaton by copying it directly and just changing the 
> target field.






[jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-02-27 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378373#comment-16378373
 ] 

Bruno Roustant edited comment on LUCENE-8159 at 2/27/18 10:42 AM:
--

{quote}I'd rather like to expose an expert constructor that takes a compiled 
automaton and expect users to compile the automaton themselves if they plan to 
reuse it in multiple queries?
{quote}
I can speak as such a "user" since I have this use case. We often build 
queries with the same prefix/wildcard query for multiple different fields (and 
sometimes many fields). As a user, I really appreciate being able to simply 
copy a PrefixQuery or WildcardQuery rather than building the automaton myself. 
The inner automaton inside PrefixQuery is hidden, and the logic is internal to 
the PrefixQuery. I don't want to have to know how it is built.

I agree with exposing the compiled automaton.
{quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same 
constructors too?
{quote}
I indeed prepared the same copy constructors for these classes. I didn't have 
time to resubmit the patch yet, but that's the idea, yes.









[jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-02-27 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378373#comment-16378373
 ] 

Bruno Roustant edited comment on LUCENE-8159 at 2/27/18 11:29 AM:
--

{quote}I'd rather like to expose an expert constructor that takes a compiled 
automaton and expect users to compile the automaton themselves if they plan to 
reuse it in multiple queries?
{quote}
I can speak as such a "user" since I have this use case. We often build 
queries with the same prefix/wildcard query for multiple different fields (and 
sometimes many fields, in which case the optimization does help). As a user, I 
really appreciate being able to simply copy a PrefixQuery or WildcardQuery 
rather than building the automaton myself. The inner automaton inside 
PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't 
want to have to know how it is built.

I agree with exposing the compiled automaton. But I still think PrefixQuery and 
WildcardQuery would benefit from a new constructor. This constructor cannot 
take an arbitrary automaton as a parameter, because that could break the 
prefix/wildcard contract. So, to me, PrefixQuery and WildcardQuery should have 
their own copy constructors.
{quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same 
constructors too?
{quote}
I indeed prepared the same copy constructors for these classes. I didn't have 
time to resubmit the patch yet, but that's the idea, yes.









[jira] [Commented] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-02-27 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378373#comment-16378373
 ] 

Bruno Roustant commented on LUCENE-8159:


{quote}I'd rather like to expose an expert constructor that takes a compiled 
automaton and expect users to compile the automaton themselves if they plan to 
reuse it in multiple queries?
{quote}
I can speak as such a "user" since I have this use case. We often build 
queries with the same prefix/wildcard query for multiple different fields (and 
sometimes many fields). As a user, I really appreciate being able to simply 
copy a PrefixQuery or WildcardQuery rather than building the automaton myself. 
The inner automaton inside PrefixQuery is hidden, and the logic is internal to 
the PrefixQuery. I don't want to have to know how it is built.

I agree with exposing the compiled automaton, although I find the copy 
constructor easier to use.
{quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same 
constructors too?
{quote}
I indeed prepared the same copy constructors for these classes. I didn't have 
time to resubmit the patch yet, but that's the idea, yes.







[jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-02-27 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378373#comment-16378373
 ] 

Bruno Roustant edited comment on LUCENE-8159 at 2/27/18 10:44 AM:
--

{quote}I'd rather like to expose an expert constructor that takes a compiled 
automaton and expect users to compile the automaton themselves if they plan to 
reuse it in multiple queries?
{quote}
I can speak as such a "user" since I have this use case. We often build 
queries with the same prefix/wildcard query for multiple different fields (and 
sometimes many fields, in which case the optimization does help). As a user, I 
really appreciate being able to simply copy a PrefixQuery or WildcardQuery 
rather than building the automaton myself. The inner automaton inside 
PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't 
want to have to know how it is built.

I agree with exposing the compiled automaton.
{quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same 
constructors too?
{quote}
I indeed prepared the same copy constructors for these classes. I didn't have 
time to resubmit the patch yet, but that's the idea, yes.









[jira] [Commented] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-02-28 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380407#comment-16380407
 ] 

Bruno Roustant commented on LUCENE-8159:


[~rcmuir] could you be a little more explicit?

Without context I don't understand why a copy constructor is bad in Java in 
general.







[jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2018-02-28 Thread Bruno Roustant (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380407#comment-16380407
 ] 

Bruno Roustant edited comment on LUCENE-8159 at 2/28/18 2:58 PM:
-

[~rcmuir] could you be a little more explicit?

Without context I don't understand why a copy constructor is bad in Java in 
general.

Do you mean you prefer a copy method?

PrefixQuery copy(String field)
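
To make the copy-method idea concrete, here is a small self-contained sketch. It does not use Lucene: CompiledMatcher is a hypothetical stand-in for the compiled automaton, and PrefixQuerySketch is a hypothetical stand-in for PrefixQuery. It only shows the shape of the proposal, namely that the copy shares the already compiled structure and changes just the target field.

```java
import java.util.regex.Pattern;

/** Illustrative only: stand-in classes, not Lucene's PrefixQuery or CompiledAutomaton. */
class CopyWithFieldSketch {

  /** Stand-in for a compiled automaton: costly to build, immutable, safe to share. */
  static final class CompiledMatcher {
    static int compileCount = 0; // counts compilations, for demonstration only
    private final Pattern pattern;

    CompiledMatcher(String prefix) {
      compileCount++;
      pattern = Pattern.compile(Pattern.quote(prefix) + ".*", Pattern.DOTALL);
    }

    boolean matches(String term) {
      return pattern.matcher(term).matches();
    }
  }

  static final class PrefixQuerySketch {
    final String field;
    final String prefix;
    private final CompiledMatcher matcher;

    PrefixQuerySketch(String field, String prefix) {
      this(field, prefix, new CompiledMatcher(prefix));
    }

    // Private: callers cannot inject an arbitrary matcher, so the prefix contract holds.
    private PrefixQuerySketch(String field, String prefix, CompiledMatcher matcher) {
      this.field = field;
      this.prefix = prefix;
      this.matcher = matcher;
    }

    /** Same prefix on another field, reusing the compiled matcher instead of rebuilding it. */
    PrefixQuerySketch copy(String newField) {
      return new PrefixQuerySketch(newField, prefix, matcher);
    }

    boolean matches(String term) {
      return matcher.matches(term);
    }
  }

  public static void main(String[] args) {
    PrefixQuerySketch title = new PrefixQuerySketch("title", "luc");
    PrefixQuerySketch body = title.copy("body");
    System.out.println(body.field + ":" + body.prefix + "* matches lucene? " + body.matches("lucene"));
    System.out.println("compilations: " + CompiledMatcher.compileCount);
  }
}
```

Keeping the sharing constructor private is one way to address the contract concern: users get the reuse without being able to pass in an automaton that is not a prefix automaton.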









[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809251#comment-16809251
 ] 

Bruno Roustant commented on LUCENE-8753:


{quote}I think this is similar to […] BlockTermsReader/Writer
{quote}
Indeed similar; it mainly differs from VariableGapTermsIndexWriter in the way 
it selects the best term to start a block, which is based on the minimal 
distinguishing prefix. The idea is to make the terms index FST more compact. 
That way, for a given max heap memory budget, we can potentially have more 
blocks, and therefore smaller ones that are scanned faster. This requirement to 
consume less heap was strong with Lucene 7.1; it may matter less now with the 
recent off-heap FST.
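
As a rough illustration of the block-start selection, here is a simplified sketch and not the code from the patch. It assumes the terms are sorted and distinct, measures the minimal distinguishing prefix (MDP) of a term against its predecessor, and within a window around the target block size picks the term with the shortest MDP as the first term of the next block. The class and method names are hypothetical, and the delta is an absolute number of terms here rather than a percentage.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical sketch of MDP-based block selection. */
class BlockSplitSketch {

  /** Length of the shortest prefix of current that sorts after previous. */
  static int mdpLength(String previous, String current) {
    int max = Math.min(previous.length(), current.length());
    int common = 0;
    while (common < max && previous.charAt(common) == current.charAt(common)) {
      common++;
    }
    return Math.min(common + 1, current.length());
  }

  /** Returns the indices, in the sorted term list, of the first term of each block. */
  static List<Integer> blockStarts(List<String> sortedTerms, int targetSize, int delta) {
    List<Integer> starts = new ArrayList<>();
    if (sortedTerms.isEmpty()) {
      return starts;
    }
    starts.add(0);
    int blockStart = 0;
    while (true) {
      // Candidate window for the next block start: target size plus or minus delta.
      int from = blockStart + Math.max(1, targetSize - delta);
      int to = Math.min(blockStart + targetSize + delta, sortedTerms.size() - 1);
      if (from > to) {
        break; // the remaining terms all belong to the last block
      }
      int best = from;
      for (int i = from + 1; i <= to; i++) {
        int candidate = mdpLength(sortedTerms.get(i - 1), sortedTerms.get(i));
        int current = mdpLength(sortedTerms.get(best - 1), sortedTerms.get(best));
        if (candidate < current) {
          best = i; // a shorter MDP gives a shorter key in the terms index
        }
      }
      starts.add(best);
      blockStart = best;
    }
    return starts;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList("aa", "ab", "abc", "abd", "b", "ba", "bb", "c");
    System.out.println(blockStarts(terms, 3, 1));
  }
}
```

The shorter the chosen prefixes, the fewer bytes the terms index has to store per block, which is where the more compact FST comes from.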

 
{quote}Are you also doing something different to encode/decode postings?
{quote}
No, the postings are written with the regular PostingsWriterBase.

 
{quote}Can you post results on the full wikimediumall?
{quote}
 Good point. Will do tomorrow.

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached PDF visually explains the technique in more detail)
>  The principle is to split the list of terms into blocks and use an FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley






[jira] [Created] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)
Bruno Roustant created LUCENE-8753:
--

 Summary: New PostingFormat - UniformSplit
 Key: LUCENE-8753
 URL: https://issues.apache.org/jira/browse/LUCENE-8753
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Affects Versions: 8.0
Reporter: Bruno Roustant
 Attachments: Uniform Split Technique.pdf

This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
objectives:
- Clear design and simple code.
- Easily extensible, for both the logic and the index format.
- Light memory usage with a very compact FST.
- Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.

(the attached PDF visually explains the technique in more detail)
 The principle is to split the list of terms into blocks and use an FST to 
access the block, but not as a prefix trie, rather with a seek-floor pattern. 
For the selection of the blocks, there is a target average block size (number 
of terms), with an allowed delta variation (10%) to compare the terms and 
select the one with the minimal distinguishing prefix.
There are also several optimizations inside the block to make it more compact 
and speed up the loading/scanning.
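
To illustrate the seek-floor pattern only: in the sketch below a java.util.TreeMap stands in for the FST, and the class name, block keys and file pointers are made up. The index maps each block's key (a short distinguishing prefix of the block's first term) to the block location, and a lookup seeks the greatest key that is less than or equal to the searched term, then scans that block. Java String ordering is used here for simplicity, which is an assumption and differs from byte-wise term ordering.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a TreeMap plays the role of the terms-index FST. */
class SeekFloorSketch {
  // block key -> file pointer of the block
  private final TreeMap<String, Long> index = new TreeMap<>();

  void addBlock(String blockKey, long filePointer) {
    index.put(blockKey, filePointer);
  }

  /**
   * Returns the pointer of the only block that may contain the term,
   * or -1 if the term sorts before the first block key.
   */
  long seekFloor(String term) {
    Map.Entry<String, Long> floor = index.floorEntry(term);
    return floor == null ? -1L : floor.getValue();
  }

  public static void main(String[] args) {
    SeekFloorSketch terms = new SeekFloorSketch();
    terms.addBlock("a", 0L);
    terms.addBlock("b", 100L);
    terms.addBlock("c", 200L);
    System.out.println("block for 'abd': " + terms.seekFloor("abd"));
    System.out.println("block for 'ba': " + terms.seekFloor("ba"));
  }
}
```

Because a block key sorts after every term of the previous block and not after the first term of its own block, the floor lookup always lands on the single block worth scanning.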

The performance obtained is interesting with the luceneutil benchmark, 
comparing UniformSplit with BlockTree. Find it in the first comment.
 
 Although the precise percentages vary between runs, three main points:
- TermQuery and PhraseQuery are improved.
- PrefixQuery and WildcardQuery are ok.
- Fuzzy queries are clearly less performant, because BlockTree is so optimized 
for them.

Compared to BlockTree, FST size is reduced by 15%, and segment writing time is 
reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree does.
 
 This initial version passes all Lucene tests. Use “ant test 
-Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.

Subjectively, we think we have fulfilled our goal of code simplicity. And we 
have already exercised this PostingsFormat extensibility to create a different 
flavor for our own use-case.
 
 Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley






[jira] [Updated] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-8753:
---
Attachment: luceneutil.benchmark.txt







[jira] [Updated] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated LUCENE-8753:
---
Description: 
This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
objectives:
 - Clear design and simple code.
 - Easily extensible, for both the logic and the index format.
 - Light memory usage with a very compact FST.
 - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.

(the attached PDF visually explains the technique in more detail)
 The principle is to split the list of terms into blocks and use an FST to 
access the block, but not as a prefix trie, rather with a seek-floor pattern. 
For the selection of the blocks, there is a target average block size (number 
of terms), with an allowed delta variation (10%) to compare the terms and 
select the one with the minimal distinguishing prefix.
 There are also several optimizations inside the block to make it more compact 
and speed up the loading/scanning.

The performance obtained is interesting with the luceneutil benchmark, 
comparing UniformSplit with BlockTree. Find it in the first comment and also 
attached for better formatting.

Although the precise percentages vary between runs, three main points:
 - TermQuery and PhraseQuery are improved.
 - PrefixQuery and WildcardQuery are ok.
 - Fuzzy queries are clearly less performant, because BlockTree is so optimized 
for them.

Compared to BlockTree, FST size is reduced by 15%, and segment writing time is 
reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree does.

This initial version passes all Lucene tests. Use “ant test 
-Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.

Subjectively, we think we have fulfilled our goal of code simplicity. And we 
have already exercised this PostingsFormat extensibility to create a different 
flavor for our own use-case.

Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808799#comment-16808799
 ] 

Bruno Roustant commented on LUCENE-8753:


Here's the Luceneutil benchmark with the wikimedium500k data set using Java 8. 
This is a bit dated using Lucene 7.1; it'd be nice to update to master.

 

Report after iter 19:

Task                  QPS blocktree  StdDev QPS uniformsplit  StdDev Pct diff
Fuzzy1                       508.47  (3.8%)           221.37  (0.9%)   -56.5%  (-58% - -53%)
Fuzzy2                       171.73  (6.4%)            80.62  (1.4%)   -53.1%  (-57% - -48%)
PKLookup                     182.47  (2.4%)           149.62  (2.5%)   -18.0%  (-22% - -13%)
Wildcard                    1788.74  (5.9%)          1729.37  (4.5%)    -3.3%  (-12% - 7%)
IntNRQ                      1561.48  (2.1%)          1564.33  (1.9%)     0.2%  (-3% - 4%)
Prefix3                     1759.69  (5.0%)          1829.74  (4.8%)     4.0%  (-5% - 14%)
HighTermDayOfYearSort        586.06  (5.4%)           622.34  (8.2%)     6.2%  (-6% - 20%)
MedPhrase                   1204.85  (5.5%)          1282.89  (7.7%)     6.5%  (-6% - 20%)
HighSpanNear                 590.88  (4.1%)           629.64  (6.1%)     6.6%  (-3% - 17%)
OrHighMed                   1101.48  (4.5%)          1220.75  (6.2%)    10.8%  (0% - 22%)
HighTermMonthSort           2617.10  (2.6%)          2916.34  (4.6%)    11.4%  (4% - 19%)
HighPhrase                   961.04  (5.5%)          1073.62  (6.0%)    11.7%  (0% - 24%)
MedSloppyPhrase              604.56 (13.3%)           680.31 (13.7%)    12.5%  (-12% - 45%)
LowSloppyPhrase              954.87  (8.1%)          1075.67  (5.4%)    12.7%  (0% - 28%)
MedSpanNear                  737.14  (5.8%)           830.68  (8.3%)    12.7%  (-1% - 28%)
OrHighHigh                   811.57  (5.7%)           915.01  (6.2%)    12.7%  (0% - 26%)
AndHighMed                  1157.45  (5.3%)          1317.78  (5.1%)    13.9%  (3% - 25%)
AndHighHigh                 1095.29  (5.7%)          1254.16  (4.9%)    14.5%  (3% - 26%)
HighSloppyPhrase             880.42  (8.2%)          1009.72  (7.0%)    14.7%  (0% - 32%)
LowPhrase                   1245.33  (6.0%)          1473.57  (4.4%)    18.3%  (7% - 30%)
Respell                       81.10 (12.7%)            99.43 (10.3%)    22.6%  (0% - 52%)
HighTerm                    3733.81  (6.1%)          4599.96  (6.8%)    23.2%  (9% - 38%)
OrHighLow                   1960.13  (6.2%)          2415.81  (6.0%)    23.2%  (10% - 37%)
MedTerm                     4411.60  (4.9%)          5450.56  (5.8%)    23.6%  (12% - 35%)
LowSpanNear                 1944.27  (5.3%)          2416.29  (4.5%)    24.3%  (13% - 36%)
AndHighLow                  1978.10  (7.6%)          2500.74  (5.8%)    26.4%  (12% - 43%)
LowTerm                     4949.24  (4.8%)          6589.86  (5.3%)    33.1%  (22% - 45%)


[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-03 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808852#comment-16808852
 ] 

Bruno Roustant commented on LUCENE-8753:


{quote}Is it due to the fact that it doesn't have the ability to fail lookups 
early like BlockTree?
{quote}
This is one cause. BlockTree builds a kind of prefix trie and may stop early if 
the prefix is not matched; UniformSplit does not, so it loads a block.

That said, I noticed that PKLookup performance varies a lot. It is sometimes in 
favor of UniformSplit. I don't actually know how the benchmark generates the 
test set, but it clearly influences the metric.
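
To illustrate the point about loading a block, here is a minimal sketch of a 
seek-floor lookup. It is not the UniformSplit code: a TreeMap stands in for the 
FST, the block key is simply the first term of each block (the real format uses 
a minimal distinguishing prefix), and the class name SeekFloorSketch and the 
blocksLoaded counter are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SeekFloorSketch {
    // Block key (here simply the first term of the block) -> terms of the block.
    private final TreeMap<String, List<String>> blocks = new TreeMap<>();
    int blocksLoaded = 0; // stands in for the cost of reading a block from disk

    SeekFloorSketch(List<String> sortedTerms, int blockSize) {
        for (int i = 0; i < sortedTerms.size(); i += blockSize) {
            int end = Math.min(i + blockSize, sortedTerms.size());
            List<String> block = sortedTerms.subList(i, end);
            blocks.put(block.get(0), block);
        }
    }

    boolean contains(String term) {
        // Seek-floor: the greatest block key <= term designates the only candidate block.
        Map.Entry<String, List<String>> floor = blocks.floorEntry(term);
        if (floor == null) {
            return false; // the term sorts before every block
        }
        blocksLoaded++; // no early exit on a missing prefix: the block is always loaded
        return floor.getValue().contains(term);
    }

    public static void main(String[] args) {
        SeekFloorSketch dict = new SeekFloorSketch(
            Arrays.asList("apple", "apply", "banana", "band", "cherry", "date"), 2);
        System.out.println(dict.contains("band"));   // true
        System.out.println(dict.contains("bandit")); // false, but a block was loaded
        System.out.println(dict.blocksLoaded);       // 2
    }
}
```

A lookup for an absent term such as "bandit" still pays for one block load, 
which is the cost a prefix trie can sometimes avoid.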

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (The attached PDF explains the technique visually and in more detail.)
>  The principle is to split the list of terms into blocks and use an FST to 
> access the blocks, not as a prefix trie but with a seek-floor pattern. 
> To select the blocks, there is a target average block size (number of terms) 
> with an allowed delta variation (10%); the terms within that delta are 
> compared and the one with the minimal distinguishing prefix is selected.
>  There are also several optimizations inside the block to make it more 
> compact and to speed up loading and scanning.
> The performance measured with the luceneutil benchmark, comparing 
> UniformSplit with BlockTree, is interesting. Find the results in the first 
> comment, and also attached for better formatting.
> Although the precise percentages vary between runs, three main points stand 
> out:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity, and we 
> have already exercised the extensibility of this PostingsFormat to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley






[jira] [Comment Edited] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-09 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813171#comment-16813171
 ] 

Bruno Roustant edited comment on LUCENE-8753 at 4/9/19 9:26 AM:


I agree.

We profiled wikimediumall and saw that 90% of the time is spent in scoring, 
and less than a couple of percent is spent accessing the dictionary blocks.

Our own use-case is to have multiple small-to-medium cores, about the size of 
wikimedium500k; that's why we studied it more.
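
As a back-of-the-envelope check of why the dictionary hardly matters on the 
large index, Amdahl's law can be applied to the profile above. This is only a 
sketch: taking "a couple of percent" as 2% is an assumption, and the class and 
method names are made up for the example.

```java
class DictionarySpeedupBound {
    /** Amdahl's law: overall speedup when `fraction` of the time becomes `factor` times faster. */
    static double overallSpeedup(double fraction, double factor) {
        return 1.0 / ((1.0 - fraction) + fraction / factor);
    }

    public static void main(String[] args) {
        double dictionaryFraction = 0.02; // assumption: "a couple of percent" taken as 2%
        // A dictionary twice as fast changes the overall time by about 1%.
        System.out.println(overallSpeedup(dictionaryFraction, 2.0));
        // Even an infinitely fast dictionary gains only about 2%.
        System.out.println(overallSpeedup(dictionaryFraction, Double.POSITIVE_INFINITY));
    }
}
```

Under that assumption, any dictionary improvement stays within the run-to-run 
noise of the benchmark on wikimediumall.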








[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-04-09 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813106#comment-16813106
 ] 

Bruno Roustant commented on LUCENE-8753:


It took me some time to run the wikimediumall 8 GB index (I didn't anticipate 
1h of indexing initially, a little less for UniformSplit, and then I had an 
exception about facets).

Then I got results which surprised me. BlockTree and UniformSplit had the same 
QPS for Term and Phrase queries. I didn't understand why the behavior differs 
between a small and a large index.

Then I thought of 2 possible explanations:
 * A much larger index could mean fewer OS IO cache hits. I ran the benchmark 
on a 16 GB laptop and a 64 GB desktop, and I got nearly no difference in my 
test.
 * A much larger index could mean more results. So the time spent scoring and 
ranking the results could become much larger and diminish the effect of a 
change in the dictionary. I have no clue there at the moment.

Here is the result of wikimediumall on a 64 GB desktop:

||Task||QPS BT||StdDev BT||QPS CUS||StdDev CUS||Pct diff||
|Fuzzy1|72.81|3.11|21.77|0.71|-72% - -67%|
|Fuzzy2|66.77|3.77|20.41|0.67|-72% - -66%|
|Respell|8.85|0.64|6.02|0.33|-40% - -22%|
|PKLookup|130.83|3.96|121.66|12.37|-18% - 5%|
|Wildcard|25.03|1.33|23.93|1.19|-13% - 6%|
|HighTermMonthSort|19.03|2.55|18.40|1.56|-21% - 21%|
|Prefix3|12.47|0.82|12.10|0.78|-14% - 10%|
|LowTerm|182.95|14.94|177.97|18.67|-19% - 17%|
|IntNRQ|5.21|0.54|5.09|0.56|-21% - 21%|
|MedTerm|90.74|3.99|89.14|4.24|-10% - 7%|
|HighTerm|42.54|1.95|41.86|2.00|-10% - 8%|
|OrNotHighLow|532.96|16.16|526.86|24.40|-8% - 6%|
|HighSloppyPhrase|12.00|0.39|11.90|0.48|-7% - 6%|
|OrNotHighMed|53.64|1.08|53.37|1.22|-4% - 3%|
|MedSloppyPhrase|31.83|0.59|31.67|0.78|-4% - 3%|
|HighPhrase|32.24|0.85|32.09|0.81|-5% - 4%|
|LowSloppyPhrase|29.51|0.43|29.40|0.58|-3% - 3%|
|AndHighHigh|26.97|0.31|26.88|0.37|-2% - 2%|
|MedPhrase|4.95|0.16|4.94|0.15|-6% - 6%|
|AndHighMed|50.03|0.72|49.97|0.72|-2% - 2%|
|OrNotHighHigh|18.85|0.76|18.85|0.82|-8% - 8%|
|OrHighNotHigh|9.35|0.32|9.35|0.35|-6% - 7%|
|OrHighLow|15.85|0.59|15.85|0.52|-6% - 7%|
|OrHighNotLow|17.56|0.71|17.57|0.70|-7% - 8%|
|AndHighLow|284.39|4.41|284.60|5.65|-3% - 3%|
|LowPhrase|224.73|4.35|224.97|4.84|-3% - 4%|
|OrHighNotMed|13.21|0.49|13.22|0.50|-7% - 7%|
|OrHighMed|13.22|0.73|13.30|0.70|-9% - 12%|
|OrHighHigh|7.56|0.43|7.62|0.41|-9% - 12%|
|BrowseMonthTaxoFacets|7.96|1.92|8.06|1.78|-36% - 63%|
|LowSpanNear|11.84|0.19|11.99|0.21|-2% - 4%|
|HighTermDayOfYearSort|20.05|1.40|20.31|2.15|-15% - 20%|
|BrowseDayOfYearTaxoFacets|7.96|1.91|8.07|1.85|-37% - 64%|
|BrowseMonthSSDVFacets|7.95|1.90|8.07|1.87|-37% - 64%|
|BrowseDayOfYearSSDVFacets|7.96|1.93|8.08|1.84|-36% - 64%|
|MedSpanNear|10.50|0.18|10.67|0.21|-2% - 5%|
|BrowseDateTaxoFacets|7.91|1.81|8.07|1.83|-35% - 62%|
|HighSpanNear|8.68|0.19|8.88|0.19|-2% - 6%|
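
For readers of the table, the ranges in the last column seem to be reproducible 
from the QPS and StdDev columns: the worst case compares the candidate minus 
its StdDev with the baseline plus its StdDev, and the best case does the 
opposite. The sketch below is inferred from the rows only, not taken from the 
luceneutil source, and the names are made up. Since the displayed inputs are 
rounded, a row with a large StdDev may be off by one point.

```java
class PctDiffRangeSketch {
    /** Returns {worst, best} percentage change, truncated toward zero. */
    static int[] range(double qpsBase, double stdDevBase, double qpsCand, double stdDevCand) {
        // Worst case: slowest plausible candidate against fastest plausible baseline.
        double worst = (qpsCand - stdDevCand) / (qpsBase + stdDevBase) - 1.0;
        // Best case: fastest plausible candidate against slowest plausible baseline.
        double best = (qpsCand + stdDevCand) / (qpsBase - stdDevBase) - 1.0;
        return new int[] {(int) (100 * worst), (int) (100 * best)};
    }

    public static void main(String[] args) {
        int[] fuzzy1 = range(72.81, 3.11, 21.77, 0.71);
        System.out.println(fuzzy1[0] + "% - " + fuzzy1[1] + "%");     // -72% - -67%
        int[] pkLookup = range(130.83, 3.96, 121.66, 12.37);
        System.out.println(pkLookup[0] + "% - " + pkLookup[1] + "%"); // -18% - 5%
    }
}
```

Read this way, a range that straddles zero (most rows here) means the 
difference is within the measured noise.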




[jira] [Created] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible

2019-06-06 Thread Bruno Roustant (JIRA)
Bruno Roustant created LUCENE-8836:
--

 Summary: Optimize DocValues TermsDict to continue scanning from 
the last position when possible
 Key: LUCENE-8836
 URL: https://issues.apache.org/jira/browse/LUCENE-8836
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Bruno Roustant


Lucene80DocValuesProducer.TermsDict is used to look up either a term or a 
term ordinal.

Currently it does not have the optimization that FSTEnum has: the ability to 
continue a sequential scan from where the last lookup ended in the IndexInput. 
For sparse lookups (when searching only a few terms or ordinals) this is not 
an issue. But for multiple lookups in a row, this optimization could save 
re-scanning all the terms from the block start (since they are delta encoded).

This patch proposes the optimization.

To estimate the gain, we ran 3 Lucene tests while counting the seeks and the 
term reads in the IndexInput, with and without the optimization:

TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term 
reads.
TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 
82% term reads.

In some cases, when scanning many terms in lexicographical order, the 
optimization saves a lot. In other cases, when only looking up a few sparse 
terms, the optimization brings no improvement but does not penalize either. 
It seems worthwhile to always have it.
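
The idea can be sketched with a toy dictionary that counts term reads. This is 
not the TermsDict code from the patch: a sorted in-memory list stands in for the 
delta-encoded blocks, every decoded term counts as one read, and all names 
(TermsDictScanSketch, countTermReads, ...) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class TermsDictScanSketch {
    private final List<String> terms;      // sorted, non-empty; stands in for delta-encoded blocks
    private final int blockSize;
    private final boolean reuseLastPosition;
    private int currentOrd = -1;           // ordinal of the last decoded term, -1 before any lookup
    long termReads = 0;                    // number of terms decoded so far

    TermsDictScanSketch(List<String> sortedTerms, int blockSize, boolean reuseLastPosition) {
        this.terms = sortedTerms;
        this.blockSize = blockSize;
        this.reuseLastPosition = reuseLastPosition;
    }

    /** Binary search on the first term of each block: last block whose first term <= target. */
    private int findBlock(String target) {
        int lo = 0, hi = (terms.size() - 1) / blockSize, block = 0;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (terms.get(mid * blockSize).compareTo(target) <= 0) {
                block = mid;
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return block;
    }

    /** Returns the ordinal of target, or -1 if it is absent. */
    int lookup(String target) {
        int blockStart = findBlock(target) * blockSize;
        int blockEnd = Math.min(blockStart + blockSize, terms.size());
        int ord;
        if (reuseLastPosition && currentOrd >= blockStart && currentOrd < blockEnd
                && terms.get(currentOrd).compareTo(target) <= 0) {
            ord = currentOrd;   // same block, target not behind us: continue from here
        } else {
            ord = blockStart;   // otherwise restart from the block start
            termReads++;
        }
        while (terms.get(ord).compareTo(target) < 0 && ord + 1 < blockEnd) {
            ord++;
            termReads++;        // decode the next delta-encoded term
        }
        currentOrd = ord;
        return terms.get(ord).equals(target) ? ord : -1;
    }

    /** Looks up every term in order and returns the number of term reads. */
    static long countTermReads(int numTerms, int blockSize, boolean reuse) {
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < numTerms; i++) {
            terms.add("t" + (1000 + i)); // fixed width, so already in sorted order
        }
        TermsDictScanSketch dict = new TermsDictScanSketch(terms, blockSize, reuse);
        for (String term : terms) {
            dict.lookup(term);
        }
        return dict.termReads;
    }

    public static void main(String[] args) {
        System.out.println(countTermReads(16, 8, false)); // 72
        System.out.println(countTermReads(16, 8, true));  // 16
    }
}
```

In this toy model a lookup that lands in another block, or behind the current 
position, falls back to the normal scan from the block start, so sparse lookups 
read the same number of terms with or without the reuse.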






[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-05-14 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839429#comment-16839429
 ] 

Bruno Roustant commented on LUCENE-8753:


Beyond the performance aspects, we developed UniformSplit to be extensible. To 
give an idea of how it can be extended, I have added a new PR#676: SharedTerms 
UniformSplit.

The use-case is when there are many fields. We want to take advantage of the 
FST to share the terms between all the fields, replacing one FST per field 
with a single FST containing the shared terms. In this case each term is 
stored only once in the block file, and its block line contains the TermState 
for each field in which the term occurs.

term A -> field1 TermState, field2 TermState, field3 TermState

term B -> field3 TermState, field5 TermState

The FST is compact, and this posting format also unlocks the possibility of 
caching when the same term is searched in many fields (but this is not part of 
this PR).

My goal here is to showcase the extensibility of this posting format. This 
extension is in a separate sub-package, sharedterms, and is quite concise. (The 
only tricky part is the custom merge, which merges two segments efficiently by 
directly accessing the sharedterms posting format.)
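
To make the layout concrete, here is a minimal in-memory sketch of the "one 
term, one line, several field states" idea. It is not the code of the PR: a 
TreeMap stands in for the FST and block file, and FieldTermState is a 
hypothetical placeholder holding only a document frequency, not Lucene's 
TermState.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class SharedTermsSketch {
    /** Hypothetical stand-in for the per-field term metadata. */
    static final class FieldTermState {
        final int docFreq;
        FieldTermState(int docFreq) {
            this.docFreq = docFreq;
        }
    }

    // One entry per distinct term; its "line" maps each field to that field's state.
    private final TreeMap<String, Map<String, FieldTermState>> dictionary = new TreeMap<>();

    void add(String term, String field, int docFreq) {
        dictionary.computeIfAbsent(term, t -> new LinkedHashMap<>())
                  .put(field, new FieldTermState(docFreq));
    }

    /** Returns the state of term in field, or null if the term does not occur in that field. */
    FieldTermState seek(String term, String field) {
        Map<String, FieldTermState> line = dictionary.get(term);
        return line == null ? null : line.get(field);
    }

    /** Number of terms actually stored, however many fields share them. */
    int storedTerms() {
        return dictionary.size();
    }

    public static void main(String[] args) {
        SharedTermsSketch dict = new SharedTermsSketch();
        dict.add("termA", "field1", 3);
        dict.add("termA", "field2", 7);
        dict.add("termB", "field3", 1);
        System.out.println(dict.storedTerms());                  // 2
        System.out.println(dict.seek("termA", "field2").docFreq); // 7
        System.out.println(dict.seek("termB", "field1"));        // null
    }
}
```

Searching the same term in several fields resolves the term once and then 
picks the state per field, which is what makes the caching mentioned above 
possible.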

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (The attached PDF explains the technique visually and in more detail.)
> The principle is to split the list of terms into blocks and use an FST to 
> access the blocks, not as a prefix trie but with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) within which the terms are 
> compared to select the one with the minimal distinguishing prefix.
> There are also several optimizations inside the block to make it more 
> compact and to speed up loading and scanning.
> The performance measured with the luceneutil benchmark, comparing 
> UniformSplit with BlockTree, is interesting. The results are in the first 
> comment and are also attached for better formatting.
> Although the precise percentages vary between runs, three main points stand 
> out:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are OK.
>  - Fuzzy queries are clearly slower, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15% and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree does.
> This initial version passes all Lucene tests. Use 
> "ant test -Dtests.codec=UniformSplitTesting" to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity, and we 
> have already exercised this PostingsFormat's extensibility to create a 
> different flavor for our own use case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
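
The seek-floor pattern mentioned in the description can be illustrated with a 
small sketch. This is not the UniformSplit code: a TreeMap stands in for the 
FST, and the class and method names (SeekFloorSketch, addBlock, findBlock) are 
made up for the illustration. It assumes each block is indexed by a key that 
sorts at or before every term in the block.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy illustration of a seek-floor lookup. The index maps one key per block to
// a block id; a TreeMap stands in for the FST (assumption for this sketch).
class SeekFloorSketch {
    private final TreeMap<String, Integer> blockIndex = new TreeMap<>();

    // Registers a block under the key that sorts at or before all its terms.
    void addBlock(String blockKey, int blockId) {
        blockIndex.put(blockKey, blockId);
    }

    // Returns the id of the only block that may contain the term, or -1 if the
    // term sorts before the first block key.
    int findBlock(String term) {
        Map.Entry<String, Integer> floor = blockIndex.floorEntry(term);
        return floor == null ? -1 : floor.getValue();
    }
}
```

With blocks keyed by "a" and "m", a lookup of "lucene" lands in the first 
block and "solr" in the second; the block is then scanned for the exact term.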



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-11866) Support efficient subset matching in query elevation rules

2019-07-12 Thread Bruno Roustant (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11866:
--
Attachment: (was: 
0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch)

> Support efficient subset matching in query elevation rules
> --
>
> Key: SOLR-11866
> URL: https://issues.apache.org/jira/browse/SOLR-11866
> Project: Solr
>  Issue Type: Improvement
>  Components: SearchComponents - other
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: SOLR-11866.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Leverages the SOLR-11865 refactoring by introducing a 
> SubsetMatchElevationProvider in QueryElevationComponent. This provider calls 
> a new util class TrieSubsetMatcher to efficiently match all query elevation 
> rules whose subset of terms is contained in the current query's list of terms.






[jira] [Updated] (SOLR-11866) Support efficient subset matching in query elevation rules

2019-07-12 Thread Bruno Roustant (JIRA)


 [ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruno Roustant updated SOLR-11866:
--
Attachment: (was: SOLR-11866.patch)




[jira] [Commented] (SOLR-11866) Support efficient subset matching in query elevation rules

2019-07-12 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883816#comment-16883816
 ] 

Bruno Roustant commented on SOLR-11866:
---

I have updated with PR [#780|https://github.com/apache/lucene-solr/pull/780]. 
Should I remove the obsolete patch files from this Jira issue?




[jira] [Commented] (SOLR-11866) Support efficient subset matching in query elevation rules

2019-07-12 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883819#comment-16883819
 ] 

Bruno Roustant commented on SOLR-11866:
---

Also, the documentation will need to be updated to explain support for the new 
match="subset" param in the elevation rule (in addition to match="exact").
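
To make the difference from exact matching concrete, here is a naive sketch of 
the subset semantics. It is not the TrieSubsetMatcher from the PR; it only 
assumes that a rule matches when all of its terms appear among the query terms, 
and it splits on whitespace instead of using an analyzer. The class and method 
names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Naive illustration of subset-match semantics (not the trie-based matcher).
class SubsetMatchSketch {
    // True if every term of the elevation rule appears among the query terms.
    static boolean ruleMatches(Set<String> ruleTerms, Set<String> queryTerms) {
        return queryTerms.containsAll(ruleTerms);
    }

    // Simplistic tokenization: lowercase and split on whitespace.
    static Set<String> terms(String text) {
        return new HashSet<>(Arrays.asList(text.trim().toLowerCase().split("\\s+")));
    }
}
```

A rule on "apple phone" would match the query "cheap apple phone case" but not 
"apple case". Checking every rule this way is linear in the number of rules, 
which is the cost a trie-based matcher is meant to avoid.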





[jira] [Created] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState

2019-07-09 Thread Bruno Roustant (JIRA)
Bruno Roustant created LUCENE-8906:
--

 Summary: Lucene50PostingsReader.postings() casts BlockTermState 
param to private IntBlockTermState
 Key: LUCENE-8906
 URL: https://issues.apache.org/jira/browse/LUCENE-8906
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/codecs
Reporter: Bruno Roustant


Lucene50PostingsReader is the public API that offers the postings() method to 
read the postings. Any PostingsFormat can use it (as well as 
Lucene50PostingsWriter) to read/write postings.

But the postings() method asks for a (public) BlockTermState param which is 
internally cast to the private IntBlockTermState. This BlockTermState is 
provided by Lucene50PostingsReader.newTermState().

public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, 
PostingsEnum reuse, int flags)

This actually makes it impossible for a custom PostingsFormat that customizes 
the block file structure to use this postings() method by providing its own 
(Int)BlockTermState, because it cannot access the FP fields of the 
IntBlockTermState returned by PostingsReaderBase.newTermState().

Proposed change:
 * Either make IntBlockTermState public, as well as its fields.
 * Or replace it with an interface in the postings() method. In this case the 
IntBlockTermState fields currently accessed directly would be accessed through 
getters/setters.
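
The second option could look roughly like the sketch below. Everything in it is 
hypothetical: the interface name, the choice of two file-pointer accessors, and 
the stand-in reader are illustrations of the idea, not Lucene API and not the 
change that was eventually made.

```java
// Hypothetical interface: the name and methods are illustrative, not Lucene API.
interface PostingsTermState {
    long getDocStartFP();
    void setDocStartFP(long fp);
    long getPosStartFP();
    void setPosStartFP(long fp);
}

// A custom postings format could supply its own implementation of the interface.
class CustomTermState implements PostingsTermState {
    private long docStartFP;
    private long posStartFP;

    @Override public long getDocStartFP() { return docStartFP; }
    @Override public void setDocStartFP(long fp) { docStartFP = fp; }
    @Override public long getPosStartFP() { return posStartFP; }
    @Override public void setPosStartFP(long fp) { posStartFP = fp; }
}

// Stand-in for a postings reader: it only uses the interface, so no cast to a
// private class is needed.
class PostingsReaderSketch {
    static String describe(PostingsTermState state) {
        return "doc@" + state.getDocStartFP() + " pos@" + state.getPosStartFP();
    }
}
```

The first option (making the class and its fields public) is a smaller change; 
the interface keeps the fields encapsulated at the price of accessor calls.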






[jira] [Commented] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState

2019-07-09 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881037#comment-16881037
 ] 

Bruno Roustant commented on LUCENE-8906:


This issue has been encountered in LUCENE-8753 (Uniform Split posting format).




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-07-09 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881046#comment-16881046
 ] 

Bruno Roustant commented on LUCENE-8753:


I have created a related Jira issue, LUCENE-8906 
(Lucene50PostingsReader.postings() casts BlockTermState param to private 
IntBlockTermState), to move the PR review forward.

If we find a solution for this issue, then the UniformSplit postings format will 
be fully isolated in a separate package in codecs, with no more intrusion 
elsewhere.

The goal is to have it as an additional, optional postings format (not to 
replace BlockTree) for the following use cases: customizable by extension, 
shared-terms extension available, low on-heap memory footprint, best efficiency 
when dealing with small to medium indexes.




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-08-13 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906219#comment-16906219
 ] 

Bruno Roustant commented on LUCENE-8753:


New [PR 828|https://github.com/apache/lucene-solr/pull/828] puts this 
PostingsFormat inside codecs/uniformsplit, with no code elsewhere. I added the 
package javadoc and the lucene.experimental annotation.




[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923463#comment-16923463
 ] 

Bruno Roustant commented on LUCENE-8920:


Based on some heuristic, Direct-Addressing can be the right choice, for example 
if num labels / (max label - min label) >= 75%.
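
A minimal sketch of such a density check, under stated assumptions: the class 
and method names are hypothetical, and the denominator is taken as 
(max label - min label + 1), the number of slots a direct-addressing array would 
need, which also avoids a division by zero when a node has a single label. That 
"+ 1" is an adjustment made for this sketch, not part of the comment above.

```java
// Hypothetical density check for choosing direct addressing on one FST node.
class DirectAddressingHeuristic {
    // Returns true when the labels fill at least `threshold` (e.g. 0.75) of the
    // label range [minLabel, maxLabel].
    static boolean preferDirectAddressing(int numLabels, int minLabel, int maxLabel,
                                          double threshold) {
        int slots = maxLabel - minLabel + 1; // slots a direct array would need
        double density = (double) numLabels / slots;
        return density >= threshold;
    }
}
```

For example, 20 labels spread over 'a'..'z' give a density of 20/26, about 
77%, so the check passes at 75%; 3 labels spread over 0..255 do not.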

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, i.e. lookup label -> id and then id->arc 
> offset instead of doing label->arc directly, could save a lot of space in some 
> cases? Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?






[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923428#comment-16923428
 ] 

Bruno Roustant commented on LUCENE-8920:


[~sokolov]  There may be another option to speed up FST arc lookup while 
limiting the memory increase.

The Direct-Addressing option looks up a label with a single direct access, and 
costs up to (num labels x 4 x num bytes to encode) bytes.

The Label-List option is the opposite: a lookup needs N/2 label comparisons on 
average, and it costs (num labels x var bytes to encode) bytes.

Another option is to use open-addressing. A lookup would take <= L comparisons, 
where we can fix L < log(N)/2 (to be faster than binary search), and it would 
cost < (num labels x 2 x num bytes to encode) bytes.

The idea is to have an array of size 2^p such that 2^(p-1) < N < 2^p. We hash 
the labels and store them in the array using the open-addressing idea: if a slot 
is occupied, try the next slot. If we can't store a label within L tries, then 
abort and fall back to the Label-List or Binary-Search option. At lookup we hash 
the input label and know that we need at most L tries to compare.

This is another speed/memory trade-off: faster than binary search (constant L), 
with at least 2x less memory than Direct-Addressing.

It is also possible to combine open-addressing and variable length encoding, by 
finding the first byte starting a label based on the bit used to encode the var 
length additional bytes.
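
The open-addressing idea can be sketched as follows. This is an illustration 
only, not FST code: labels are stored as plain ints rather than encoded arcs, 
labels are assumed to be non-negative and distinct, linear probing is used, and 
the hash function and all names (BoundedOpenAddressing, build, lookup, maxTries) 
are choices made for the sketch.

```java
import java.util.Arrays;

// Open addressing with a bounded number of probes (L in the text, maxTries here).
class BoundedOpenAddressing {
    static final int EMPTY = -1; // assumes labels are non-negative

    // Builds a table whose size is the smallest power of two greater than N.
    // Returns null if some label cannot be stored within maxTries probes, so
    // the caller can fall back to another encoding.
    static int[] build(int[] labels, int maxTries) {
        int size = Math.max(1, Integer.highestOneBit(labels.length) << 1);
        int[] table = new int[size];
        Arrays.fill(table, EMPTY);
        for (int label : labels) {
            if (!insert(table, label, maxTries)) {
                return null;
            }
        }
        return table;
    }

    private static boolean insert(int[] table, int label, int maxTries) {
        int mask = table.length - 1;
        int slot = hash(label) & mask;
        for (int tries = 0; tries < maxTries; tries++) {
            if (table[slot] == EMPTY) {
                table[slot] = label;
                return true;
            }
            slot = (slot + 1) & mask; // occupied: try the next slot
        }
        return false;
    }

    // Returns the slot holding the label, or -1 if absent. Never probes more
    // than maxTries slots, and stops early on an empty slot.
    static int lookup(int[] table, int label, int maxTries) {
        int mask = table.length - 1;
        int slot = hash(label) & mask;
        for (int tries = 0; tries < maxTries; tries++) {
            if (table[slot] == label) return slot;
            if (table[slot] == EMPTY) return -1;
            slot = (slot + 1) & mask;
        }
        return -1;
    }

    // Simple multiplicative mixing; any reasonable int hash would do here.
    private static int hash(int label) {
        int h = label * 0x9E3779B9;
        return h ^ (h >>> 16);
    }
}
```

In a real arc encoding the slot index would lead to the arc data; here it only 
shows that both insertion and lookup are bounded by the same probe limit.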




[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit

2019-09-04 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922609#comment-16922609
 ] 

Bruno Roustant commented on LUCENE-8753:


Ok, I followed your advice to include the "shared terms" extension (subpackage) 
in the same PR #828. I'm going to close the two previous ones.




[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-05 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923428#comment-16923428
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 9/5/19 8:46 PM:


[~sokolov]  There may be another option to speed up FST arc lookup while 
limiting the memory increase.

The Direct-Addressing option looks up a label with a single direct access, and 
costs up to (num labels x 4 x num bytes to encode) bytes.

The Label-List option is the opposite: a lookup needs N/2 label comparisons on 
average, and it costs (num labels x var bytes to encode) bytes.

Another option is to use open-addressing. A lookup would take <= L comparisons, 
where we can fix L < log(N)/2 (to be faster than binary search), and it would 
cost < (num labels x 2 x num bytes to encode) bytes.

The idea is to have an array of size 2^p such that 2^(p-1) < N < 2^p. We hash 
the labels and store them in the array using the open-addressing idea: if a slot 
is occupied, try the next slot. If we can't store a label within L tries, then 
abort and fall back to the Label-List or Binary-Search option. At lookup we hash 
the input label and know that we need at most L tries to compare.

This is another speed/memory trade-off: faster than binary search (constant L), 
with at least 2x less memory than Direct-Addressing.

On the Binary-Search side, it could be possible to support variable length 
encoding, by finding the first byte starting a label based on the bit used to 
encode the var length additional bytes.


was (Author: bruno.roustant):
[~sokolov]  There may be another option to speed up FST arc lookup while 
limiting the memory increase.

The Direct-Addressing option looks up a label with a single direct access, and 
costs up to (num labels x 4 x num bytes to encode) bytes.

The Label-List option is the opposite: a lookup needs N/2 label comparisons on 
average, and it costs (num labels x var bytes to encode) bytes.

Another option is to use open-addressing. A lookup would take <= L comparisons, 
where we can fix L < log(N)/2 (to be faster than binary search), and it would 
cost < (num labels x 2 x num bytes to encode) bytes.

The idea is to have an array of size 2^p such that 2^(p-1) < N < 2^p. We hash 
the labels and store them in the array using the open-addressing idea: if a slot 
is occupied, try the next slot. If we can't store a label within L tries, then 
abort and fall back to the Label-List or Binary-Search option. At lookup we hash 
the input label and know that we need at most L tries to compare.

This is another speed/memory trade-off: faster than binary search (constant L), 
with at least 2x less memory than Direct-Addressing.

It is also possible to combine open-addressing and variable length encoding, by 
finding the first byte starting a label based on the bit used to encode the var 
length additional bytes.




[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924178#comment-16924178
 ] 

Bruno Roustant commented on LUCENE-8920:


I'd love to work on that, but I'm pretty busy so I can't start immediately. If 
you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? 
Here in the thread or in an attached doc?

Some additional thoughts:
 * A threshold T1 needs to be found to decide when direct-addressing is best 
(N / (max label - min label) >= T1). E.g. with T1 = 50% the worst case is 
memory x2, right? (Although there is the var-length encoding difference...) Did 
you try that, and what is the perf?
 * A threshold T2 needs to be found to decide whether a list is better (N < T2) 
or whether open-addressing is more appropriate.
 * If N is close to 2^p, the probability that open-addressing aborts (can't 
store a label within L tries) is high. Do we double the array size 
(2^(p+1)) or can we take 1.5x2^p to save memory? (My intuition is the second, 
but it needs some testing of the load factor.)
 * I think the var-length List and fixed-length Binary-Search options could be 
merged to always have a var-length List that can be binary searched with low 
impact on perf. This is a piece of work in itself, but it can help reduce the 
FST memory and thus free some bytes for the faster options.
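
How the two thresholds might combine can be sketched as below. The order of the 
checks (density first, then node size), the enum, and all names are assumptions 
made for the illustration; T1 and T2 remain values to be found by benchmarking. 
As a guard against division by zero, the sketch divides by 
(max label - min label + 1).

```java
// Hypothetical chooser combining the T1 (density) and T2 (size) thresholds.
class ArcEncodingChooser {
    enum Encoding { DIRECT_ADDRESSING, LABEL_LIST, OPEN_ADDRESSING }

    static Encoding choose(int numLabels, int minLabel, int maxLabel,
                           double t1, int t2) {
        double density = (double) numLabels / (maxLabel - minLabel + 1);
        if (density >= t1) {
            return Encoding.DIRECT_ADDRESSING; // dense enough: one direct access
        }
        if (numLabels < t2) {
            return Encoding.LABEL_LIST; // few labels: scanning a list is cheap
        }
        return Encoding.OPEN_ADDRESSING; // sparse and large: bounded probing
    }
}
```

With T1 = 0.5 and T2 = 8, a node with 20 labels in 'a'..'z' would use direct 
addressing, a node with 4 labels in 0..255 a list, and a node with 40 labels in 
0..255 open addressing.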




[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-09-06 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924178#comment-16924178
 ] 

Bruno Roustant edited comment on LUCENE-8920 at 9/6/19 11:57 AM:
-

I'd love to work on that, but I'm pretty busy so I can't start immediately. If 
you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? 
Here in the thread or in an attached doc?

Some additional thoughts:
 * Find a threshold T1 to decide when direct-addressing is best (N / (max 
label - min label) >= T1). E.g. with T1 = 50% the worst case is 2x memory, right? 
(although there is the var-length encoding difference...). Did you try that, and 
what was the perf?
 * Find a threshold T2 to decide whether a list is better (N < T2) or whether 
open-addressing is more appropriate.
 * If N is close to 2^p, the probability that open-addressing aborts (can't 
store a label in fewer than L tries) is high. Do we double the array size 
(2^(p+1)) or can we take 1.5x2^p to save memory? (my intuition is the second, 
but the load factor needs some testing)
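
To make the two thresholds concrete, here is a minimal sketch of how the choice could be wired together. It is not taken from the Lucene code base: the class name, the enum, and the method signature are hypothetical, T1 and T2 are passed in because their values still have to be found by testing, and the density counts label slots as (max label - min label + 1), which is an assumption on top of the formula above.

```java
// Hypothetical sketch of the T1/T2 decision; none of these names exist in Lucene.
final class ArcEncodingHeuristic {

  enum Encoding { LIST, DIRECT_ADDRESSING, OPEN_ADDRESSING }

  /**
   * numArcs is N, the number of outgoing arcs of the node.
   * t1 is the minimum density for direct-addressing (e.g. 0.5).
   * t2 is the arc count below which a plain list is preferred.
   */
  static Encoding choose(int numArcs, int minLabel, int maxLabel, double t1, int t2) {
    if (numArcs <= 0 || maxLabel < minLabel) {
      throw new IllegalArgumentException("need at least one arc and minLabel <= maxLabel");
    }
    // Assumption: one slot per label in [minLabel, maxLabel], hence the +1.
    long slots = (long) maxLabel - minLabel + 1;
    double density = (double) numArcs / slots;
    if (density >= t1) {
      return Encoding.DIRECT_ADDRESSING; // at most 1/t1 slots per arc
    }
    if (numArcs < t2) {
      return Encoding.LIST; // few arcs: a linear scan is cheap
    }
    return Encoding.OPEN_ADDRESSING; // sparse node with many arcs
  }

  public static void main(String[] args) {
    System.out.println(choose(10, 97, 106, 0.5, 4));
    System.out.println(choose(3, 0, 255, 0.5, 4));
    System.out.println(choose(20, 0, 255, 0.5, 4));
  }
}
```

With T1 = 0.5 a direct-addressed node never has more than two slots per arc, which is where the 2x worst case comes from (ignoring the var-length encoding difference).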


was (Author: bruno.roustant):
I'd love to work on that, but I'm pretty busy so I can't start immediately. If 
you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? 
Here in the thread or in an attached doc?

Some additional thoughts:
 * Find a threshold T1 to decide when direct-addressing is best (N / (max 
label - min label) >= T1). E.g. with T1 = 50% the worst case is 2x memory, right? 
(although there is the var-length encoding difference...). Did you try that, and 
what was the perf?
 * Find a threshold T2 to decide whether a list is better (N < T2) or whether 
open-addressing is more appropriate.
 * If N is close to 2^p, the probability that open-addressing aborts (can't 
store a label in fewer than L tries) is high. Do we double the array size 
(2^(p+1)) or can we take 1.5x2^p to save memory? (my intuition is the second, 
but the load factor needs some testing)
 * I think the var-length List and fixed-length Binary-Search options could be 
merged so that we always have a var-length List that can be binary searched with 
low impact on perf. This is a task in itself, but it can help reduce the FST 
memory and thus free some bytes for the faster options.

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mike Sokolov
>Priority: Blocker
> Fix For: 8.3
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id and 
> having an intermediate lookup, ie. lookup label -> id and then id->arc offset 
> instead of doing label->arc directly could save a lot of space in some cases? 
> Also it seems that we are repeating the label in the arc metadata when 
> array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?
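
The intermediate-lookup idea quoted above (label -> id, then id -> arc offset) could look roughly like the sketch below. Everything in it is hypothetical and is not how the FST stores arcs: the class name, the short[] slot table, and the -1 gap marker are illustration only. The point it shows is that a gap then costs one small slot instead of a full arc's metadata.

```java
import java.util.Arrays;

// Hypothetical two-level lookup: label -> dense id -> arc offset.
final class TwoLevelArcLookup {
  private final int minLabel;
  private final short[] idByLabel;    // one small slot per label in range; -1 marks a gap
  private final long[] arcOffsetById; // dense: exactly one entry per existing arc

  TwoLevelArcLookup(int[] sortedLabels, long[] arcOffsets) {
    if (sortedLabels.length == 0
        || sortedLabels.length != arcOffsets.length
        || sortedLabels.length > Short.MAX_VALUE) {
      throw new IllegalArgumentException("labels and offsets must be non-empty and aligned");
    }
    minLabel = sortedLabels[0];
    int slots = sortedLabels[sortedLabels.length - 1] - minLabel + 1;
    idByLabel = new short[slots];
    Arrays.fill(idByLabel, (short) -1);
    for (int id = 0; id < sortedLabels.length; id++) {
      idByLabel[sortedLabels[id] - minLabel] = (short) id;
    }
    arcOffsetById = arcOffsets.clone();
  }

  /** Returns the arc offset for the label, or -1 if the node has no such arc. */
  long arcOffset(int label) {
    int slot = label - minLabel;
    if (slot < 0 || slot >= idByLabel.length) {
      return -1;
    }
    short id = idByLabel[slot];
    return id < 0 ? -1 : arcOffsetById[id];
  }

  public static void main(String[] args) {
    TwoLevelArcLookup lookup =
        new TwoLevelArcLookup(new int[] {97, 99, 120}, new long[] {10L, 25L, 40L});
    System.out.println(lookup.arcOffset(99));
    System.out.println(lookup.arcOffset(98));
  }
}
```

Because the label is implied by the slot position, nothing in this layout needs to repeat the label next to the arc, which matches the last remark in the quote.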






[jira] [Commented] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq

2019-07-19 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1630#comment-1630
 ] 

Bruno Roustant commented on LUCENE-8921:


PR added

> IndexSearcher.termStatistics should not require TermStates but docFreq and 
> totalTermFreq
> 
>
> Key: LUCENE-8921
> URL: https://issues.apache.org/jira/browse/LUCENE-8921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: master (9.0)
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> IndexSearcher.termStatistics(Term term, TermStates context) is the way to 
> create a TermStatistics. It requires a TermStates param although it only 
> cares about the docFreq and totalTermFreq.
>  
> For customizations that want to create TermStatistics based on docFreq and 
> totalTermFreq, but that do not have a TermStates available, this method forces 
> them to create a TermStates instance (which is not very lightweight) only to 
> pass two ints.
> termStatistics could be modified to the following signature:
> termStatistics(Term term, int docFreq, int totalTermFreq)
> Since it would change the API, it could be done in master for the next major 
> release.






[jira] [Commented] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState

2019-07-19 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1682#comment-1682
 ] 

Bruno Roustant commented on LUCENE-8906:


PR added

> Lucene50PostingsReader.postings() casts BlockTermState param to private 
> IntBlockTermState
> -
>
> Key: LUCENE-8906
> URL: https://issues.apache.org/jira/browse/LUCENE-8906
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Bruno Roustant
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lucene50PostingsReader is the public API that offers the postings() method to 
> read the postings. Any PostingsFormat can use it (as well as 
> Lucene50PostingsWriter) to read/write postings.
> But the postings() method asks for a (public) BlockTermState param which is 
> internally cast to the private IntBlockTermState. This BlockTermState is 
> provided by Lucene50PostingsReader.newTermState().
> public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, 
> PostingsEnum reuse, int flags)
> This makes it impossible for a custom PostingsFormat that customizes the 
> block file structure to use this postings() method by providing its own 
> (Int)BlockTermState, because it cannot access the FP (file pointer) fields of 
> the IntBlockTermState returned by PostingsReaderBase.newTermState().
> Proposed change:
>  * Either make IntBlockTermState public, as well as its fields.
>  * Or replace it by an interface in the postings() method. In this case the 
> IntBlockTermState fields currently accessed directly would be replaced by 
> getters/setters.
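
A rough sketch of the second option (an interface instead of the private class) is below. It is self-contained and does not use Lucene: SketchPostingsReader, PostingsTermState and CustomTermState are hypothetical names, and the three FP accessors are only my assumption of which fields a reader would need. The reader method takes the interface, so no cast to a private type is required.

```java
// Hypothetical stand-in for the reader: it depends only on the interface.
final class SketchPostingsReader {

  /** Reads what it needs through getters instead of casting to a private class. */
  static String describeStart(PostingsTermState state) {
    return "doc=" + state.getDocStartFP()
        + " pos=" + state.getPosStartFP()
        + " pay=" + state.getPayStartFP();
  }

  public static void main(String[] args) {
    CustomTermState state = new CustomTermState();
    state.setDocStartFP(128L);
    state.setPosStartFP(512L);
    state.setPayStartFP(1024L);
    System.out.println(describeStart(state));
  }
}

// Hypothetical interface exposing the file pointers (assumed set of fields).
interface PostingsTermState {
  long getDocStartFP();
  void setDocStartFP(long fp);
  long getPosStartFP();
  void setPosStartFP(long fp);
  long getPayStartFP();
  void setPayStartFP(long fp);
}

// A custom format can now supply its own state object.
final class CustomTermState implements PostingsTermState {
  private long docStartFP;
  private long posStartFP;
  private long payStartFP;

  @Override public long getDocStartFP() { return docStartFP; }
  @Override public void setDocStartFP(long fp) { docStartFP = fp; }
  @Override public long getPosStartFP() { return posStartFP; }
  @Override public void setPosStartFP(long fp) { posStartFP = fp; }
  @Override public long getPayStartFP() { return payStartFP; }
  @Override public void setPayStartFP(long fp) { payStartFP = fp; }
}
```

The first option (making IntBlockTermState and its fields public) is the smaller change; the interface costs getter calls but keeps the field layout private.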






[jira] [Created] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq

2019-07-17 Thread Bruno Roustant (JIRA)
Bruno Roustant created LUCENE-8921:
--

 Summary: IndexSearcher.termStatistics should not require 
TermStates but docFreq and totalTermFreq
 Key: LUCENE-8921
 URL: https://issues.apache.org/jira/browse/LUCENE-8921
 Project: Lucene - Core
  Issue Type: Improvement
  Components: core/search
Affects Versions: 8.1
Reporter: Bruno Roustant
 Fix For: master (9.0)


IndexSearcher.termStatistics(Term term, TermStates context) is the way to 
create a TermStatistics. It requires a TermStates param although it only cares 
about the docFreq and totalTermFreq.

 

For customizations that want to create TermStatistics based on docFreq and 
totalTermFreq, but that do not have a TermStates available, this method forces 
them to create a TermStates instance (which is not very lightweight) only to 
pass two ints.

termStatistics could be modified to the following signature:

termStatistics(Term term, int docFreq, int totalTermFreq)

Since it would change the API, it could be done in master for the next major 
release.
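
To illustrate the caller's side of the proposed signature, here is a minimal self-contained sketch. It deliberately does not use Lucene classes: SketchSearcher and SketchTermStatistics are hypothetical stand-ins, the term is a plain String, and the validation rules (docFreq > 0, totalTermFreq >= docFreq) are assumptions made for the example.

```java
// Hypothetical stand-in for the searcher; only the method shape matters here.
final class SketchSearcher {

  /** Proposed shape: the two frequencies are passed directly, no TermStates needed. */
  static SketchTermStatistics termStatistics(String term, int docFreq, int totalTermFreq) {
    if (docFreq <= 0) {
      throw new IllegalArgumentException("docFreq must be positive, got " + docFreq);
    }
    if (totalTermFreq < docFreq) {
      throw new IllegalArgumentException("totalTermFreq must be at least docFreq");
    }
    return new SketchTermStatistics(term, docFreq, totalTermFreq);
  }

  public static void main(String[] args) {
    SketchTermStatistics stats = termStatistics("lucene", 3, 7);
    System.out.println(stats.term + " docFreq=" + stats.docFreq
        + " totalTermFreq=" + stats.totalTermFreq);
  }
}

// Hypothetical value holder standing in for TermStatistics.
final class SketchTermStatistics {
  final String term;
  final int docFreq;       // number of documents containing the term
  final int totalTermFreq; // total occurrences of the term across documents

  SketchTermStatistics(String term, int docFreq, int totalTermFreq) {
    this.term = term;
    this.docFreq = docFreq;
    this.totalTermFreq = totalTermFreq;
  }
}
```

A customization that already knows both numbers can then build the statistics in a single call. Since totalTermFreq is summed over all documents, a long may be a safer parameter type than the int shown in the proposal.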






[jira] [Commented] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq

2019-07-17 Thread Bruno Roustant (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886889#comment-16886889
 ] 

Bruno Roustant commented on LUCENE-8921:


Yes, sure. I could work on a PR for 8.2.

> IndexSearcher.termStatistics should not require TermStates but docFreq and 
> totalTermFreq
> 
>
> Key: LUCENE-8921
> URL: https://issues.apache.org/jira/browse/LUCENE-8921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: master (9.0)
>
>
> IndexSearcher.termStatistics(Term term, TermStates context) is the way to 
> create a TermStatistics. It requires a TermStates param although it only 
> cares about the docFreq and totalTermFreq.
>  
> For customizations that want to create TermStatistics based on docFreq and 
> totalTermFreq, but that do not have a TermStates available, this method forces 
> them to create a TermStates instance (which is not very lightweight) only to 
> pass two ints.
> termStatistics could be modified to the following signature:
> termStatistics(Term term, int docFreq, int totalTermFreq)
> Since it would change the API, it could be done in master for the next major 
> release.


