[jira] [Created] (SOLR-6613) TextField.analyzeMultiTerm should not throw exception when analyzer returns no term
Bruno Roustant created SOLR-6613:
---------------------------------

Summary: TextField.analyzeMultiTerm should not throw exception when analyzer returns no term
Key: SOLR-6613
URL: https://issues.apache.org/jira/browse/SOLR-6613
Project: Solr
Issue Type: Bug
Components: Schema and Analysis
Affects Versions: 4.3.1, 4.10.2, Trunk
Reporter: Bruno Roustant

In TextField.analyzeMultiTerm(), at the line

    try { if (!source.incrementToken()) throw new SolrException();

the method should not throw an exception when there is no token. Having no token is legitimate, because all tokens may be filtered out (e.g. by a blocking filter such as StopFilter). In this case it should simply return null (as it already does in some cases; see the first line of the method).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
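The behavior requested in this report can be illustrated with a minimal, self-contained sketch. It does not use the real Lucene/Solr classes: `TokenSource`, `stopFiltered` and the `MultiTermSketch` class are hypothetical stand-ins for a token stream behind a StopFilter, and the "too many terms" check is an assumption added only to make the example complete.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.Set;

// Hypothetical stand-ins; these are NOT the real Lucene/Solr classes.
final class MultiTermSketch {

  /** Minimal model of a token stream: next() returns the next term, or null when exhausted. */
  interface TokenSource {
    String next();
  }

  /** Builds a source that lower-cases whitespace-separated tokens and drops stop words. */
  static TokenSource stopFiltered(String text, Set<String> stopWords) {
    Iterator<String> it = Arrays.asList(text.trim().split("\\s+")).iterator();
    return () -> {
      while (it.hasNext()) {
        String t = it.next().toLowerCase();
        if (!t.isEmpty() && !stopWords.contains(t)) {
          return t;
        }
      }
      return null; // nothing left after filtering
    };
  }

  /** Proposed behavior: return null when analysis yields no term instead of throwing. */
  static String analyzeMultiTerm(String part, TokenSource source) {
    if (part == null) {
      return null; // models the existing early return mentioned in the report
    }
    String first = source.next();
    if (first == null) {
      return null; // all tokens were filtered out: legitimate, not an error
    }
    if (source.next() != null) {
      // Assumption for this sketch: a multi-term part must analyze to a single term.
      throw new IllegalArgumentException("analyzer returned too many terms for: " + part);
    }
    return first;
  }
}
```

With this shape, the caller (SolrQueryParserBase in the report) has to treat a null result as "no term" rather than assume a non-null value.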
[jira] [Updated] (SOLR-6613) TextField.analyzeMultiTerm should not throw exception when analyzer returns no term
[ https://issues.apache.org/jira/browse/SOLR-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-6613: - Attachment: TestTextField.java TextField.analyzeMultiTerm should not throw exception when analyzer returns no term --- Key: SOLR-6613 URL: https://issues.apache.org/jira/browse/SOLR-6613 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 4.3.1, 4.10.2, Trunk Reporter: Bruno Roustant Attachments: TestTextField.java In TextField.analyzeMultiTerm() at line try { if (!source.incrementToken()) throw new SolrException(); The method should not throw an exception if there is no token because having no token is legitimate because all tokens may be filtered out (e.g. with a blocking Filter such as StopFilter). In this case it should simply return null (as it already returns null in some cases, see first line of method). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-6613) TextField.analyzeMultiTerm should not throw exception when analyzer returns no term
[ https://issues.apache.org/jira/browse/SOLR-6613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-6613: - Description: In TextField.analyzeMultiTerm() at line try { if (!source.incrementToken()) throw new SolrException(); The method should not throw an exception if there is no token because having no token is legitimate because all tokens may be filtered out (e.g. with a blocking Filter such as StopFilter). In this case it should simply return null (as it already returns null in some cases, see first line of method). However, SolrQueryParserBase needs also to be fixed to correctly handle null returned by TextField.analyzeMultiTerm(). See attached TestTextField for the corresponding new test class. was: In TextField.analyzeMultiTerm() at line try { if (!source.incrementToken()) throw new SolrException(); The method should not throw an exception if there is no token because having no token is legitimate because all tokens may be filtered out (e.g. with a blocking Filter such as StopFilter). In this case it should simply return null (as it already returns null in some cases, see first line of method). TextField.analyzeMultiTerm should not throw exception when analyzer returns no term --- Key: SOLR-6613 URL: https://issues.apache.org/jira/browse/SOLR-6613 Project: Solr Issue Type: Bug Components: Schema and Analysis Affects Versions: 4.3.1, 4.10.2, Trunk Reporter: Bruno Roustant Attachments: TestTextField.java In TextField.analyzeMultiTerm() at line try { if (!source.incrementToken()) throw new SolrException(); The method should not throw an exception if there is no token because having no token is legitimate because all tokens may be filtered out (e.g. with a blocking Filter such as StopFilter). In this case it should simply return null (as it already returns null in some cases, see first line of method). 
However, SolrQueryParserBase also needs to be fixed to correctly handle the null returned by TextField.analyzeMultiTerm(). See the attached TestTextField for the corresponding new test class.
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465767#comment-16465767 ]

Bruno Roustant commented on LUCENE-8292:
----------------------------------------

I just realized that the current no-default-override behavior is actually enforced by a test, TestFilterLeafReader.testOverrideMethods. I still think all methods should be overridden, but I understand that this may not be the expected behavior currently.

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> ----------------------------------------------------------------------
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 7.2.1
> Reporter: Bruno Roustant
> Priority: Major
> Fix For: trunk
>
> Attachments: 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, LUCENE-8292.patch
>
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many methods.
> It misses some seekExact() methods, thus it is not possible for the delegate to override these methods to have specific behavior (unlike the TermsEnum API which allows that).
> The fix is straightforward: simply override these seekExact() methods and delegate.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465887#comment-16465887 ] Bruno Roustant commented on LUCENE-8292: [~dsmiley], if I create a subclass of FilterTermsEnum to override seekExact, how can I make other classes in Lucene create this subclass instead of FilterTermsEnum? Would I have to also override other classes or other factories? > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456470#comment-16456470 ]

Bruno Roustant commented on SOLR-11865:
---------------------------------------

Actually the TrieSubsetMatcher introduced by the next patch does not support keepElevationPriority. If keepElevationPriority=true, this matcher is replaced by another, which keeps the order but which is less efficient. And this is done at component initialization time, in the inform() method (in loadElevationProvider()).
So I think it cannot be a query param because it is fixed in the data structure at initialization time.

> Refactor QueryElevationComponent to prepare query subset matching
> -----------------------------------------------------------------
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: SearchComponents - other
> Affects Versions: master (8.0)
> Reporter: Bruno Roustant
> Priority: Minor
> Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 0002-Refactor-QueryElevationComponent-after-review.patch, 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, SOLR-11865.patch
>
>
> The goal is to prepare a second improvement to support query terms subset matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it extendible. We introduce the ElevationProvider interface which will be implemented later in a second patch to support subset matching. The current full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.
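The issue description quoted above introduces an ElevationProvider abstraction with a default map-based, full-query match policy. A minimal sketch of that idea follows. The method name `getElevationForQuery`, the `addRule` helper, the use of plain String document ids and the trim/lower-case normalization are all assumptions made for illustration; the actual patch may differ.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and signatures are illustrative, not the patch's API.
final class ElevationProviderSketch {

  /** Maps a query string to the document ids that should be elevated for it. */
  interface ElevationProvider {
    /** Returns the elevated document ids for the query, or null if no rule applies. */
    List<String> getElevationForQuery(String query);
  }

  /** Full-query match policy: a rule applies only if the whole normalized query equals its key. */
  static final class MapElevationProvider implements ElevationProvider {
    private final Map<String, List<String>> rules = new HashMap<>();

    void addRule(String query, String... docIds) {
      rules.put(normalize(query), Arrays.asList(docIds));
    }

    @Override
    public List<String> getElevationForQuery(String query) {
      return rules.get(normalize(query));
    }

    // Assumed normalization so that " Red Shoes " and "red shoes" hit the same rule.
    private static String normalize(String q) {
      return q.trim().toLowerCase();
    }
  }
}
```

Keeping the lookup behind an interface is what lets a later provider swap the exact-match map for a subset-matching structure without touching the component that consumes the result.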
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475579#comment-16475579 ]

Bruno Roustant commented on LUCENE-8292:
----------------------------------------

Actually, there is another related issue with this FilterLeafReader#FilterTermsEnum delegate pattern: it delegates neither termState() nor seekExact(BytesRef, TermState). This means the termState is never used, so term queries repeat the same seek (seekCeil) twice instead of using the termState to improve performance (normally the termState is kept by TermContext#build()).

Practical example: when one configures a timeout for queries, an ExitableDirectoryReader is created internally. Its ExitableTermsEnum, which extends FilterTermsEnum, makes all term queries repeat the same seekCeil() twice.
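The double seek described in this comment can be reproduced with a small model. This is only a sketch: `SimpleTermsEnum`, `State`, `PartialFilter` and `FullFilter` are hypothetical classes that mimic the shape of the problem with Strings and an array index, and the model counts exact seeks rather than seekCeil() calls. It is not the real Lucene API.

```java
import java.util.Arrays;

// Hypothetical, simplified model of the term-state shortcut; not the real Lucene API.
final class TermStateSketch {

  /** Opaque saved position; here just an index into the term array (-1 means "no state"). */
  static final class State {
    final int ord;
    State(int ord) { this.ord = ord; }
  }

  static abstract class SimpleTermsEnum {
    abstract boolean seekExact(String term);

    /** Default: no reusable state. */
    State termState() { return new State(-1); }

    /** Default: ignores the state and seeks again by term. */
    void seekExact(String term, State state) {
      if (!seekExact(term)) {
        throw new IllegalArgumentException("term not found: " + term);
      }
    }
  }

  /** Enum over a sorted array that can reposition from a saved state without searching. */
  static final class ArrayTermsEnum extends SimpleTermsEnum {
    private final String[] sortedTerms;
    private int current = -1;
    int lookups = 0; // number of real (binary search) seeks performed

    ArrayTermsEnum(String... sortedTerms) { this.sortedTerms = sortedTerms; }

    @Override boolean seekExact(String term) {
      lookups++;
      current = Arrays.binarySearch(sortedTerms, term);
      return current >= 0;
    }

    @Override State termState() { return new State(current); }

    @Override void seekExact(String term, State state) { current = state.ord; } // no lookup
  }

  /** Filter that forwards only seekExact(term): the state-based methods fall back to defaults. */
  static class PartialFilter extends SimpleTermsEnum {
    final SimpleTermsEnum in;
    PartialFilter(SimpleTermsEnum in) { this.in = in; }
    @Override boolean seekExact(String term) { return in.seekExact(term); }
  }

  /** Filter that also forwards the state-based methods. */
  static final class FullFilter extends PartialFilter {
    FullFilter(SimpleTermsEnum in) { super(in); }
    @Override State termState() { return in.termState(); }
    @Override void seekExact(String term, State state) { in.seekExact(term, state); }
  }

  /** Mimics a term query: find the term once, then reposition using the saved state. */
  static void seekThenReuse(SimpleTermsEnum te, String term) {
    if (te.seekExact(term)) {
      State saved = te.termState();
      te.seekExact(term, saved);
    }
  }
}
```

In this model, running `seekThenReuse` through `PartialFilter` costs two lookups on the wrapped enum, while `FullFilter` costs one, which is the saving the comment attributes to the term state.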
[jira] [Comment Edited] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475579#comment-16475579 ] Bruno Roustant edited comment on LUCENE-8292 at 5/15/18 9:57 AM: - Actually there is also another related issue with this FilterLeafReader#FilterTermsEnum delegate pattern. It does not delegate termState() nor seekExact(ByteRef, TermState) methods. Which means the termState is never used, so the term queries repeat twice the same seek (seekCeil) instead of using the termState to improve performance (normally the termState is kept by TermContext#build()). Practical example: When one configures a timeout for queries, internally an ExitableDirectoryReader is created. And its ExitableTermsEnum, which extends FilterTermsEnum, makes all term queries repeat twice the same seekCeil(). was (Author: bruno.roustant): Actually there is also another related issue with this FilterLeafReader#FilterTermsEnum delegate pattern. It does not delegate termState() nor seekExact(ByteRef, TermState) methods. Which means the termState is never used, so the term queries repeat twice the same seek (seekCeil) instead of using the termState to improve performance (normally the termState is kept by TermContext#build()). Practical example: When one configures a timeout for queries, internally a ExitableDirectoryReader is created. And its ExitableTermsEnum, which extends FilterTermsEnum, makes all term queries repeat twice the same seekCeil(). > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. 
> It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476032#comment-16476032 ] Bruno Roustant commented on SOLR-11865: --- Great! I agree with all your points [~dsmiley]. Indeed the String IDs in Elevation would be clearer as BytesRefs. And I vote to apply the key String => indexed form as early as possible, if the code remains small. > Refactor QueryElevationComponent to prepare query subset matching > - > > Key: SOLR-11865 > URL: https://issues.apache.org/jira/browse/SOLR-11865 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: master (8.0) >Reporter: Bruno Roustant >Priority: Minor > Labels: QueryComponent > Fix For: master (8.0) > > Attachments: > 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, > 0002-Refactor-QueryElevationComponent-after-review.patch, > 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, > SOLR-11865.patch > > > The goal is to prepare a second improvement to support query terms subset > matching or query elevation rules. > Before that, we need to refactor the QueryElevationComponent. We make it > extendible. We introduce the ElevationProvider interface which will be > implemented later in a second patch to support subset matching. The current > full-query match policy becomes a default simple MapElevationProvider. > - Add overridable methods to handle exceptions during the component > initialization. > - Add overridable methods to provide the default values for config properties. > - No functional change beyond refactoring. > - Adapt unit test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Closed] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant closed SOLR-11865. - Work done > Refactor QueryElevationComponent to prepare query subset matching > - > > Key: SOLR-11865 > URL: https://issues.apache.org/jira/browse/SOLR-11865 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: master (8.0) >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Minor > Labels: QueryComponent > Fix For: 7.5 > > Attachments: > 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, > 0002-Refactor-QueryElevationComponent-after-review.patch, > 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, > SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > The goal is to prepare a second improvement to support query terms subset > matching or query elevation rules. > Before that, we need to refactor the QueryElevationComponent. We make it > extendible. We introduce the ElevationProvider interface which will be > implemented later in a second patch to support subset matching. The current > full-query match policy becomes a default simple MapElevationProvider. > - Add overridable methods to handle exceptions during the component > initialization. > - Add overridable methods to provide the default values for config properties. > - No functional change beyond refactoring. > - Adapt unit test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517247#comment-16517247 ] Bruno Roustant commented on SOLR-11865: --- Thanks for your incredible help [~dsmiley]! Closing this PR. > Refactor QueryElevationComponent to prepare query subset matching > - > > Key: SOLR-11865 > URL: https://issues.apache.org/jira/browse/SOLR-11865 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: master (8.0) >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Minor > Labels: QueryComponent > Fix For: 7.5 > > Attachments: > 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, > 0002-Refactor-QueryElevationComponent-after-review.patch, > 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, > SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch, SOLR-11865.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > The goal is to prepare a second improvement to support query terms subset > matching or query elevation rules. > Before that, we need to refactor the QueryElevationComponent. We make it > extendible. We introduce the ElevationProvider interface which will be > implemented later in a second patch to support subset matching. The current > full-query match policy becomes a default simple MapElevationProvider. > - Add overridable methods to handle exceptions during the component > initialization. > - Add overridable methods to provide the default values for config properties. > - No functional change beyond refactoring. > - Adapt unit test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16496788#comment-16496788 ]

Bruno Roustant commented on SOLR-11865:
---------------------------------------

You're right, MapElevationProvider.buildElevationMap should merge in this case (which indeed should not happen, since they have been merged earlier).

I have created the GitHub PR (https://github.com/apache/lucene-solr/pull/390), to be enhanced with all your improvements.

> Refactor QueryElevationComponent to prepare query subset matching
> -----------------------------------------------------------------
>
> Key: SOLR-11865
> URL: https://issues.apache.org/jira/browse/SOLR-11865
> Project: Solr
> Issue Type: Improvement
> Security Level: Public (Default Security Level. Issues are Public)
> Components: SearchComponents - other
> Affects Versions: master (8.0)
> Reporter: Bruno Roustant
> Priority: Minor
> Labels: QueryComponent
> Fix For: master (8.0)
>
> Attachments: 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 0002-Refactor-QueryElevationComponent-after-review.patch, 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, SOLR-11865.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The goal is to prepare a second improvement to support query terms subset matching or query elevation rules.
> Before that, we need to refactor the QueryElevationComponent. We make it extendible. We introduce the ElevationProvider interface which will be implemented later in a second patch to support subset matching. The current full-query match policy becomes a default simple MapElevationProvider.
> - Add overridable methods to handle exceptions during the component initialization.
> - Add overridable methods to provide the default values for config properties.
> - No functional change beyond refactoring.
> - Adapt unit test.
[jira] [Comment Edited] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462394#comment-16462394 ] Bruno Roustant edited comment on LUCENE-8292 at 5/3/18 1:03 PM: When looking at TermsEnum API, what I understand is that seekExact() defaults to calling seekCeil(), but if needed (not for correctness but for performance consideration) we can override it to have a specialized seek that searches only the exact term and does not have to position to the next term if not found. This may have an impact for some TermsEnum extensions (a really noticeable impact in my case, that's why I noticed this issue). To me the current behavior of FilterTermsEnum is not correct with regard to TermsEnum API. (And I noticed that AssertingLeafReader overrides seekExact()). Adding these two methods in FilterTermsEnum fixes correctness, even if I agree it makes more room for bugs. was (Author: bruno.roustant): When looking at TermsEnum API, what I understand is that seekExact() defaults to calling seekCeil(), but if needed (not for correctness but for performance consideration) we can override it to have a specialized seek that searches only the exact term and does not have to position to the next term if not found. This may have an impact for some TermsEnum extensions (a really noticeable impact in my case, that's why I noticed this issue). To me the current behavior of FilterTermsEnum is not correct with regard to TermsEnum API. (And I noticed that AssertingLeafReader overrides seekExact()). Adding this two methods in FilterTermsEnum fixes correctness, even if I agree it makes more room for bugs. 
> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462368#comment-16462368 ]

Bruno Roustant commented on LUCENE-8292:
----------------------------------------

1- "Not possible to override": I was not clear. It is still possible for a delegate TermsEnum to override the seekExact() method, but the override will never be called, since the FilterTermsEnum above it always calls seekCeil().

2- "Two more methods to override": You're right, although the same code should normally be reusable, so it should not be tedious. I see the trappy point.
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462394#comment-16462394 ] Bruno Roustant commented on LUCENE-8292: When looking at TermsEnum API, what I understand is that seekExact() defaults to calling seekCeil(), but if needed (not for correctness but for performance consideration) we can override it to have a specialized seek that searches only the exact term and does not have to position to the next term if not found. This may have an impact for some TermsEnum extensions (a really noticeable impact in my case, that's why I noticed this issue). To me the current behavior of FilterTermsEnum is not correct with regard to TermsEnum API. (And I noticed that AssertingLeafReader overrides seekExact()). Adding this two methods in FilterTermsEnum fixes correctness, even if I agree it makes more room for bugs. > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
Bruno Roustant created LUCENE-8292:
-----------------------------------

Summary: Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
Key: LUCENE-8292
URL: https://issues.apache.org/jira/browse/LUCENE-8292
Project: Lucene - Core
Issue Type: Bug
Components: core/index
Affects Versions: 7.2.1
Reporter: Bruno Roustant
Fix For: trunk

FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many methods.
It misses some seekExact() methods, thus it is not possible for the delegate to override these methods to have specific behavior (unlike the TermsEnum API which allows that).
The fix is straightforward: simply override these seekExact() methods and delegate.
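The delegation gap reported here can be shown with a reduced model. The classes below (`SimpleTermsEnum`, `ArrayTermsEnum`, `PartialFilter`, `FullFilter`) are hypothetical and use Strings instead of BytesRef; the only property taken from the thread is that seekExact() defaults to calling seekCeil(). This is a sketch of the pattern, not the attached patch.

```java
import java.util.Arrays;

// Hypothetical, simplified model; not the real org.apache.lucene.index classes.
final class SeekDelegationSketch {

  /** Simplified TermsEnum: seekExact() defaults to seekCeil(), as described in the thread. */
  static abstract class SimpleTermsEnum {
    /** Returns the smallest term greater than or equal to target, or null if there is none. */
    abstract String seekCeil(String target);

    boolean seekExact(String target) {
      return target.equals(seekCeil(target));
    }
  }

  /** Enum over a sorted array with a specialized seekExact(); call counters are for the demo. */
  static final class ArrayTermsEnum extends SimpleTermsEnum {
    private final String[] sortedTerms;
    int exactCalls = 0;
    int ceilCalls = 0;

    ArrayTermsEnum(String... sortedTerms) { this.sortedTerms = sortedTerms; }

    @Override String seekCeil(String target) {
      ceilCalls++;
      int i = Arrays.binarySearch(sortedTerms, target);
      if (i < 0) {
        i = -i - 1; // insertion point: index of the next greater term
      }
      return i < sortedTerms.length ? sortedTerms[i] : null;
    }

    @Override boolean seekExact(String target) {
      exactCalls++;
      return Arrays.binarySearch(sortedTerms, target) >= 0;
    }
  }

  /** Filter that forwards only seekCeil(): the delegate's seekExact() is never reached. */
  static class PartialFilter extends SimpleTermsEnum {
    final SimpleTermsEnum in;
    PartialFilter(SimpleTermsEnum in) { this.in = in; }
    @Override String seekCeil(String target) { return in.seekCeil(target); }
  }

  /** Filter that also forwards seekExact(), which is the fix the issue proposes. */
  static final class FullFilter extends PartialFilter {
    FullFilter(SimpleTermsEnum in) { super(in); }
    @Override boolean seekExact(String target) { return in.seekExact(target); }
  }
}
```

Both filters return the same answer for an exact seek; the difference is only which method of the wrapped enum does the work, so in this model the gap shows up in the call counters rather than in the results.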
[jira] [Updated] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-8292: --- Attachment: LUCENE-8292.patch 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462409#comment-16462409 ] Bruno Roustant edited comment on LUCENE-8292 at 5/3/18 1:08 PM: Another option would be to modify the TermsEnum.seekExact() method and make it final, or have the javadoc be explicit that it should not be overridden. (though I don't like this option) was (Author: bruno.roustant): Another option would be to modify the TermsEnum.seekExact() method and make it final, or have the javadoc be explicit that it should not be overridden. > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible to the delegate > to override these methods to have specific behavior (unlike the TermsEnum API > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462409#comment-16462409 ]

Bruno Roustant commented on LUCENE-8292:

Another option would be to modify the TermsEnum.seekExact() method and make it final, or have the javadoc be explicit that it should not be overridden.
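The delegation gap described in LUCENE-8292 can be illustrated with a small self-contained sketch. It does not use Lucene: MiniTermsEnum, SetBackedEnum, PartialFilter and FullFilter are hypothetical stand-ins, and the only assumption carried over from the issue is that seekExact() has a default implementation that a subclass may override. A filter that forwards only seekCeil() silently bypasses the delegate's specialized seekExact(); a filter that also forwards seekExact() does not.

```java
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

class SeekExactDelegationSketch {

  /** Hypothetical stand-in for a terms enumeration; this is not Lucene's TermsEnum. */
  abstract static class MiniTermsEnum {
    /** Returns the smallest term greater than or equal to the target, or null if none. */
    abstract String seekCeil(String target);

    /** Default implementation expressed through seekCeil(); subclasses may override it. */
    boolean seekExact(String target) {
      return target.equals(seekCeil(target));
    }
  }

  /** A delegate with its own specialized seekExact(); the counter records how often it runs. */
  static class SetBackedEnum extends MiniTermsEnum {
    private final TreeSet<String> terms;
    int specializedSeekExactCalls = 0;

    SetBackedEnum(Collection<String> terms) {
      this.terms = new TreeSet<>(terms);
    }

    @Override
    String seekCeil(String target) {
      return terms.ceiling(target);
    }

    @Override
    boolean seekExact(String target) {
      specializedSeekExactCalls++;
      return terms.contains(target);
    }
  }

  /** Forwards seekCeil() only, so seekExact() falls back to the base-class default. */
  static class PartialFilter extends MiniTermsEnum {
    protected final MiniTermsEnum in;

    PartialFilter(MiniTermsEnum in) {
      this.in = in;
    }

    @Override
    String seekCeil(String target) {
      return in.seekCeil(target);
    }
  }

  /** Also forwards seekExact(), so the delegate's override is honored. */
  static class FullFilter extends PartialFilter {
    FullFilter(MiniTermsEnum in) {
      super(in);
    }

    @Override
    boolean seekExact(String target) {
      return in.seekExact(target);
    }
  }

  public static void main(String[] args) {
    SetBackedEnum delegate = new SetBackedEnum(List.of("apple", "banana"));
    new PartialFilter(delegate).seekExact("apple");
    System.out.println(delegate.specializedSeekExactCalls); // prints 0: override bypassed
    new FullFilter(delegate).seekExact("apple");
    System.out.println(delegate.specializedSeekExactCalls); // prints 1: override reached
  }
}
```

Both filters return the same answer here; the difference is only in which implementation runs, which matters when the delegate's seekExact() is faster or has side effects.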
[jira] [Created] (SOLR-11866) Support efficient subset matching in query elevation rules
Bruno Roustant created SOLR-11866:
-------------------------------------

             Summary: Support efficient subset matching in query elevation rules
                 Key: SOLR-11866
                 URL: https://issues.apache.org/jira/browse/SOLR-11866
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SearchComponents - other
    Affects Versions: master (8.0)
            Reporter: Bruno Roustant

Leverages the SOLR-11865 refactoring by introducing a SubsetMatchElevationProvider in QueryElevationComponent. This provider calls a new utility class, TrieSubsetMatcher, to efficiently match all query elevation rules whose set of terms is contained in the current query's list of terms.
[jira] [Created] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
Bruno Roustant created SOLR-11865:
-------------------------------------

             Summary: Refactor QueryElevationComponent to prepare query subset matching
                 Key: SOLR-11865
                 URL: https://issues.apache.org/jira/browse/SOLR-11865
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: SearchComponents - other
    Affects Versions: master (8.0)
            Reporter: Bruno Roustant
             Fix For: master (8.0)

The goal is to prepare a second improvement that supports subset matching of query terms in query elevation rules.
Before that, we need to refactor the QueryElevationComponent. We make it extensible. We introduce the ElevationProvider interface, which will be implemented later in a second patch to support subset matching. The current full-query match policy becomes a simple default MapElevationProvider.
- Add overridable methods to handle exceptions during the component initialization.
- Add overridable methods to provide the default values for config properties.
- No functional change beyond refactoring.
- Adapt unit test.
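To make the refactoring described above more concrete, here is a minimal sketch of a provider abstraction with a map-backed default. Only the names ElevationProvider and MapElevationProvider come from the issue; the method signatures, the String-based rule representation and the normalize() helper are assumptions made for illustration and are not the signatures used in the actual patch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ElevationProviderSketch {

  /** Hypothetical shape of the provider: maps a query string to elevated document ids. */
  interface ElevationProvider {
    /** Returns the elevated ids for the query, or an empty list when no rule matches. */
    List<String> getElevationForQuery(String query);

    /** Number of elevation rules held by this provider. */
    int size();
  }

  /** Full-query match policy: a rule applies only when the whole query matches its key. */
  static final class MapElevationProvider implements ElevationProvider {
    private final Map<String, List<String>> elevationsByQuery;

    MapElevationProvider(Map<String, List<String>> rules) {
      Map<String, List<String>> copy = new HashMap<>();
      rules.forEach((query, ids) -> copy.put(normalize(query), List.copyOf(ids)));
      // Defensive copy plus unmodifiable view keep the provider immutable after construction.
      this.elevationsByQuery = Collections.unmodifiableMap(copy);
    }

    /** Assumed normalization (trim + lower case); the real component analyzes the query text. */
    static String normalize(String query) {
      return query.trim().toLowerCase(Locale.ROOT);
    }

    @Override
    public List<String> getElevationForQuery(String query) {
      return elevationsByQuery.getOrDefault(normalize(query), List.of());
    }

    @Override
    public int size() {
      return elevationsByQuery.size();
    }
  }

  public static void main(String[] args) {
    ElevationProvider provider =
        new MapElevationProvider(Map.of("ipod", List.of("doc1", "doc2")));
    System.out.println(provider.getElevationForQuery(" IPod "));    // prints [doc1, doc2]
    System.out.println(provider.getElevationForQuery("ipod nano")); // prints []
  }
}
```

Keeping the lookup behind an interface is what lets a later provider swap the exact-match map for a subset-matching structure without touching the component that consumes the result.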
[jira] [Updated] (SOLR-11866) Support efficient subset matching in query elevation rules
[ https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant updated SOLR-11866:
----------------------------------
    Attachment: SOLR-11866.patch
                0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch

> Support efficient subset matching in query elevation rules
> ----------------------------------------------------------
>
>                 Key: SOLR-11866
>                 URL: https://issues.apache.org/jira/browse/SOLR-11866
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SearchComponents - other
>    Affects Versions: master (8.0)
>            Reporter: Bruno Roustant
>            Priority: Major
>         Attachments: 0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch, SOLR-11866.patch
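The idea of matching every rule whose terms form a subset of the query terms can be sketched with a trie keyed on sorted terms. This is a guess at the general technique, not the TrieSubsetMatcher from the attached patch: the class name TrieSubsetMatcherSketch, its methods and the use of plain strings as terms are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class TrieSubsetMatcherSketch<M> {

  /** One trie node per term on a sorted path; matches are stored where a rule's path ends. */
  private static final class Node<M> {
    final Map<String, Node<M>> children = new HashMap<>();
    final List<M> matches = new ArrayList<>();
  }

  private final Node<M> root = new Node<>();

  /** Registers a rule: its terms are sorted and deduplicated, then inserted as one trie path. */
  void addSubset(Collection<String> terms, M match) {
    Node<M> node = root;
    for (String term : new TreeSet<>(terms)) {
      node = node.children.computeIfAbsent(term, t -> new Node<>());
    }
    node.matches.add(match);
  }

  /** Returns the matches of all rules whose terms are all present in the query terms. */
  List<M> findSubsetsMatching(Collection<String> queryTerms) {
    List<String> sorted = new ArrayList<>(new TreeSet<>(queryTerms));
    List<M> result = new ArrayList<>();
    collect(root, sorted, 0, result);
    return result;
  }

  /** Depth-first walk: from each node, try every remaining query term as the next path step. */
  private void collect(Node<M> node, List<String> sorted, int from, List<M> result) {
    result.addAll(node.matches);
    for (int i = from; i < sorted.size(); i++) {
      Node<M> child = node.children.get(sorted.get(i));
      if (child != null) {
        collect(child, sorted, i + 1, result);
      }
    }
  }

  public static void main(String[] args) {
    TrieSubsetMatcherSketch<String> matcher = new TrieSubsetMatcherSketch<>();
    matcher.addSubset(List.of("cheap", "flights"), "rule1");
    matcher.addSubset(List.of("flights"), "rule2");
    matcher.addSubset(List.of("hotel"), "rule3");
    // Query terms may arrive in any order; rule1 and rule2 are subsets, rule3 is not.
    System.out.println(matcher.findSubsetsMatching(List.of("paris", "cheap", "flights")));
  }
}
```

Because both rule terms and query terms are sorted, each rule path is reachable in exactly one way, so a rule is never reported twice. The results come back in traversal order rather than rule order, which is consistent with the "efficient but unsorted" remark made about TrieSubsetMatcher elsewhere in this thread.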
[jira] [Created] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
Bruno Roustant created LUCENE-8159:
--------------------------------------

             Summary: Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
                 Key: LUCENE-8159
                 URL: https://issues.apache.org/jira/browse/LUCENE-8159
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/search
    Affects Versions: trunk
            Reporter: Bruno Roustant

When a query is composed of multiple AutomatonQuery instances that share the same automaton but target different fields, it is much more efficient to reuse the already compiled automaton by copying it directly and changing only the target field.
[jira] [Updated] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant updated LUCENE-8159:
-----------------------------------
    Attachment: LUCENE-8159.patch
                0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch

> Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-8159
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8159
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: trunk
>            Reporter: Bruno Roustant
>            Priority: Major
>         Attachments: 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, LUCENE-8159.patch
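The copy-constructor idea in LUCENE-8159 can be shown without Lucene by using java.util.regex.Pattern as a stand-in for a compiled automaton. FieldPatternQuery and the compilations counter are hypothetical; the sketch only illustrates that the expensive compilation happens once and the per-field copies share the result.

```java
import java.util.regex.Pattern;

class CompiledAutomatonReuseSketch {

  /** Counts how many times the expensive compilation step runs (for illustration only). */
  static int compilations = 0;

  /** Hypothetical query bound to one field; Pattern stands in for a compiled automaton. */
  static final class FieldPatternQuery {
    final String field;
    final Pattern compiled;

    /** Regular constructor: pays the compilation cost. */
    FieldPatternQuery(String field, String regex) {
      this.field = field;
      this.compiled = Pattern.compile(regex);
      compilations++;
    }

    /** Copy constructor: reuses the already compiled pattern and changes only the field. */
    FieldPatternQuery(FieldPatternQuery other, String newField) {
      this.field = newField;
      this.compiled = other.compiled; // Pattern instances are immutable, so sharing is safe
    }

    boolean matches(String fieldName, String value) {
      return field.equals(fieldName) && compiled.matcher(value).matches();
    }
  }

  public static void main(String[] args) {
    FieldPatternQuery title = new FieldPatternQuery("title", "sol.*");
    FieldPatternQuery body = new FieldPatternQuery(title, "body");
    FieldPatternQuery tags = new FieldPatternQuery(title, "tags");
    System.out.println(compilations);                 // prints 1
    System.out.println(body.matches("body", "solr")); // prints true
    System.out.println(tags.matches("body", "solr")); // prints false
  }
}
```

Sharing is only safe when the compiled form is immutable, which holds for Pattern in this sketch and is assumed to hold for the compiled automaton the issue refers to.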
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant updated SOLR-11865:
----------------------------------
    Attachment: SOLR-11865.patch
                0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch

> Refactor QueryElevationComponent to prepare query subset matching
> -----------------------------------------------------------------
>
>                 Key: SOLR-11865
>                 URL: https://issues.apache.org/jira/browse/SOLR-11865
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: SearchComponents - other
>    Affects Versions: master (8.0)
>            Reporter: Bruno Roustant
>            Priority: Minor
>              Labels: QueryComponent
>             Fix For: master (8.0)
>         Attachments: 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, SOLR-11865.patch
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant updated SOLR-11865:
----------------------------------
    Attachment: (was: SOLR-11865.patch)

> Refactor QueryElevationComponent to prepare query subset matching
> -----------------------------------------------------------------
>
>                 Key: SOLR-11865
>                 URL: https://issues.apache.org/jira/browse/SOLR-11865
>            Priority: Minor
>              Labels: QueryComponent
>             Fix For: master (8.0)
>         Attachments: 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 0002-Refactor-QueryElevationComponent-after-review.patch
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant updated SOLR-11865:
----------------------------------
    Attachment: SOLR-11865.patch
                0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch

> Refactor QueryElevationComponent to prepare query subset matching
> -----------------------------------------------------------------
>
>                 Key: SOLR-11865
>                 URL: https://issues.apache.org/jira/browse/SOLR-11865
>            Priority: Minor
>              Labels: QueryComponent
>             Fix For: master (8.0)
>         Attachments: 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, 0002-Refactor-QueryElevationComponent-after-review.patch, 0003-Remove-exception-handlers-and-refactor-getBoostDocs.patch, SOLR-11865.patch
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426589#comment-16426589 ]

Bruno Roustant commented on SOLR-11865:

New delta patch with the modification mentioned. Eventually I'll squash the commits to produce a single patch that should be supported by "Yetus" (currently I simply use git format-patch and it produces three separate patch files for three commits).
[jira] [Comment Edited] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16426589#comment-16426589 ]

Bruno Roustant edited comment on SOLR-11865 at 4/5/18 7:44 AM:

New delta patch with the modifications mentioned. Eventually I'll squash the commits to produce a single patch that should be supported by "Yetus" (currently I simply use git format-patch and it produces three separate patch files for three commits).

was (Author: bruno.roustant):
New delta patch with the modification mentioned. Eventually I'll squash the commits to produce a single patch that should be supported by "Yetus" (currently I simply use git format-patch and it produces three separate patch files for three commits).
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450065#comment-16450065 ]

Bruno Roustant commented on SOLR-11865:

Sorry for the delay. Yes, if you can take it from here, that would be awesome!
* Getters for defaults: you're right, there is no need. Please remove them.
* keepElevationPriority as a constant in QEC: good point.
* keepElevationPriority meaning: Actually the comment is not right, maybe the sorting has changed since the time I wrote this comment. I don't think it is linked anymore to forceElevation since the ElevationComparatorSource can be added as a SortField even if forceElevation=false when one sort by score. The point is
- with keepElevationPriority=true, the behavior is unchanged, the elevated documents (on top) are sorted by the order of the elevation rules and elevated ids in the config file.
- with keepElevationPriority=false, the behavior changes, the elevated documents (still on top) are in any order, and they may be re-ordered by other sort fields (this will allow the use of the efficient but unsorted TrieSubsetMatcher in the other patch).
[jira] [Comment Edited] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450065#comment-16450065 ]

Bruno Roustant edited comment on SOLR-11865 at 4/24/18 3:37 PM:

Sorry for the delay. Yes, if you can take it from here, that would be awesome!
* Getters for defaults: you're right, there is no need. Please remove them.
* keepElevationPriority as a constant in QEC: good point.
* keepElevationPriority meaning: Actually the comment is not right; maybe the sorting has changed since I wrote it. I don't think it is linked to forceElevation anymore, since the ElevationComparatorSource can be added as a SortField even if forceElevation=false when one sorts by score. The point is:
- with keepElevationPriority=true, the behavior is unchanged: the elevated documents (on top) are sorted by the order of the elevation rules and elevated ids in the config file.
- with keepElevationPriority=false, the behavior changes: the elevated documents (still on top) are in any order (this will allow the use of the efficient but unsorted TrieSubsetMatcher in the other patch), and they may be re-ordered by other sort fields.

was (Author: bruno.roustant):
Sorry for the delay. Yes, if you can take it from here, that would be awesome!
* Getters for defaults: you're right, there is no need. Please remove them.
* keepElevationPriority as a constant in QEC: good point.
* keepElevationPriority meaning: Actually the comment is not right, maybe the sorting has changed since the time I wrote this comment. I don't think it is linked anymore to forceElevation since the ElevationComparatorSource can be added as a SortField even if forceElevation=false when one sort by score. The point is
- with keepElevationPriority=true, the behavior is unchanged, the elevated documents (on top) are sorted by the order of the elevation rules and elevated ids in the config file.
- with keepElevationPriority=false, the behavior changes, the elevated documents (still on top) are in any order, and they may be re-ordered by other sort fields (this will allow the use of the efficient but unsorted TrieSubsetMatcher in the other patch).
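The two keepElevationPriority behaviors described in this comment can be sketched as a comparator choice. Everything here is hypothetical (the Doc class, the rank() helper, and the use of score as the "other sort field"); it is not the ElevationComparatorSource from Solr, only an illustration of "elevated first, then either config order or whatever the remaining sort says".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ElevationPrioritySketch {

  /** Hypothetical search hit: an id and a relevance score. */
  static final class Doc {
    final String id;
    final float score;

    Doc(String id, float score) {
      this.id = id;
      this.score = score;
    }
  }

  /**
   * Ranks docs with elevated ids on top. elevatedIds is assumed to be in config-file order.
   * With keepElevationPriority=true the elevated docs keep that order; with false they are
   * ordered among themselves by the remaining sort (score descending in this sketch).
   */
  static List<String> rank(List<Doc> docs, List<String> elevatedIds, boolean keepElevationPriority) {
    Map<String, Integer> priority = new HashMap<>();
    for (int i = 0; i < elevatedIds.size(); i++) {
      priority.put(elevatedIds.get(i), i);
    }
    // false sorts before true, so elevated docs (key == false) come first.
    Comparator<Doc> order = Comparator.comparing((Doc d) -> !priority.containsKey(d.id));
    if (keepElevationPriority) {
      order = order.thenComparingInt(d -> priority.getOrDefault(d.id, Integer.MAX_VALUE));
    }
    order = order.thenComparing(Comparator.comparingDouble((Doc d) -> d.score).reversed());

    List<Doc> sorted = new ArrayList<>(docs);
    sorted.sort(order);
    List<String> ids = new ArrayList<>();
    for (Doc d : sorted) {
      ids.add(d.id);
    }
    return ids;
  }

  public static void main(String[] args) {
    List<Doc> docs = List.of(
        new Doc("A", 0.2f), new Doc("B", 0.9f), new Doc("C", 0.5f), new Doc("D", 0.7f));
    List<String> elevated = List.of("A", "C");
    System.out.println(rank(docs, elevated, true));  // prints [A, C, B, D]
    System.out.println(rank(docs, elevated, false)); // prints [C, A, B, D]
  }
}
```

In both modes the elevated documents stay ahead of B and D; only their relative order changes, which is why a provider that returns matches in no particular order becomes usable when keepElevationPriority is false.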
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420302#comment-16420302 ]

Bruno Roustant commented on SOLR-11865:

4- No "Can be overridden by extending this class". Sure. Removed.
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420301#comment-16420301 ]

Bruno Roustant commented on SOLR-11865:

3- The indentation around line ~671 (contents of the for loop) is messed up. I didn't change that part. I'll try to fix the indentation.
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant updated SOLR-11865:
----------------------------------
    Attachment: SOLR-11865.patch
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruno Roustant updated SOLR-11865:
----------------------------------
    Attachment: (was: SOLR-11865.patch)
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420300#comment-16420300 ]

Bruno Roustant commented on SOLR-11865:

2- ElevationProvider should be immutable and simplified: Good point. createElevationProvider() accepts the elevationBuilderMap. getElevationForQuery() does not throw IOException. ElevationProvider.size is used by tests to verify the number of parsed rules. I added a @VisibleForTesting annotation.
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420322#comment-16420322 ]

Bruno Roustant commented on SOLR-11865:

10- seen.contains(id) == false. I didn't know this Lucene practice. It explains why I see this strange construct. "I recommend against modifying existing lines" - that's what I tried (see points 3,5,6 above) and I thought this "!seen.contains(id)" was tiny and harmless.
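For readers unfamiliar with the construct discussed in point 10, the two spellings are equivalent in Java; the "== false" form is simply harder to overlook than a leading "!". The snippet below is a hypothetical de-duplication loop written in that style (the method name and the list of ids are invented for illustration and are not taken from the patch).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SeenSetSketch {

  /** Returns the ids in first-seen order with duplicates dropped. */
  static List<String> distinctInOrder(List<String> ids) {
    Set<String> seen = new HashSet<>();
    List<String> result = new ArrayList<>();
    for (String id : ids) {
      // Same meaning as "!seen.contains(id)", spelled out so the negation stands out.
      if (seen.contains(id) == false) {
        seen.add(id);
        result.add(id);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(distinctInOrder(List.of("doc1", "doc2", "doc1", "doc3"))); // prints [doc1, doc2, doc3]
  }
}
```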
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420317#comment-16420317 ] Bruno Roustant commented on SOLR-11865: --- 8- Use a UnaryOperator instead of IndexedValueProvider. Good point. It is still clear with less code.
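Point 8 can be illustrated with a minimal sketch. It assumes that IndexedValueProvider was a single-method interface mapping a raw value to its indexed form; its real signature is not shown in this thread. The class IndexedValueDemo, the method toIndexedValues, and the lower-casing operator are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class IndexedValueDemo {
  // Hypothetical helper: applies the operator to every raw value.
  // The standard UnaryOperator<String> stands in for a custom one-method interface.
  static List<String> toIndexedValues(List<String> rawValues,
                                      UnaryOperator<String> indexedValueProvider) {
    List<String> result = new ArrayList<>(rawValues.size());
    for (String raw : rawValues) {
      result.add(indexedValueProvider.apply(raw));
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumed transformation for the demo: lower-casing with a fixed locale.
    UnaryOperator<String> lowerCase = s -> s.toLowerCase(Locale.ROOT);
    System.out.println(toIndexedValues(Arrays.asList("Doc1", "DOC2"), lowerCase));
  }
}
```

Using the standard functional interface removes one type from the API while callers can still pass a lambda or a method reference.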
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420329#comment-16420329 ] Bruno Roustant commented on SOLR-11865: --- 11- subsetMatch flag in ElevatingQuery. Yes, the idea is to support some queries with subset match, and others without. This will be supported by the next ElevationProvider in the next patch.
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420311#comment-16420311 ] Bruno Roustant commented on SOLR-11865: --- 6- Use {{localBoosts.addAll(boosted.keySet());}} at line ~661 instead of manual looping. Again, I didn't change that (and I didn't want to touch existing code without reason). I fixed it by removing localBoosts altogether, since it was an exact copy of boosts.keySet() (the boosts parameter is a map).
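The change described in point 6 can be sketched as follows. This is only an illustration: the key and value types (String, Integer) and the class name KeySetDemo are assumptions, and the actual code in QueryElevationComponent is not reproduced in this thread.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class KeySetDemo {
  // Before: a manual loop builds a set that merely mirrors the map's keys.
  static Set<String> copyKeysManually(Map<String, Integer> boosts) {
    Set<String> localBoosts = new HashSet<>();
    for (String key : boosts.keySet()) {
      localBoosts.add(key);
    }
    return localBoosts;
  }

  // After: the key-set view is used directly, so no copy is needed.
  static Set<String> useKeySetDirectly(Map<String, Integer> boosts) {
    return boosts.keySet();
  }

  public static void main(String[] args) {
    Map<String, Integer> boosts = new LinkedHashMap<>();
    boosts.put("doc1", 2);
    boosts.put("doc2", 1);
    // Both approaches expose the same elements.
    System.out.println(copyKeysManually(boosts).equals(useKeySetDirectly(boosts)));
  }
}
```

Note that keySet() returns a live view backed by the map, so dropping the copy is only safe when the set is not modified independently of the map.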
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420308#comment-16420308 ] Bruno Roustant commented on SOLR-11865: --- 5- Change comparator docVal (~line 1318) to use getOrDefault. I didn't change that. Fixed.
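For point 5, here is a minimal sketch of what switching to getOrDefault typically looks like. The map contents, the default value 0, and the names GetOrDefaultDemo, priorityWithNullCheck and priorityWithGetOrDefault are assumptions for illustration; the comparator at line ~1318 is not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;

class GetOrDefaultDemo {
  // Before: explicit null check for ids that have no recorded priority.
  static int priorityWithNullCheck(Map<String, Integer> priority, String id) {
    Integer p = priority.get(id);
    return p == null ? 0 : p;
  }

  // After: Map.getOrDefault expresses the same lookup in one call.
  static int priorityWithGetOrDefault(Map<String, Integer> priority, String id) {
    return priority.getOrDefault(id, 0);
  }

  public static void main(String[] args) {
    Map<String, Integer> priority = new HashMap<>();
    priority.put("doc1", 5);
    System.out.println(priorityWithGetOrDefault(priority, "doc1"));
    System.out.println(priorityWithGetOrDefault(priority, "unknown"));
  }
}
```

The two forms differ only when the map stores explicit null values: getOrDefault returns the stored null rather than the default, so the equivalence holds for maps without null values.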
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420319#comment-16420319 ] Bruno Roustant commented on SOLR-11865: --- 9- Make the constructor of ElevatingQuery protected. Done.
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420272#comment-16420272 ] Bruno Roustant commented on SOLR-11865: --- 1- InitializationExceptionHandler & LoadingExceptionHandler: At Salesforce (i.e. in a multi-tenant context) we allow each organization admin to update the list of elevation rules dynamically. When some rules are updated, the core corresponding to the organization is updated to reload the elevation rules XML. It is important to note that the organization admin - the person who defines the elevation rules - is not a Solr admin expert. The admin needs clear feedback on any error that may prevent the rules from being loaded. The XML rules are considered dynamic config rather than static config. In its original version, the QueryElevationComponent simply throws an exception. In this new version, it differentiates the error cause and lets an extending class (e.g. a specific Salesforce extension) override the handling of the loading exception and take appropriate actions (logging, warning, etc) instead of simply throwing the Solr exception.
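The overridable exception handling described in point 1 can be sketched roughly as below. Everything here is hypothetical: the enum LoadingFailure, the classes BaseComponent and TenantComponent, the hook handleLoadingException and the toy comma-separated rule format are illustration only and are not the names or signatures used in the patch.

```java
import java.util.ArrayList;
import java.util.List;

class ElevationLoadingDemo {
  /** Hypothetical error causes; the patch's real types are not shown in this thread. */
  enum LoadingFailure { MISSING_RULES, INVALID_RULES }

  /** Base component: by default a loading failure is simply rethrown. */
  static class BaseComponent {
    final int loadRules(String rules) {
      if (rules == null) {
        return handleLoadingException(LoadingFailure.MISSING_RULES,
            new IllegalStateException("no rules resource"));
      }
      try {
        return parse(rules);
      } catch (IllegalArgumentException e) {
        return handleLoadingException(LoadingFailure.INVALID_RULES, e);
      }
    }

    // Toy parser: counts comma-separated rules and rejects blank entries.
    int parse(String rules) {
      String[] entries = rules.split(",", -1);
      for (String entry : entries) {
        if (entry.trim().isEmpty()) {
          throw new IllegalArgumentException("blank rule");
        }
      }
      return entries.length;
    }

    // Overridable hook: returns the number of rules to use after a failure.
    protected int handleLoadingException(LoadingFailure cause, RuntimeException e) {
      throw e;
    }
  }

  /** Extension for a multi-tenant setup: records a warning and loads no rules. */
  static class TenantComponent extends BaseComponent {
    final List<String> warnings = new ArrayList<>();

    @Override
    protected int handleLoadingException(LoadingFailure cause, RuntimeException e) {
      warnings.add(cause + ": " + e.getMessage());
      return 0;
    }
  }
}
```

With this shape, the default behavior stays "throw", while an extension can turn a bad rules file into feedback for the person who edited it without failing the whole component.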
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420313#comment-16420313 ] Bruno Roustant commented on SOLR-11865: --- 7- In parseExcludedMarkerFieldName and parseEditorialMarkerFieldName. Removed the if () block.
[jira] [Comment Edited] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420322#comment-16420322 ] Bruno Roustant edited comment on SOLR-11865 at 3/30/18 9:30 AM: 10- seen.contains(id) == false. I didn't know this Lucene practice. It explains why I see this strange construct. "I recommend against modifying existing lines" - that's what I tried (see points 3,5,6 above) and I thought this "!seen.contains(id)" was tiny and harmless. And that's a warning highlighted by IntelliJ by the way :) was (Author: bruno.roustant): 10- seen.contains(id) == false. I didn't know this Lucene practice. It explains why I see this strange construct. "I recommend against modifying existing lines" - that's what I tried (see points 3,5,6 above) and I thought this "!seen.contains(id)" was tiny and harmless.
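For readers unfamiliar with the construct discussed in point 10, here is a small sketch showing the explicit "== false" form. The surrounding method, firstOccurrences, and the class name SeenIdsDemo are hypothetical; only the expression seen.contains(id) == false comes from the thread, where it is described as a Lucene practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SeenIdsDemo {
  // Keeps the first occurrence of each id, preserving input order.
  static List<String> firstOccurrences(List<String> ids) {
    Set<String> seen = new HashSet<>();
    List<String> result = new ArrayList<>();
    for (String id : ids) {
      // Explicit "== false" form; logically identical to !seen.contains(id).
      if (seen.contains(id) == false) {
        seen.add(id);
        result.add(id);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(firstOccurrences(Arrays.asList("a", "b", "a", "c", "b")));
  }
}
```

Both spellings compile to the same behavior; the difference is purely one of readability, which is why an IDE may flag one form while a project's style prefers it.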
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-11865: -- Attachment: 0002-Refactor-QueryElevationComponent-after-review.patch > Refactor QueryElevationComponent to prepare query subset matching > - > > Key: SOLR-11865 > URL: https://issues.apache.org/jira/browse/SOLR-11865 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: master (8.0) >Reporter: Bruno Roustant >Priority: Minor > Labels: QueryComponent > Fix For: master (8.0) > > Attachments: > 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, > 0002-Refactor-QueryElevationComponent-after-review.patch, SOLR-11865.patch > > > The goal is to prepare a second improvement to support query terms subset > matching or query elevation rules. > Before that, we need to refactor the QueryElevationComponent. We make it > extendible. We introduce the ElevationProvider interface which will be > implemented later in a second patch to support subset matching. The current > full-query match policy becomes a default simple MapElevationProvider. > - Add overridable methods to handle exceptions during the component > initialization. > - Add overridable methods to provide the default values for config properties. > - No functional change beyond refactoring. > - Adapt unit test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-11865: -- Attachment: (was: SOLR-11865.patch) > Refactor QueryElevationComponent to prepare query subset matching > - > > Key: SOLR-11865 > URL: https://issues.apache.org/jira/browse/SOLR-11865 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: master (8.0) >Reporter: Bruno Roustant >Priority: Minor > Labels: QueryComponent > Fix For: master (8.0) > > > The goal is to prepare a second improvement to support query terms subset > matching or query elevation rules. > Before that, we need to refactor the QueryElevationComponent. We make it > extendible. We introduce the ElevationProvider interface which will be > implemented later in a second patch to support subset matching. The current > full-query match policy becomes a default simple MapElevationProvider. > - Add overridable methods to handle exceptions during the component > initialization. > - Add overridable methods to provide the default values for config properties. > - No functional change beyond refactoring. > - Adapt unit test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-11865: -- Attachment: (was: 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch) > Refactor QueryElevationComponent to prepare query subset matching > - > > Key: SOLR-11865 > URL: https://issues.apache.org/jira/browse/SOLR-11865 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: master (8.0) >Reporter: Bruno Roustant >Priority: Minor > Labels: QueryComponent > Fix For: master (8.0) > > > The goal is to prepare a second improvement to support query terms subset > matching or query elevation rules. > Before that, we need to refactor the QueryElevationComponent. We make it > extendible. We introduce the ElevationProvider interface which will be > implemented later in a second patch to support subset matching. The current > full-query match policy becomes a default simple MapElevationProvider. > - Add overridable methods to handle exceptions during the component > initialization. > - Add overridable methods to provide the default values for config properties. > - No functional change beyond refactoring. > - Adapt unit test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-11865: -- Attachment: (was: 0002-Refactor-QueryElevationComponent-after-review.patch) > Refactor QueryElevationComponent to prepare query subset matching > - > > Key: SOLR-11865 > URL: https://issues.apache.org/jira/browse/SOLR-11865 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: master (8.0) >Reporter: Bruno Roustant >Priority: Minor > Labels: QueryComponent > Fix For: master (8.0) > > > The goal is to prepare a second improvement to support query terms subset > matching or query elevation rules. > Before that, we need to refactor the QueryElevationComponent. We make it > extendible. We introduce the ElevationProvider interface which will be > implemented later in a second patch to support subset matching. The current > full-query match policy becomes a default simple MapElevationProvider. > - Add overridable methods to handle exceptions during the component > initialization. > - Add overridable methods to provide the default values for config properties. > - No functional change beyond refactoring. > - Adapt unit test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-11865: -- Attachment: SOLR-11865.patch 0002-Refactor-QueryElevationComponent-after-review.patch 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch > Refactor QueryElevationComponent to prepare query subset matching > - > > Key: SOLR-11865 > URL: https://issues.apache.org/jira/browse/SOLR-11865 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: SearchComponents - other >Affects Versions: master (8.0) >Reporter: Bruno Roustant >Priority: Minor > Labels: QueryComponent > Fix For: master (8.0) > > Attachments: > 0001-Refactor-QueryElevationComponent-to-introduce-Elevat.patch, > 0002-Refactor-QueryElevationComponent-after-review.patch, SOLR-11865.patch > > > The goal is to prepare a second improvement to support query terms subset > matching or query elevation rules. > Before that, we need to refactor the QueryElevationComponent. We make it > extendible. We introduce the ElevationProvider interface which will be > implemented later in a second patch to support subset matching. The current > full-query match policy becomes a default simple MapElevationProvider. > - Add overridable methods to handle exceptions during the component > initialization. > - Add overridable methods to provide the default values for config properties. > - No functional change beyond refactoring. > - Adapt unit test. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11865) Refactor QueryElevationComponent to prepare query subset matching
[ https://issues.apache.org/jira/browse/SOLR-11865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16420620#comment-16420620 ] Bruno Roustant commented on SOLR-11865: --- [~dsmiley] I uploaded a new patch. Is it better now?
[jira] [Commented] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385824#comment-16385824 ] Bruno Roustant commented on LUCENE-8159: Ok. I'll let you guys decide whether to discard this patch. [~jpountz] I'm curious about searching a lot of fields. {quote}searching over lots of fields is a bad practice {quote} Could you tell me why it is a bad practice? Is it due to the performance impact? Are there other reasons by design? Generally customer organizations love to have lots of fields. While I agree that sometimes they should revisit their data partitioning, there are cases where searching many fields helps (e.g. CRM, field level security, ML ranking model based on field matches). > Add a copy constructor in AutomatonQuery to copy directly the compiled > automaton > > > Key: LUCENE-8159 > URL: https://issues.apache.org/jira/browse/LUCENE-8159 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: trunk >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: > 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, > LUCENE-8159.patch > > > When the query is composed of multiple AutomatonQuery with the same automaton > and which target different fields, it is much more efficient to reuse the > already compiled automaton by copying it directly and just changing the > target field. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378373#comment-16378373 ] Bruno Roustant edited comment on LUCENE-8159 at 2/27/18 10:42 AM: -- {quote}I'd rather like to expose an expert constructor that takes a compiled automaton and expect users to compile the automaton themselves if they plan to reuse it in multiple queries? {quote} I can speak as such a "user" as I'm having this use case. We often build queries with the same prefix/wildcard query for multiple different fields (and sometimes many fields). As a user I really appreciate being able to simply copy a PrefixQuery or WildcardQuery, rather than building the automaton myself. The inner automaton inside PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't want to know myself how it is built. I agree with exposing the compiled automaton. {quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same constructors too? {quote} I indeed prepared the same copy constructors for these classes. I didn't have time to resubmit the patch yet, but that's the idea, yes. was (Author: bruno.roustant): {quote}I'd rather like to expose an expert constructor that takes a compiled automaton and expect users to compile the automaton themselves if they plan to reuse it in multiple queries? {quote} I can speak as such a "user" as I'm having this use case. We often build queries with the same prefix/wildcard query for multiple different fields (and sometimes many fields). As a user I really appreciate being able to simply copy a PrefixQuery or WildcardQuery, rather than building the automaton myself. The inner automaton inside PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't want to know myself how it is built. I agree with exposing the compiled automaton. Although I find the copy constructor easier to use. {quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same constructors too? {quote} I indeed prepared the same copy constructors for these classes. I didn't have time to resubmit the patch yet, but that's the idea, yes.
[jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378373#comment-16378373 ] Bruno Roustant edited comment on LUCENE-8159 at 2/27/18 11:29 AM: -- {quote}I'd rather like to expose an expert constructor that takes a compiled automaton and expect users to compile the automaton themselves if they plan to reuse it in multiple queries? {quote} I can speak as such a "user" as I'm having this use case. We often build queries with the same prefix/wildcard query for multiple different fields (and sometimes many fields - in this case the optimization does help). As a user I really appreciate being able to simply copy a PrefixQuery or WildcardQuery, rather than building the automaton myself. The inner automaton inside PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't want to know myself how it is built. I agree with exposing the compiled automaton. But I still think PrefixQuery and WildcardQuery would benefit from a new constructor. This constructor cannot accept an arbitrary automaton as a parameter, because that could break the prefix/wildcard contract. So, to me, PrefixQuery and WildcardQuery should have their own copy constructors. {quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same constructors too? {quote} I indeed prepared the same copy constructors for these classes. I didn't have time to resubmit the patch yet, but that's the idea, yes. was (Author: bruno.roustant): {quote}I'd rather like to expose an expert constructor that takes a compiled automaton and expect users to compile the automaton themselves if they plan to reuse it in multiple queries? {quote} I can speak as such a "user" as I'm having this use case. We often build queries with the same prefix/wildcard query for multiple different fields (and sometimes many fields - in this case the optimization does help). As a user I really appreciate being able to simply copy a PrefixQuery or WildcardQuery, rather than building the automaton myself. The inner automaton inside PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't want to know myself how it is built. I agree with exposing the compiled automaton. {quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same constructors too? {quote} I indeed prepared the same copy constructors for these classes. I didn't have time to resubmit the patch yet, but that's the idea, yes.
[jira] [Commented] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378373#comment-16378373 ] Bruno Roustant commented on LUCENE-8159: {quote}I'd rather like to expose an expert constructor that takes a compiled automaton and expect users to compile the automaton themselves if they plan to reuse it in multiple queries? {quote} I can speak as such a "user" as I'm having this use case. We often build queries with the same prefix/wildcard query for multiple different fields (and sometimes many fields). As a user I really appreciate to simply copy a PrefixQuery or WildcardQuery, rather than building the automaton myself. The inner automaton inside PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't want to know myself how it is built. I agree with exposing the compiled automaton. Although I find the copy constructor easier to use. {quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same constructors too? {quote} I indeed prepared the same copy constructors for these classes. I didn't have time to resubmit the patch yet, but that's the idea, yes. > Add a copy constructor in AutomatonQuery to copy directly the compiled > automaton > > > Key: LUCENE-8159 > URL: https://issues.apache.org/jira/browse/LUCENE-8159 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: trunk >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: > 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, > LUCENE-8159.patch > > > When the query is composed of multiple AutomatonQuery with the same automaton > and which target different fields, it is much more efficient to reuse the > already compiled automaton by copying it directly and just changing the > target field. 
[jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378373#comment-16378373 ] Bruno Roustant edited comment on LUCENE-8159 at 2/27/18 10:44 AM: -- {quote}I'd rather like to expose an expert constructor that takes a compiled automaton and expect users to compile the automaton themselves if they plan to reuse it in multiple queries? {quote} I can speak as such a "user" as I'm having this use case. We often build queries with the same prefix/wildcard query for multiple different fields (and sometimes many fields - in this case the optimization does help). As a user I really appreciate to simply copy a PrefixQuery or WildcardQuery, rather than building the automaton myself. The inner automaton inside PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't want to know myself how it is built. I agree with exposing the compiled automaton. {quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same constructors too? {quote} I indeed prepared the same copy constructors for these classes. I didn't have time to resubmit the patch yet, but that's the idea, yes. was (Author: bruno.roustant): {quote}I'd rather like to expose an expert constructor that takes a compiled automaton and expect users to compile the automaton themselves if they plan to reuse it in multiple queries? {quote} I can speak as such a "user" as I'm having this use case. We often build queries with the same prefix/wildcard query for multiple different fields (and sometimes many fields). As a user I really appreciate to simply copy a PrefixQuery or WildcardQuery, rather than building the automaton myself. The inner automaton inside PrefixQuery is hidden, and the logic is internal to the PrefixQuery. I don't want to know myself how it is built. I agree with exposing the compiled automaton. {quote}Should PrefixQuery & WildcardQuery & TermRangeQuery have the same constructors too? 
{quote} I indeed prepared the same copy constructors for these classes. I didn't have time to resubmit the patch yet, but that's the idea, yes. > Add a copy constructor in AutomatonQuery to copy directly the compiled > automaton > > > Key: LUCENE-8159 > URL: https://issues.apache.org/jira/browse/LUCENE-8159 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: trunk >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: > 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, > LUCENE-8159.patch > > > When the query is composed of multiple AutomatonQuery with the same automaton > and which target different fields, it is much more efficient to reuse the > already compiled automaton by copying it directly and just changing the > target field. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380407#comment-16380407 ] Bruno Roustant commented on LUCENE-8159: [~rcmuir] could you be a little more explicit? Without context I don't understand why a copy constructor is bad in Java in general. > Add a copy constructor in AutomatonQuery to copy directly the compiled > automaton > > > Key: LUCENE-8159 > URL: https://issues.apache.org/jira/browse/LUCENE-8159 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: trunk >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: > 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, > LUCENE-8159.patch > > > When the query is composed of multiple AutomatonQuery with the same automaton > and which target different fields, it is much more efficient to reuse the > already compiled automaton by copying it directly and just changing the > target field. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380407#comment-16380407 ] Bruno Roustant edited comment on LUCENE-8159 at 2/28/18 2:58 PM: - [~rcmuir] could you be a little more explicit? Without context I don't understand why a copy constructor is bad in Java in general. Do you mean you prefer a copy method? PrefixQuery copy(String field) was (Author: bruno.roustant): [~rcmuir] could you be a little more explicit? Without context I don't understand why a copy constructor is bad in Java in general. > Add a copy constructor in AutomatonQuery to copy directly the compiled > automaton > > > Key: LUCENE-8159 > URL: https://issues.apache.org/jira/browse/LUCENE-8159 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: trunk >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: > 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, > LUCENE-8159.patch > > > When the query is composed of multiple AutomatonQuery with the same automaton > and which target different fields, it is much more efficient to reuse the > already compiled automaton by copying it directly and just changing the > target field. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
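The two shapes discussed in this thread, a copy constructor and a copy(String field) method, can be compared with a small self-contained sketch. It is only an illustration under stated assumptions: SketchPrefixQuery is a hypothetical class, not Lucene's PrefixQuery, and a java.util.regex.Pattern stands in for the compiled automaton as "the expensive thing built once". The attached patch may look quite different.

```java
import java.util.regex.Pattern;

/** Hypothetical stand-in for a prefix query; this is NOT Lucene's PrefixQuery. */
final class SketchPrefixQuery {
  /** Counts how often the expensive compilation step runs (for illustration only). */
  static int compilations = 0;

  private final String field;
  /** Plays the role of the compiled automaton; assumed immutable and safe to share. */
  private final Pattern compiled;

  /** Regular constructor: builds the matcher from the prefix, hiding how it is built. */
  SketchPrefixQuery(String field, String prefix) {
    this.field = field;
    this.compiled = Pattern.compile(Pattern.quote(prefix) + ".*", Pattern.DOTALL);
    compilations++;
  }

  /** Copy constructor: reuses the already compiled matcher and changes only the field. */
  SketchPrefixQuery(SketchPrefixQuery other, String newField) {
    this.field = newField;
    this.compiled = other.compiled;
  }

  /** Copy-method variant of the same idea. */
  SketchPrefixQuery copy(String newField) {
    return new SketchPrefixQuery(this, newField);
  }

  String field() {
    return field;
  }

  boolean matches(String term) {
    return compiled.matcher(term).matches();
  }

  public static void main(String[] args) {
    SketchPrefixQuery title = new SketchPrefixQuery("title", "luc");
    SketchPrefixQuery body = new SketchPrefixQuery(title, "body"); // copy constructor
    SketchPrefixQuery tags = title.copy("tags");                   // copy method
    System.out.println(body.field() + " " + body.matches("lucene"));
    System.out.println(tags.field() + " " + tags.matches("solr"));
    System.out.println("compilations: " + compilations);
  }
}
```

In both variants the caller never sees how the matcher is built, so the prefix contract stays inside the class; the only difference is whether the reuse is spelled as a constructor or as a method.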
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809251#comment-16809251 ] Bruno Roustant commented on LUCENE-8753: {quote}I think this is similar to […] BlockTermsReader/Writer {quote} Indeed similar; it mainly differs from VariableGapTermsIndexWriter in the way it selects the best term to start a block. It is based on the minimal distinguishing prefix. The idea is to make the terms index FST more compact. That way, given a target max heap memory, we can have potentially more blocks, so smaller ones that are scanned faster. This requirement to consume less heap was strong with lucene 7.1, now maybe less with the recent off-heap FST. {quote}Are you also doing something different to encode/decode postings? {quote} No, the postings are written with the regular PostingsWriterBase. {quote}Can you post results on the full wikimediumall? {quote} Good point. Will do tomorrow. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Created] (LUCENE-8753) New PostingFormat - UniformSplit
Bruno Roustant created LUCENE-8753: -- Summary: New PostingFormat - UniformSplit Key: LUCENE-8753 URL: https://issues.apache.org/jira/browse/LUCENE-8753 Project: Lucene - Core Issue Type: Improvement Components: core/codecs Affects Versions: 8.0 Reporter: Bruno Roustant Attachments: Uniform Split Technique.pdf This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 objectives: - Clear design and simple code. - Easily extensible, for both the logic and the index format. - Light memory usage with a very compact FST. - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. (the pdf attached explains visually the technique in more details) The principle is to split the list of terms into blocks and use a FST to access the block, but not as a prefix trie, rather with a seek-floor pattern. For the selection of the blocks, there is a target average block size (number of terms), with an allowed delta variation (10%) to compare the terms and select the one with the minimal distinguishing prefix. There are also several optimizations inside the block to make it more compact and speed up the loading/scanning. The performance obtained is interesting with the luceneutil benchmark, comparing UniformSplit with BlockTree. Find it in the first comment. Although the precise percentages vary between runs, three main points: - TermQuery and PhraseQuery are improved. - PrefixQuery and WildcardQuery are ok. - Fuzzy queries are clearly less performant, because BlockTree is so optimized for them. Compared to BlockTree, FST size is reduced by 15%, and segment writing time is reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree. This initial version passes all Lucene tests. Use “ant test -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. Subjectively, we think we have fulfilled our goal of code simplicity. 
And we have already exercised this PostingsFormat extensibility to create a different flavor for our own use-case. Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
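The block selection described in the proposal (a target block size, an allowed delta, and a preference for the term with the minimal distinguishing prefix) can be sketched as follows. This is a simplified reading of the description, not the attached implementation: BlockSplitSketch, mdpLength and blockStarts are hypothetical names, terms are plain Java strings instead of byte sequences, and the delta is an absolute number of terms (the 10% in the proposal would correspond to targetSize / 10).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of block selection by minimal distinguishing prefix (MDP). */
final class BlockSplitSketch {

  /**
   * Length of the shortest prefix of term that distinguishes it from the previous term.
   * Assumes previous sorts strictly before term.
   */
  static int mdpLength(String previous, String term) {
    int max = Math.min(previous.length(), term.length());
    int i = 0;
    while (i < max && previous.charAt(i) == term.charAt(i)) {
      i++;
    }
    return i + 1;
  }

  /**
   * Returns the indexes of the terms that start a block. Each next block start is chosen
   * in the window [start + targetSize - delta, start + targetSize + delta] as the term
   * with the shortest MDP (first one wins on ties). The last block may be smaller.
   */
  static List<Integer> blockStarts(List<String> sortedTerms, int targetSize, int delta) {
    if (targetSize - delta < 1) {
      throw new IllegalArgumentException("targetSize - delta must be at least 1");
    }
    List<Integer> starts = new ArrayList<>();
    if (sortedTerms.isEmpty()) {
      return starts;
    }
    starts.add(0);
    int blockStart = 0;
    while (true) {
      int lo = blockStart + targetSize - delta;
      int hi = Math.min(blockStart + targetSize + delta, sortedTerms.size() - 1);
      if (lo > hi) {
        break; // the remaining terms all go into the current block
      }
      int best = lo;
      for (int i = lo + 1; i <= hi; i++) {
        int candidate = mdpLength(sortedTerms.get(i - 1), sortedTerms.get(i));
        int current = mdpLength(sortedTerms.get(best - 1), sortedTerms.get(best));
        if (candidate < current) {
          best = i;
        }
      }
      starts.add(best);
      blockStart = best;
    }
    return starts;
  }

  public static void main(String[] args) {
    List<String> terms = Arrays.asList(
        "apple", "apply", "apricot", "banana", "band", "bandit", "cat", "cater", "dog");
    // Prints the indexes of the block-start terms chosen for target size 3, delta 1.
    System.out.println(blockStarts(terms, 3, 1));
  }
}
```

With the sample terms, the window after "apple" contains "apricot", "banana" and "band"; "banana" differs from its predecessor at the first character, so it gives the shortest block key and is picked as the next block start. Shorter keys are what keeps the terms index compact.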
[jira] [Updated] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-8753: --- Attachment: luceneutil.benchmark.txt > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. 
> This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
[jira] [Updated] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated LUCENE-8753: --- Description: This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 objectives: - Clear design and simple code. - Easily extensible, for both the logic and the index format. - Light memory usage with a very compact FST. - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. (the pdf attached explains visually the technique in more details) The principle is to split the list of terms into blocks and use a FST to access the block, but not as a prefix trie, rather with a seek-floor pattern. For the selection of the blocks, there is a target average block size (number of terms), with an allowed delta variation (10%) to compare the terms and select the one with the minimal distinguishing prefix. There are also several optimizations inside the block to make it more compact and speed up the loading/scanning. The performance obtained is interesting with the luceneutil benchmark, comparing UniformSplit with BlockTree. Find it in the first comment and also attached for better formatting. Although the precise percentages vary between runs, three main points: - TermQuery and PhraseQuery are improved. - PrefixQuery and WildcardQuery are ok. - Fuzzy queries are clearly less performant, because BlockTree is so optimized for them. Compared to BlockTree, FST size is reduced by 15%, and segment writing time is reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree. This initial version passes all Lucene tests. Use “ant test -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. Subjectively, we think we have fulfilled our goal of code simplicity. And we have already exercised this PostingsFormat extensibility to create a different flavor for our own use-case. 
Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley was: This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 objectives: - Clear design and simple code. - Easily extensible, for both the logic and the index format. - Light memory usage with a very compact FST. - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. (the pdf attached explains visually the technique in more details) The principle is to split the list of terms into blocks and use a FST to access the block, but not as a prefix trie, rather with a seek-floor pattern. For the selection of the blocks, there is a target average block size (number of terms), with an allowed delta variation (10%) to compare the terms and select the one with the minimal distinguishing prefix. There are also several optimizations inside the block to make it more compact and speed up the loading/scanning. The performance obtained is interesting with the luceneutil benchmark, comparing UniformSplit with BlockTree. Find it in the first comment. Although the precise percentages vary between runs, three main points: - TermQuery and PhraseQuery are improved. - PrefixQuery and WildcardQuery are ok. - Fuzzy queries are clearly less performant, because BlockTree is so optimized for them. Compared to BlockTree, FST size is reduced by 15%, and segment writing time is reduced by 20%. So this PostingsFormat scales to lots of docs, as BlockTree. This initial version passes all Lucene tests. Use “ant test -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. Subjectively, we think we have fulfilled our goal of code simplicity. And we have already exercised this PostingsFormat extensibility to create a different flavor for our own use-case. 
Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf > > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808799#comment-16808799 ] Bruno Roustant commented on LUCENE-8753: Here's the Luceneutil benchmark with the wikimedium500k data set using Java 8. This is a bit dated using Lucene 7.1; it'd be nice to update to master.
Report after iter 19:
Task                   QPS blocktree   StdDev  QPS uniformsplit   StdDev  Pct diff
Fuzzy1                        508.47   (3.8%)            221.37   (0.9%)    -56.5%  (-58% - -53%)
Fuzzy2                        171.73   (6.4%)             80.62   (1.4%)    -53.1%  (-57% - -48%)
PKLookup                      182.47   (2.4%)            149.62   (2.5%)    -18.0%  (-22% - -13%)
Wildcard                     1788.74   (5.9%)           1729.37   (4.5%)     -3.3%  (-12% -   7%)
IntNRQ                       1561.48   (2.1%)           1564.33   (1.9%)      0.2%  ( -3% -   4%)
Prefix3                      1759.69   (5.0%)           1829.74   (4.8%)      4.0%  ( -5% -  14%)
HighTermDayOfYearSort         586.06   (5.4%)            622.34   (8.2%)      6.2%  ( -6% -  20%)
MedPhrase                    1204.85   (5.5%)           1282.89   (7.7%)      6.5%  ( -6% -  20%)
HighSpanNear                  590.88   (4.1%)            629.64   (6.1%)      6.6%  ( -3% -  17%)
OrHighMed                    1101.48   (4.5%)           1220.75   (6.2%)     10.8%  (  0% -  22%)
HighTermMonthSort            2617.10   (2.6%)           2916.34   (4.6%)     11.4%  (  4% -  19%)
HighPhrase                    961.04   (5.5%)           1073.62   (6.0%)     11.7%  (  0% -  24%)
MedSloppyPhrase               604.56  (13.3%)            680.31  (13.7%)     12.5%  (-12% -  45%)
LowSloppyPhrase               954.87   (8.1%)           1075.67   (5.4%)     12.7%  (  0% -  28%)
MedSpanNear                   737.14   (5.8%)            830.68   (8.3%)     12.7%  ( -1% -  28%)
OrHighHigh                    811.57   (5.7%)            915.01   (6.2%)     12.7%  (  0% -  26%)
AndHighMed                   1157.45   (5.3%)           1317.78   (5.1%)     13.9%  (  3% -  25%)
AndHighHigh                  1095.29   (5.7%)           1254.16   (4.9%)     14.5%  (  3% -  26%)
HighSloppyPhrase              880.42   (8.2%)           1009.72   (7.0%)     14.7%  (  0% -  32%)
LowPhrase                    1245.33   (6.0%)           1473.57   (4.4%)     18.3%  (  7% -  30%)
Respell                        81.10  (12.7%)             99.43  (10.3%)     22.6%  (  0% -  52%)
HighTerm                     3733.81   (6.1%)           4599.96   (6.8%)     23.2%  (  9% -  38%)
OrHighLow                    1960.13   (6.2%)           2415.81   (6.0%)     23.2%  ( 10% -  37%)
MedTerm                      4411.60   (4.9%)           5450.56   (5.8%)     23.6%  ( 12% -  35%)
LowSpanNear                  1944.27   (5.3%)           2416.29   (4.5%)     24.3%  ( 13% -  36%)
AndHighLow                   1978.10   (7.6%)           2500.74   (5.8%)     26.4%  ( 12% -  43%)
LowTerm                      4949.24   (4.8%)           6589.86   (5.3%)     33.1%  ( 22% -  45%)
> New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf > > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more compact > and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment. 
> > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808852#comment-16808852 ] Bruno Roustant commented on LUCENE-8753: {quote}Is it due to the fact that it doesn't have the ability to fail lookups early like BlockTree? {quote} This is one cause. While BlockTree builds a kind of prefix-trie and may stop if the prefix is not matched, UniformSplit doesn't, so it loads a block. That said I remarked that PKLookup performance varies a lot. It is sometimes in favor of UniformSplit. Actually I don't know how the benchmark generates the test set. It clearly has an influence on the metric. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. 
> The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
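The difference described in this comment, that a seek-floor lookup loads a block even when the term is absent, can be illustrated with a small sketch. It rests on loud assumptions: SeekFloorSketch is a hypothetical class, a java.util.TreeMap stands in for the FST terms index, and each block is a plain in-memory list, whereas a real implementation reads blocks from the index files.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of a seek-floor term lookup; a TreeMap stands in for the FST. */
final class SeekFloorSketch {
  /** Block key (a short distinguishing prefix of the block's first term) to block terms. */
  private final TreeMap<String, List<String>> blocks = new TreeMap<>();

  /** Counts how many blocks were loaded and scanned (for illustration only). */
  int blocksLoaded = 0;

  void addBlock(String blockKey, List<String> sortedTerms) {
    blocks.put(blockKey, sortedTerms);
  }

  /** Finds the block whose key is the greatest key <= term, then scans that block. */
  boolean contains(String term) {
    Map.Entry<String, List<String>> floor = blocks.floorEntry(term);
    if (floor == null) {
      return false; // term sorts before the first block key: nothing to load
    }
    blocksLoaded++; // the floor lookup cannot tell whether the term exists
    return floor.getValue().contains(term); // linear scan of the block
  }

  public static void main(String[] args) {
    SeekFloorSketch index = new SeekFloorSketch();
    index.addBlock("a", Arrays.asList("apple", "apply", "apricot"));
    index.addBlock("b", Arrays.asList("banana", "band", "bandit"));
    index.addBlock("c", Arrays.asList("cat", "cater", "dog"));

    System.out.println(index.contains("band")); // present: block "b" is scanned
    System.out.println(index.contains("bzzz")); // absent: block "b" is still scanned
    System.out.println("blocks loaded: " + index.blocksLoaded);
  }
}
```

A prefix trie could reject "bzzz" as soon as no path continues after "b"; the floor lookup only knows which block the term would belong to, which is consistent with the primary-key lookup cost mentioned above when many looked-up keys are missing.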
[jira] [Comment Edited] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813171#comment-16813171 ] Bruno Roustant edited comment on LUCENE-8753 at 4/9/19 9:26 AM: I agree. We profiled wikimediumall and we saw that 90% of the time is spent in the scoring, and less than a couple of percent is spent to access the dictionary blocks. Our own use-case is to have multiple small-to-medium cores, the size of wikimedium500k, that's why we studied it more. was (Author: bruno.roustant): I agree. We profile wikimediumall and we saw that 90% of the time is spent in the scoring, and less than a couple of percent is spent to access the dictionary blocks. Our own use-case is to have multiple small-to-medium cores, the size of wikimedium500k, that's why we studied it more. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. 
> There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
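The block selection described in the issue above (a target block size, an allowed delta, and a preference for the term with the minimal distinguishing prefix) can be sketched as follows. This is only an illustration under stated assumptions, not the code of the patch: the class and method names are hypothetical, terms are plain Java strings rather than byte sequences, and the distinguishing prefix of a term is assumed to be measured against the term immediately before it in sorted order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not taken from the UniformSplit patch.
final class UniformSplitSketch {

  /** Length of the shortest prefix of term that still sorts strictly after previous. */
  static int mdpLength(String previous, String term) {
    int max = Math.min(previous.length(), term.length());
    int common = 0;
    while (common < max && previous.charAt(common) == term.charAt(common)) {
      common++;
    }
    return common + 1;
  }

  /**
   * Returns the index of the first term of each block. Each block boundary is
   * chosen within [targetSize - delta, targetSize + delta] terms of the previous
   * boundary, at the term whose distinguishing prefix is the shortest.
   * Assumes sortedTerms is sorted, without duplicates, and 0 <= delta < targetSize.
   */
  static List<Integer> blockStarts(List<String> sortedTerms, int targetSize, int delta) {
    if (delta < 0 || delta >= targetSize) {
      throw new IllegalArgumentException("require 0 <= delta < targetSize");
    }
    List<Integer> starts = new ArrayList<>();
    if (sortedTerms.isEmpty()) {
      return starts;
    }
    starts.add(0);
    int start = 0;
    while (true) {
      int lo = start + targetSize - delta;
      int hi = Math.min(start + targetSize + delta, sortedTerms.size() - 1);
      if (lo > hi) {
        break; // too few terms left: they all stay in the last block
      }
      int best = lo;
      for (int i = lo + 1; i <= hi; i++) {
        int candidate = mdpLength(sortedTerms.get(i - 1), sortedTerms.get(i));
        int current = mdpLength(sortedTerms.get(best - 1), sortedTerms.get(best));
        if (candidate < current) {
          best = i;
        }
      }
      starts.add(best);
      start = best;
    }
    return starts;
  }
}
```

A shorter distinguishing prefix means a shorter key to store in the FST for that block, which is one plausible reading of why the selection prefers it.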
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813106#comment-16813106 ] Bruno Roustant commented on LUCENE-8753:

It took me some time to run the wikimediumall 8 GB index (I didn't anticipate 1h of indexing initially - a little less for UniformSplit; then I had an exception about facets). Then I got results which surprised me: BlockTree and UniformSplit had the same QPS for Term and Phrase queries. I didn't understand why the behavior differs between a small and a large index. Then I thought of 2 explanations:
* A much larger index could mean fewer OS IO cache hits. I ran the benchmark on a 16 GB laptop and a 64 GB desktop, and I got nearly no difference in my test.
* A much larger index could mean more results. So the time spent to score and rank the results could become much larger and diminish the effect of a change in the dictionary. I have no clue there at the moment.

Here is the result of wikimediumall on a 64 GB desktop:

||Task||QPS BT||StdDev BT||QPS CUS||StdDev CUS||Pct diff||
|Fuzzy1|72.81|3.11|21.77|0.71|{color:red}72%{color}-{color:red}67%{color}|
|Fuzzy2|66.77|3.77|20.41|0.67|{color:red}72%{color}-{color:red}66%{color}|
|Respell|8.85|0.64|6.02|0.33|{color:red}40%{color}-{color:red}22%{color}|
|PKLookup|130.83|3.96|121.66|12.37|{color:red}18%{color}-{color:green}5%{color}|
|Wildcard|25.03|1.33|23.93|1.19|{color:red}13%{color}-{color:green}6%{color}|
|HighTermMonthSort|19.03|2.55|18.40|1.56|{color:red}21%{color}-{color:green}21%{color}|
|Prefix3|12.47|0.82|12.10|0.78|{color:red}14%{color}-{color:green}10%{color}|
|LowTerm|182.95|14.94|177.97|18.67|{color:red}19%{color}-{color:green}17%{color}|
|IntNRQ|5.21|0.54|5.09|0.56|{color:red}21%{color}-{color:green}21%{color}|
|MedTerm|90.74|3.99|89.14|4.24|{color:red}10%{color}-{color:green}7%{color}|
|HighTerm|42.54|1.95|41.86|2.00|{color:red}10%{color}-{color:green}8%{color}|
|OrNotHighLow|532.96|16.16|526.86|24.40|{color:red}8%{color}-{color:green}6%{color}|
|HighSloppyPhrase|12.00|0.39|11.90|0.48|{color:red}7%{color}-{color:green}6%{color}|
|OrNotHighMed|53.64|1.08|53.37|1.22|{color:red}4%{color}-{color:green}3%{color}|
|MedSloppyPhrase|31.83|0.59|31.67|0.78|{color:red}4%{color}-{color:green}3%{color}|
|HighPhrase|32.24|0.85|32.09|0.81|{color:red}5%{color}-{color:green}4%{color}|
|LowSloppyPhrase|29.51|0.43|29.40|0.58|{color:red}3%{color}-{color:green}3%{color}|
|AndHighHigh|26.97|0.31|26.88|0.37|{color:red}2%{color}-{color:green}2%{color}|
|MedPhrase|4.95|0.16|4.94|0.15|{color:red}6%{color}-{color:green}6%{color}|
|AndHighMed|50.03|0.72|49.97|0.72|{color:red}2%{color}-{color:green}2%{color}|
|OrNotHighHigh|18.85|0.76|18.85|0.82|{color:red}8%{color}-{color:green}8%{color}|
|OrHighNotHigh|9.35|0.32|9.35|0.35|{color:red}6%{color}-{color:green}7%{color}|
|OrHighLow|15.85|0.59|15.85|0.52|{color:red}6%{color}-{color:green}7%{color}|
|OrHighNotLow|17.56|0.71|17.57|0.70|{color:red}7%{color}-{color:green}8%{color}|
|AndHighLow|284.39|4.41|284.60|5.65|{color:red}3%{color}-{color:green}3%{color}|
|LowPhrase|224.73|4.35|224.97|4.84|{color:red}3%{color}-{color:green}4%{color}|
|OrHighNotMed|13.21|0.49|13.22|0.50|{color:red}7%{color}-{color:green}7%{color}|
|OrHighMed|13.22|0.73|13.30|0.70|{color:red}9%{color}-{color:green}12%{color}|
|OrHighHigh|7.56|0.43|7.62|0.41|{color:red}9%{color}-{color:green}12%{color}|
|BrowseMonthTaxoFacets|7.96|1.92|8.06|1.78|{color:red}36%{color}-{color:green}63%{color}|
|LowSpanNear|11.84|0.19|11.99|0.21|{color:red}2%{color}-{color:green}4%{color}|
|HighTermDayOfYearSort|20.05|1.40|20.31|2.15|{color:red}15%{color}-{color:green}20%{color}|
|BrowseDayOfYearTaxoFacets|7.96|1.91|8.07|1.85|{color:red}37%{color}-{color:green}64%{color}|
|BrowseMonthSSDVFacets|7.95|1.90|8.07|1.87|{color:red}37%{color}-{color:green}64%{color}|
|BrowseDayOfYearSSDVFacets|7.96|1.93|8.08|1.84|{color:red}36%{color}-{color:green}64%{color}|
|MedSpanNear|10.50|0.18|10.67|0.21|{color:red}2%{color}-{color:green}5%{color}|
|BrowseDateTaxoFacets|7.91|1.81|8.07|1.83|{color:red}35%{color}-{color:green}62%{color}|
|HighSpanNear|8.68|0.19|8.88|0.19|{color:red}2%{color}-{color:green}6%{color}|

> New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 >
[jira] [Created] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible
Bruno Roustant created LUCENE-8836: -- Summary: Optimize DocValues TermsDict to continue scanning from the last position when possible Key: LUCENE-8836 URL: https://issues.apache.org/jira/browse/LUCENE-8836 Project: Lucene - Core Issue Type: Improvement Reporter: Bruno Roustant Lucene80DocValuesProducer.TermsDict is used to look up either a term or a term ordinal. Currently it does not have the optimization the FSTEnum has: the ability to continue a sequential scan from where the last lookup ended in the IndexInput. For sparse lookups (when searching only a few terms or ordinals) it is not an issue. But for multiple lookups in a row this optimization could avoid re-scanning all the terms from the block start (since they are delta encoded). This patch proposes the optimization. To estimate the gain, we ran 3 Lucene tests while counting the seeks and the term reads in the IndexInput, with and without the optimization: TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term reads. TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads. TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 82% term reads. In some cases, when scanning many terms in lexicographical order, the optimization saves a lot. In other cases, when only looking up a few sparse terms, the optimization brings no improvement but no penalty either. It seems worthwhile to always have it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
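The idea in LUCENE-8836 can be illustrated with a small in-memory model. This is a sketch under stated assumptions and not the actual TermsDict code: the class name, the String-based delta encoding and the termReads counter are all hypothetical, and a real implementation reads bytes from an IndexInput. Each block restarts the delta encoding, so a lookup normally decodes from the block start. The sketch resumes from the last decoded position when the requested ordinal is in the same block and not behind it.

```java
// Hypothetical in-memory model; not the Lucene80DocValuesProducer.TermsDict code.
final class ResumableTermsDictSketch {
  private final int[] prefixLength; // chars shared with the previous term (0 at block start)
  private final String[] suffix;    // remaining chars of each term
  private final int blockSize;
  private int currentOrd = -1;      // ordinal of the last decoded term, -1 if none
  private String currentTerm = "";
  int termReads = 0;                // number of decoded terms, to compare strategies

  ResumableTermsDictSketch(String[] sortedTerms, int blockSize) {
    this.blockSize = blockSize;
    prefixLength = new int[sortedTerms.length];
    suffix = new String[sortedTerms.length];
    for (int i = 0; i < sortedTerms.length; i++) {
      int common = 0;
      if (i % blockSize != 0) { // the first term of a block is stored in full
        String previous = sortedTerms[i - 1];
        int max = Math.min(previous.length(), sortedTerms[i].length());
        while (common < max && previous.charAt(common) == sortedTerms[i].charAt(common)) {
          common++;
        }
      }
      prefixLength[i] = common;
      suffix[i] = sortedTerms[i].substring(common);
    }
  }

  /** Returns the term with the given ordinal, decoding as few terms as possible. */
  String termAt(int ord) {
    if (ord < 0 || ord >= suffix.length) {
      throw new IndexOutOfBoundsException("ord=" + ord);
    }
    int blockStart = (ord / blockSize) * blockSize;
    boolean canResume = currentOrd >= blockStart && currentOrd <= ord;
    if (!canResume) {
      // Fall back to the block start, as a lookup without the optimization always does.
      currentOrd = blockStart - 1;
      currentTerm = "";
    }
    while (currentOrd < ord) {
      currentOrd++;
      currentTerm = currentTerm.substring(0, prefixLength[currentOrd]) + suffix[currentOrd];
      termReads++;
    }
    return currentTerm;
  }
}
```

With increasing ordinals inside one block, each lookup decodes only the terms between the previous position and the target. A backward lookup, or a lookup in another block, costs the same as before, which matches the observation in the issue that sparse lookups are not penalized.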
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839429#comment-16839429 ] Bruno Roustant commented on LUCENE-8753: Beyond the performance aspects, we developed UniformSplit to be extensible. To give an idea of how it can be extended, I have added a new PR#676: SharedTerms UniformSplit. The use-case is when there are many fields. We want to take advantage of the FST property to share the terms between all the fields, by replacing one FST per field by a single FST containing the shared terms. In this case each term is stored only once in the block file, and its block line contains the TermState for each different field for which the term occurs. term A -> field1 TermState, field2 TermState, field3 TermState term B -> field3 TermState, field5 TermState The FST is compact and this posting format also unlocks the possibility to cache when the same term is searched in many fields (but this is not part of this PR). My goal here is to showcase the extensibility of this posting format. This extension is in a separate sub-package sharedterms and is quite concise. (the only tricky part is the custom merge to merge efficiently two segments by accessing directly the sharedterms posting format) > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. 
> (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
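The shared-terms layout mentioned in the comment above (each term stored once, with one TermState per field in its block line) can be pictured with a small model. This is a sketch under assumptions, not the code of PR#676: the class name is hypothetical, a TreeMap stands in for the FST plus block file, and a long value stands in for a real TermState.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a dictionary shared by all fields; not the sharedterms package.
final class SharedTermsSketch {
  // term -> (field -> term state); each term is stored once, whatever the number of fields.
  private final TreeMap<String, Map<String, Long>> dictionary = new TreeMap<>();

  /** Records that the term occurs in the field, with a stand-in value for its TermState. */
  void add(String field, String term, long termState) {
    dictionary.computeIfAbsent(term, t -> new LinkedHashMap<>()).put(field, termState);
  }

  /** Returns the stand-in TermState, or null if the term does not occur in the field. */
  Long termState(String field, String term) {
    Map<String, Long> line = dictionary.get(term);
    return line == null ? null : line.get(field);
  }

  /** Number of distinct terms stored, independent of how many fields contain them. */
  int storedTermCount() {
    return dictionary.size();
  }
}
```

Adding "A" for field1, field2 and field3 and "B" for field3 and field5 stores only two terms, mirroring the "term A -> ..., term B -> ..." example in the comment.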
[jira] [Updated] (SOLR-11866) Support efficient subset matching in query elevation rules
[ https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-11866: -- Attachment: (was: 0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch) > Support efficient subset matching in query elevation rules > -- > > Key: SOLR-11866 > URL: https://issues.apache.org/jira/browse/SOLR-11866 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: SOLR-11866.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Leverages the SOLR-11865 refactoring by introducing a > SubsetMatchElevationProvider in QueryElevationComponent. This provider calls > a new util class TrieSubsetMatcher to efficiently match all query elevation > rules which subset is contained by the current query list of terms. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-11866) Support efficient subset matching in query elevation rules
[ https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruno Roustant updated SOLR-11866: -- Attachment: (was: SOLR-11866.patch) > Support efficient subset matching in query elevation rules > -- > > Key: SOLR-11866 > URL: https://issues.apache.org/jira/browse/SOLR-11866 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Leverages the SOLR-11865 refactoring by introducing a > SubsetMatchElevationProvider in QueryElevationComponent. This provider calls > a new util class TrieSubsetMatcher to efficiently match all query elevation > rules which subset is contained by the current query list of terms. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11866) Support efficient subset matching in query elevation rules
[ https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883816#comment-16883816 ] Bruno Roustant commented on SOLR-11866: --- I have updated with PR [#780|https://github.com/apache/lucene-solr/pull/780]. Should I remove the obsolete patch files from this Jira issue? > Support efficient subset matching in query elevation rules > -- > > Key: SOLR-11866 > URL: https://issues.apache.org/jira/browse/SOLR-11866 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: > 0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch, > SOLR-11866.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Leverages the SOLR-11865 refactoring by introducing a > SubsetMatchElevationProvider in QueryElevationComponent. This provider calls > a new util class TrieSubsetMatcher to efficiently match all query elevation > rules which subset is contained by the current query list of terms. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-11866) Support efficient subset matching in query elevation rules
[ https://issues.apache.org/jira/browse/SOLR-11866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883819#comment-16883819 ] Bruno Roustant commented on SOLR-11866: --- Also, the doc will need to be updated to explain the support of the new match="subset" param in the elevation rule (in addition to match="exact"). > Support efficient subset matching in query elevation rules > -- > > Key: SOLR-11866 > URL: https://issues.apache.org/jira/browse/SOLR-11866 > Project: Solr > Issue Type: Improvement > Components: SearchComponents - other >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: > 0001-New-SubsetMatchElevationProvider-in-QueryElevationCo.patch, > SOLR-11866.patch > > Time Spent: 10m > Remaining Estimate: 0h > > Leverages the SOLR-11865 refactoring by introducing a > SubsetMatchElevationProvider in QueryElevationComponent. This provider calls > a new util class TrieSubsetMatcher to efficiently match all query elevation > rules which subset is contained by the current query list of terms. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
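Subset matching as described in this issue (find every elevation rule whose terms are all contained in the query terms) can be sketched with a trie over sorted terms. This is a generic illustration of the technique and not the TrieSubsetMatcher from the patch: the class name and methods are hypothetical, and terms are compared as plain strings without any analysis.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of subset matching with a trie; not Solr's TrieSubsetMatcher.
final class SubsetMatcherSketch<V> {
  private static final class Node<V> {
    final Map<String, Node<V>> children = new HashMap<>();
    final List<V> values = new ArrayList<>();
  }

  private final Node<V> root = new Node<>();

  /** Registers a rule: its terms are deduplicated, sorted, and stored as one trie path. */
  void addRule(Collection<String> ruleTerms, V value) {
    Node<V> node = root;
    for (String term : new TreeSet<>(ruleTerms)) {
      node = node.children.computeIfAbsent(term, t -> new Node<>());
    }
    node.values.add(value);
  }

  /** Returns the values of all rules whose term set is a subset of the query terms. */
  List<V> match(Collection<String> queryTerms) {
    List<String> sorted = new ArrayList<>(new TreeSet<>(queryTerms));
    List<V> matches = new ArrayList<>();
    collect(root, sorted, 0, matches);
    return matches;
  }

  // Follows every trie path that can be spelled with the query terms taken in sorted order.
  private void collect(Node<V> node, List<String> sorted, int from, List<V> matches) {
    matches.addAll(node.values);
    for (int i = from; i < sorted.size(); i++) {
      Node<V> child = node.children.get(sorted.get(i));
      if (child != null) {
        collect(child, sorted, i + 1, matches);
      }
    }
  }
}
```

Because both rule terms and query terms are sorted, the search only descends into children that exist in the trie, rather than testing every rule against the query one by one.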
[jira] [Created] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState
Bruno Roustant created LUCENE-8906: -- Summary: Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState Key: LUCENE-8906 URL: https://issues.apache.org/jira/browse/LUCENE-8906 Project: Lucene - Core Issue Type: Improvement Components: core/codecs Reporter: Bruno Roustant Lucene50PostingsReader is the public API that offers the postings() method to read the postings. Any PostingsFormat can use it (as well as Lucene50PostingsWriter) to read/write postings. But the postings() method asks for a (public) BlockTermState param which is internally cast to the private IntBlockTermState. This BlockTermState is provided by Lucene50PostingsReader.newTermState(). public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, PostingsEnum reuse, int flags) This makes it impossible for a custom PostingsFormat that customizes the block file structure to use this postings() method with its own (Int)BlockTermState, because it cannot access the FP fields of the IntBlockTermState returned by PostingsReaderBase.newTermState(). Proposed change: * Either make IntBlockTermState public, as well as its fields. * Or replace it with an interface in the postings() method. In this case the IntBlockTermState fields currently accessed directly would be replaced by getters/setters. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
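The second proposed change (an interface instead of a cast to a private class) could look roughly like the sketch below. None of these types are the real Lucene classes: BaseTermState stands in for the public BlockTermState, and PostingsPointers, CustomTermState and firstDocPointer are hypothetical names chosen for illustration. The point is only that the reader depends on accessors, so a custom postings format can supply its own term state.

```java
// Hypothetical stand-ins; not the Lucene50PostingsReader API.
final class TermStateInterfaceSketch {

  /** Stand-in for the public base class accepted by postings(). */
  static class BaseTermState {
    int docFreq;
  }

  /** Hypothetical interface exposing the file pointers through getters. */
  interface PostingsPointers {
    long docStartFP();
    long posStartFP();
    long payStartFP();
  }

  /** A custom format's own term state, usable by the reader because of the interface. */
  static final class CustomTermState extends BaseTermState implements PostingsPointers {
    private final long docStartFP;
    private final long posStartFP;
    private final long payStartFP;

    CustomTermState(long docStartFP, long posStartFP, long payStartFP) {
      this.docStartFP = docStartFP;
      this.posStartFP = posStartFP;
      this.payStartFP = payStartFP;
    }

    @Override public long docStartFP() { return docStartFP; }
    @Override public long posStartFP() { return posStartFP; }
    @Override public long payStartFP() { return payStartFP; }
  }

  /** Reader-side code: checks for the interface instead of casting to a private class. */
  static long firstDocPointer(BaseTermState state) {
    if (!(state instanceof PostingsPointers)) {
      throw new IllegalArgumentException("term state does not expose postings file pointers");
    }
    return ((PostingsPointers) state).docStartFP();
  }
}
```

The first proposed change (making the class and its fields public) would need no new type, at the cost of exposing mutable fields.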
[jira] [Commented] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState
[ https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881037#comment-16881037 ] Bruno Roustant commented on LUCENE-8906: This issue has been encountered in LUCENE-8753 (Uniform Split posting format). > Lucene50PostingsReader.postings() casts BlockTermState param to private > IntBlockTermState > - > > Key: LUCENE-8906 > URL: https://issues.apache.org/jira/browse/LUCENE-8906 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Bruno Roustant >Priority: Major > > Lucene50PostingsReader is the public API that offers the postings() method to > read the postings. Any PostingFormat can use it (as well as > Lucene50PostingsWriter) to read/write postings. > But the postings() method asks for a (public) BlockTermState param which is > internally cast to the private IntBlockTermState. This BlockTermState is > provided by Lucene50PostingsReader.newTermState(). > public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, > PostingsEnum reuse, int flags) > This actually makes impossible to a custom PostingFormat customizing the > Block file structure to use this postings() method by providing their > (Int)BlockTermState, because they cannot access the FP fields of the > IntBlockTermState returned by PostingsReaderBase.newTermState(). > Proposed change: > * Either make IntBlockTermState public, as well as its fields. > * Or replace it by an interface in the postings() method. In this case the > IntBlockTermState fields currently accessed directly would be replaced by > getter/setter. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881046#comment-16881046 ]

Bruno Roustant commented on LUCENE-8753:
----------------------------------------

I have created a related Jira issue, LUCENE-8906 (Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState), to help the PR review move forward.

If we find a solution for this issue, then the UniformSplit posting format will be fully isolated in a separate package in codecs, with no intrusion anywhere else.

The goal is to have it as an additional, optional posting format (not to replace BlockTree) for the following use-cases: customizable by extension, shared-terms extension available, low on-heap memory footprint, best efficiency when dealing with small to medium indexes.

> New PostingFormat - UniformSplit
> --------------------------------
>
>                 Key: LUCENE-8753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 objectives:
> - Clear design and simple code.
> - Easily extensible, for both the logic and the index format.
> - Light memory usage with a very compact FST.
> - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached pdf explains the technique visually and in more detail)
>
> The principle is to split the list of terms into blocks and use an FST to access the block, but not as a prefix trie, rather with a seek-floor pattern. For the selection of the blocks, there is a target average block size (number of terms), with an allowed delta variation (10%) to compare the terms and select the one with the minimal distinguishing prefix.
> There are also several optimizations inside the block to make it more compact and speed up the loading/scanning.
>
> The performance obtained is interesting with the luceneutil benchmark, comparing UniformSplit with BlockTree. Find it in the first comment and also attached for better formatting.
> Although the precise percentages vary between runs, there are three main points:
> - TermQuery and PhraseQuery are improved.
> - PrefixQuery and WildcardQuery are ok.
> - Fuzzy queries are clearly less performant, because BlockTree is so optimized for them.
>
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time is reduced by 20%. So this PostingsFormat scales to lots of docs, like BlockTree.
>
> This initial version passes all Lucene tests. Use "ant test -Dtests.codec=UniformSplitTesting" to test with this PostingsFormat.
>
> Subjectively, we think we have fulfilled our goal of code simplicity. And we have already exercised this PostingsFormat extensibility to create a different flavor for our own use-case.
>
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
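The seek-floor pattern mentioned in the description can be illustrated with a small sketch. This is not the UniformSplit implementation: a java.util.TreeMap stands in for the FST, blocks have a fixed size, and the block key is simply the first term of the block rather than a minimal distinguishing prefix. The class name SeekFloorSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a sorted map of "block key -> block" queried with floor semantics.
final class SeekFloorSketch {
  private final TreeMap<String, List<String>> blocks = new TreeMap<>();

  // sortedTerms must be sorted in natural String order.
  SeekFloorSketch(List<String> sortedTerms, int blockSize) {
    for (int i = 0; i < sortedTerms.size(); i += blockSize) {
      int end = Math.min(i + blockSize, sortedTerms.size());
      List<String> block = new ArrayList<>(sortedTerms.subList(i, end));
      blocks.put(block.get(0), block); // simplification: key = first term of the block
    }
  }

  boolean contains(String term) {
    // Seek-floor: find the block with the greatest key <= term, then search inside it.
    Map.Entry<String, List<String>> floor = blocks.floorEntry(term);
    if (floor == null) {
      return false; // term sorts before the first block
    }
    return Collections.binarySearch(floor.getValue(), term) >= 0;
  }
}
```

Only one entry per block is kept in the top-level structure, which is why the index over the terms can stay small compared to indexing every term.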
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906219#comment-16906219 ]

Bruno Roustant commented on LUCENE-8753:
----------------------------------------

New [PR 828|https://github.com/apache/lucene-solr/pull/828] to have this PostingsFormat inside codecs/uniformsplit with no code elsewhere.
I added package javadoc and the lucene.experimental annotation.

> New PostingFormat - UniformSplit
> --------------------------------
>
>                 Key: LUCENE-8753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923463#comment-16923463 ]

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

Based on some heuristics, Direct-Addressing can be the right choice, for example if num labels / (max label - min label) >= 75%.

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, the size increase we're seeing while building (or perhaps do a preliminary pass before building) in order to decide whether to apply the encoding.
> bq. we could also make the encoding a bit more efficient. For instance I noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) which makes gaps very costly. Associating each label with a dense id and having an intermediate lookup, i.e. lookup label -> id and then id->arc offset instead of doing label->arc directly could save a lot of space in some cases? Also it seems that we are repeating the label in the arc metadata when array-with-gaps is used, even though it shouldn't be necessary since the label is implicit from the address?
[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923428#comment-16923428 ]

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

[~sokolov] There may be another option to speed up FST arc lookup while limiting the memory increase.

The Direct-Addressing option looks up a label with a single direct access, and costs up to (num labels x 4 x num bytes to encode) bytes.
The Label-List option is the opposite: lookup needs on average N/2 label comparisons, and costs (num labels x var bytes to encode) bytes.

Another option is to use open-addressing. Lookup would take <= L comparisons, where we can fix L < log(N)/2 (to be faster than binary search), and would cost < (num labels x 2 x num bytes to encode) bytes.

The idea is to have an array of size 2^p such that 2^(p-1) < N < 2^p. We hash the labels and store them in the array using the open-addressing idea: if a slot is occupied, try the next one. If we can't store a label in fewer than L tries, then abort and fall back to the Label-List or Binary-Search option. At lookup we hash the input label and know that we need fewer than L tries to compare.

This is another speed/memory compromise: faster than binary search (constant L), with at least 2x less memory than Direct-Addressing.

It is also possible to combine open-addressing and variable-length encoding, by finding the first byte starting a label based on the bit used to encode the var-length additional bytes.

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
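The open-addressing idea from the comment above can be sketched as follows. This is an illustration under stated assumptions, not code from the FST implementation: labels are distinct non-negative ints, the hash function is an arbitrary multiplicative hash, probing is linear, and the array size is the smallest power of two strictly greater than N (which also covers the case where N is itself a power of two). The class name BoundedOpenAddressing is hypothetical.

```java
import java.util.Arrays;

// Illustration only: open addressing with a bounded number of probes (L = maxTries).
final class BoundedOpenAddressing {
  private static final int EMPTY = -1; // assumes labels are non-negative
  private final int[] slots;
  private final int maxTries;

  private BoundedOpenAddressing(int[] slots, int maxTries) {
    this.slots = slots;
    this.maxTries = maxTries;
  }

  // Returns null when some label cannot be stored within maxTries probes,
  // in which case the caller would fall back to a list or binary-search layout.
  static BoundedOpenAddressing tryBuild(int[] labels, int maxTries) {
    int size = Integer.highestOneBit(Math.max(1, labels.length)) << 1;
    int[] slots = new int[size];
    Arrays.fill(slots, EMPTY);
    for (int label : labels) {
      if (!insert(slots, label, maxTries)) {
        return null;
      }
    }
    return new BoundedOpenAddressing(slots, maxTries);
  }

  private static boolean insert(int[] slots, int label, int maxTries) {
    for (int t = 0; t < maxTries; t++) {
      int idx = (hash(label) + t) & (slots.length - 1);
      if (slots[idx] == EMPTY) {
        slots[idx] = label;
        return true;
      }
    }
    return false;
  }

  // At most maxTries comparisons per lookup.
  boolean contains(int label) {
    for (int t = 0; t < maxTries; t++) {
      int value = slots[(hash(label) + t) & (slots.length - 1)];
      if (value == label) {
        return true;
      }
      if (value == EMPTY) {
        return false; // nothing is ever removed, so an empty slot ends the probe sequence
      }
    }
    return false;
  }

  private static int hash(int label) {
    return label * 0x9E3779B9; // arbitrary multiplicative hash chosen for this sketch
  }
}
```

A real arc layout would store arc data (or offsets) next to each label; the sketch stores labels only, to show the bounded probe and the abort-and-fall-back behaviour.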
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922609#comment-16922609 ]

Bruno Roustant commented on LUCENE-8753:
----------------------------------------

Ok, I followed your advice to include the "shared terms" extension (subpackage) in the same PR #828. I'm going to close the two previous ones.

> New PostingFormat - UniformSplit
> --------------------------------
>
>                 Key: LUCENE-8753
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8753
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.0
>            Reporter: Bruno Roustant
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923428#comment-16923428 ]

Bruno Roustant edited comment on LUCENE-8920 at 9/5/19 8:46 PM:
----------------------------------------------------------------

[~sokolov] There may be another option to speed up FST arc lookup while limiting the memory increase.

The Direct-Addressing option looks up a label with a single direct access, and costs up to (num labels x 4 x num bytes to encode) bytes.
The Label-List option is the opposite: lookup needs on average N/2 label comparisons, and costs (num labels x var bytes to encode) bytes.

Another option is to use open-addressing. Lookup would take <= L comparisons, where we can fix L < log(N)/2 (to be faster than binary search), and would cost < (num labels x 2 x num bytes to encode) bytes.

The idea is to have an array of size 2^p such that 2^(p-1) < N < 2^p. We hash the labels and store them in the array using the open-addressing idea: if a slot is occupied, try the next one. If we can't store a label in fewer than L tries, then abort and fall back to the Label-List or Binary-Search option. At lookup we hash the input label and know that we need fewer than L tries to compare.

This is another speed/memory compromise: faster than binary search (constant L), with at least 2x less memory than Direct-Addressing.

On the Binary-Search side, it could be possible to support variable-length encoding, by finding the first byte starting a label based on the bit used to encode the var-length additional bytes.

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924178#comment-16924178 ]

Bruno Roustant commented on LUCENE-8920:
----------------------------------------

I'd love to work on that, but I'm pretty busy so I can't start immediately. If you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? Here in the thread or in an attached doc?

Some additional thoughts:
* A threshold T1 has to be found to decide when direct-addressing is best (N / (max label - min label) >= T1). E.g. with T1 = 50% the worst case is memory x2, right? (although there is the var-length encoding difference...). Did you try that, and what is the perf?
* A threshold T2 has to be found to decide if a list is better (N < T2) or if open-addressing is more appropriate.
* If N is close to 2^p, the probability that open-addressing aborts (can't store a label in less than L tries) is high. Do we double the array size (2^(p+1)) or can we take 1.5x2^p to save memory? (my intuition is the second, but it needs some testing about the load factor)
* I think the var-length List and fixed-length Binary-Search options could be merged to always have a var-length List that can be binary searched with low impact on perf. This is a piece of work in itself, but it can help reduce the FST memory and thus free some bytes for the faster options.

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
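The two thresholds discussed in this comment suggest a three-way decision, which the sketch below spells out. It is only an illustration: the comment says T1 and T2 are still "to find", so the values passed in are placeholders, and the class, enum and method names are hypothetical. With a density threshold T1, the direct-addressing array has at most N / T1 slots for N labels, which is where the "T1 = 50% means worst case x2" estimate comes from.

```java
// Illustration only: pick an arc layout from the label count and label range.
final class ArcEncodingChooser {
  enum Encoding { DIRECT_ADDRESSING, LABEL_LIST, OPEN_ADDRESSING }

  // t1: minimum density N / (maxLabel - minLabel) for direct addressing.
  // t2: below this number of labels, a plain list is assumed to be good enough.
  static Encoding choose(int numLabels, int minLabel, int maxLabel, double t1, int t2) {
    int range = maxLabel - minLabel;
    if (range > 0 && (double) numLabels / range >= t1) {
      return Encoding.DIRECT_ADDRESSING;
    }
    if (numLabels < t2) {
      return Encoding.LABEL_LIST;
    }
    return Encoding.OPEN_ADDRESSING;
  }
}
```

For example, 20 labels spread over the range 97..122 give a density of 20 / 25 = 0.8, so any T1 up to 80% selects direct addressing for that node.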
[jira] [Comment Edited] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924178#comment-16924178 ]

Bruno Roustant edited comment on LUCENE-8920 at 9/6/19 11:57 AM:
-----------------------------------------------------------------

I'd love to work on that, but I'm pretty busy so I can't start immediately. If you can start on it soon I'll be happy to help and review.

I'll try to think more about the subject. Where should I post my remarks/ideas? Here in the thread or in an attached doc?

Some additional thoughts:
* A threshold T1 has to be found to decide when direct-addressing is best (N / (max label - min label) >= T1). E.g. with T1 = 50% the worst case is memory x2, right? (although there is the var-length encoding difference...). Did you try that, and what is the perf?
* A threshold T2 has to be found to decide if a list is better (N < T2) or if open-addressing is more appropriate.
* If N is close to 2^p, the probability that open-addressing aborts (can't store a label in less than L tries) is high. Do we double the array size (2^(p+1)) or can we take 1.5x2^p to save memory? (my intuition is the second, but it needs some testing about the load factor)

> Reduce size of FSTs due to use of direct-addressing encoding
> -------------------------------------------------------------
>
>                 Key: LUCENE-8920
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8920
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Mike Sokolov
>            Priority: Blocker
>             Fix For: 8.3
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
[jira] [Commented] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq
[ https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1630#comment-1630 ]

Bruno Roustant commented on LUCENE-8921:
----------------------------------------

PR added

> IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8921
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 8.1
>            Reporter: Bruno Roustant
>            Priority: Major
>             Fix For: master (9.0)
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> IndexSearcher.termStatistics(Term term, TermStates context) is the way to create a TermStatistics. It requires a TermStates param although it only cares about the docFreq and totalTermFreq.
>
> For customizations that want to create TermStatistics based on docFreq and totalTermFreq, but that do not have a TermStates available, this method forces them to create a TermStates instance (which is not very lightweight) only to pass two ints.
> termStatistics could be modified to the following signature:
> termStatistics(Term term, int docFreq, int totalTermFreq)
> Since it would change the API, it could be done in master for the next major release.
[jira] [Commented] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState
[ https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1682#comment-1682 ]

Bruno Roustant commented on LUCENE-8906:
----------------------------------------

PR added

> Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8906
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8906
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Bruno Roustant
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
[jira] [Created] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq
Bruno Roustant created LUCENE-8921:
--------------------------------------

             Summary: IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq
                 Key: LUCENE-8921
                 URL: https://issues.apache.org/jira/browse/LUCENE-8921
             Project: Lucene - Core
          Issue Type: Improvement
          Components: core/search
    Affects Versions: 8.1
            Reporter: Bruno Roustant
             Fix For: master (9.0)


IndexSearcher.termStatistics(Term term, TermStates context) is the way to create a TermStatistics. It requires a TermStates param although it only cares about the docFreq and totalTermFreq.

For customizations that want to create TermStatistics based on docFreq and totalTermFreq, but that do not have a TermStates available, this method forces them to create a TermStates instance (which is not very lightweight) only to pass two ints.

termStatistics could be modified to the following signature:

termStatistics(Term term, int docFreq, int totalTermFreq)

Since it would change the API, it could be done in master for the next major release.
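As a rough illustration of the proposed signature, the sketch below shows a caller building statistics from the two numbers alone, with no intermediate state object. The classes are hypothetical, simplified stand-ins (the term is a plain String) and are not the real Lucene IndexSearcher or TermStatistics; the sanity check is an assumption added for this sketch.

```java
// Hypothetical, simplified stand-ins; these are NOT the real Lucene classes.
final class TermStatisticsSketch {
  final String term;
  final int docFreq;       // number of documents containing the term
  final int totalTermFreq; // total number of occurrences of the term

  TermStatisticsSketch(String term, int docFreq, int totalTermFreq) {
    this.term = term;
    this.docFreq = docFreq;
    this.totalTermFreq = totalTermFreq;
  }
}

final class IndexSearcherSketch {
  // Shape proposed in the issue: the caller passes the two numbers directly.
  static TermStatisticsSketch termStatistics(String term, int docFreq, int totalTermFreq) {
    // Check added for this sketch only: every matching document
    // contributes at least one occurrence of the term.
    if (totalTermFreq < docFreq) {
      throw new IllegalArgumentException("totalTermFreq must be >= docFreq");
    }
    return new TermStatisticsSketch(term, docFreq, totalTermFreq);
  }
}
```

A customization that already knows both frequencies (for instance from its own index structure) could then call termStatistics("lucene", 3, 7) directly.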
[jira] [Commented] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq
[ https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886889#comment-16886889 ]

Bruno Roustant commented on LUCENE-8921:
----------------------------------------

Yes, sure. I could work on a PR for 8.2.

> IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8921
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8921
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: 8.1
>            Reporter: Bruno Roustant
>            Priority: Major
>             Fix For: master (9.0)