[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16925077#comment-16925077 ] ASF subversion and git services commented on LUCENE-8753: - Commit b963b7c3dbecda86c2917ad341caee63b93815ac in lucene-solr's branch refs/heads/jira/SOLR-13677_3 from David Smiley [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b963b7c ] LUCENE-8753: New UniformSplit and SharedTermsUniformSplit PostingsFormats > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Fix For: 8.3 > > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 5h 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924365#comment-16924365 ] ASF subversion and git services commented on LUCENE-8753: - Commit b8a1857b0bd235bc9d4833276b1e60c9865aa04b in lucene-solr's branch refs/heads/branch_8x from David Smiley [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b8a1857 ] LUCENE-8753: New UniformSplit and SharedTermsUniformSplit PostingsFormats (cherry picked from commit b963b7c3dbecda86c2917ad341caee63b93815ac) > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 4h 50m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924352#comment-16924352 ] ASF subversion and git services commented on LUCENE-8753: - Commit b963b7c3dbecda86c2917ad341caee63b93815ac in lucene-solr's branch refs/heads/master from David Smiley [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b963b7c ] LUCENE-8753: New UniformSplit and SharedTermsUniformSplit PostingsFormats > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 4h 50m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922609#comment-16922609 ] Bruno Roustant commented on LUCENE-8753: Ok, I followed your advice to include the "shared terms" extension (subpackage) in the same PR #828. I'm going to close the two previous ones. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 4h 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920296#comment-16920296 ] David Smiley commented on LUCENE-8753: -- Oh; I was just looking at PR #828, by the way. As I write this, it does not include the "shared terms" in #676. Please incorporate that into #828; okay? One PR to review/commit. The other two can then be closed as obsoleted. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 4h 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920294#comment-16920294 ] David Smiley commented on LUCENE-8753: -- Bruno: Why a new PR? Please don't do that next time unless the new PR is quite different in approach. I know you needed to update the PR to consider changes in Lucene but that doesn't require a new PR; just rebase it and force-push. I encourage others to take a look if they wish. If there aren't further issues to address, I'd like to commit this later next week (e.g. maybe Sept 8th). > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 4h 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906219#comment-16906219 ] Bruno Roustant commented on LUCENE-8753: New [PR 828|https://github.com/apache/lucene-solr/pull/828] to have this PostingsFormat inside codecs/uniformsplit with no code elsewhere. I added package javadoc and lucene.experimental annotation. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 3h 40m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881120#comment-16881120 ] Michael McCandless commented on LUCENE-8753: +1 to push this new codec in Lucene, e.g. under codecs or sandbox or misc modules, if we can avoid making changes to other sources (once LUCENE-8906 is fixed). > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 3h 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881046#comment-16881046 ] Bruno Roustant commented on LUCENE-8753: I have created a related Jira issue LUCENE-8906 (Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState) to make the PR review advance. If we find a solution for this issue, then UniformSplit posting format will be fully isolated in a separate package in codecs, with no intrusion anymore elsewhere. The goal is to have it as an additional optional posting format (not to replace BlockTree) for the following use-cases: customizable by extension, shared-terms extension available, low memory on-heap footprint, best efficiency when dealing with small to medium indexes. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 3h 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853986#comment-16853986 ] Robert Muir commented on LUCENE-8753: - It seems by your question that you already know that what you did there isn't committable. Please use Terms.intersect api for codec customization of this stuff and don't add such stuff to AutomatonTermsEnum. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853870#comment-16853870 ] David Smiley commented on LUCENE-8753: -- Are there any concerns of the Lucene project accepting this new PostingsFormat? Two PostingsFormats now. See my last comment please (RE AutomatonTermsEnum and DeltaBaseTermStateSerializer); I'm curious what others think. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839437#comment-16839437 ] Yonik Seeley commented on LUCENE-8753: -- Thanks Bruno, awesome stuff! A single FST for multiple fields is an important optimization. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839429#comment-16839429 ] Bruno Roustant commented on LUCENE-8753: Beyond the performance aspects, we developed UniformSplit to be extensible. To give an idea of how it can be extended, I have added a new PR#676: SharedTerms UniformSplit. The use-case is when there are many fields. We want to take advantage of the FST property to share the terms between all the fields, by replacing one FST per field by a single FST containing the shared terms. In this case each term is stored only once in the block file, and its block line contains the TermState for each different field for which the term occurs. term A -> field1 TermState, field2 TermState, field3 TermState term B -> field3 TermState, field5 TermState The FST is compact and this posting format also unlocks the possibility to cache when the same term is searched in many fields (but this is not part of this PR). My goal here is to showcase the extensibility of this posting format. This extension is in a separate sub-package sharedterms and is quite concise. (the only tricky part is the custom merge to merge efficiently two segments by accessing directly the sharedterms posting format) > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 20m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825225#comment-16825225 ] juan camilo rodriguez duran commented on LUCENE-8753: - [~rcmuir] as [~jpountz] said the last benchmark does not show the benefits of Uniform Split as most of the query time is spent most of the time processing the postings. Just as a recap Uniform Split shines for its simplicity and extensibility with addition of lower memory consumption and faster segment merge. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825028#comment-16825028 ] Robert Muir commented on LUCENE-8753: - Why are we looking at committing this when the most recent benchmark is iffy: most searches are the same or slower? > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824995#comment-16824995 ] David Smiley commented on LUCENE-8753: -- Are there any concerns for committing this patch? There are two things I want to draw attention to in the patch: * AutomatonTermsEnum: some small edits here allowed ATE's core algorithm to be used by UniformSplit custom intersect(). See IntersectBlockReader at the very bottom – an inner class AutomatonNextTermCalculator. Admittedly this is my doing and a little hacky; I err'ed on the side of smallest change vs a more comprehensive refactoring. Now that we're contributing U.S., it's worth shining a light on this. Perhaps it's innocent enough and fairly internal that we don't care (my opinion). * DeltaBaseTermStateSerializer: This new class is placed in the package org.apache.lucene.codecs.lucene50 because it needs to access Lucene50PostingsFormat.IntBlockTermState. What do we think of this? Perhaps IntBlockTermState could be made public and then DBTSS could move to a U.S.'s package? Or... some other refactoring. Otherwise everything is self-contained in its own package. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813171#comment-16813171 ] Bruno Roustant commented on LUCENE-8753: I agree. We profile wikimediumall and we saw that 90% of the time is spent in the scoring, and less than a couple of percent is spent to access the dictionary blocks. Our own use-case is to have multiple small-to-medium cores, the size of wikimedium500k, that's why we studied it more. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813164#comment-16813164 ] Adrien Grand commented on LUCENE-8753: -- bq. BlockTree and UniformSplit had the same QPS for Term and Phrase queries. I didn't understand why a different behavior between a small and a large index. I think this is expected. Query processing needs to look up the term in the terms dict and then process documents that contain this term. When the index gets larger, postings usually grow more quickly than the terms dictionary, so processing postings takes more time relatively compared to looking up the term in the terms dictionary. Term dictionary lookup performance only really matters for queries that have few matches (which you somehow simulated by running the benchmark on wikimedium500k) and updates, which are simulated by the PKLookup task. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813106#comment-16813106 ] Bruno Roustant commented on LUCENE-8753: It took me some time to run wikimedimall 8 GB index (didn't anticipate 1h indexing initially - a little less for UniformSplit, then I had an exception about facets). Then I got results which surprised me. BlockTree and UniformSplit had the same QPS for Term and Phrase queries. I didn't understand why a different behavior between a small and a large index. Then I thought about 2 explanations: * Much larger index could mean less OS IO cache hits. I ran the benchmark with a 16 GB laptop and a 64 GB desktop. Actually I got nearly no difference in my test. * Much larger index could mean more results. So the time spent to score and rank the results could become much larger and diminish the effect of a change in the dictionary. I have no clue there at the moment. Here is the result of wikimedimall on a 64 GB desktop: ||Task||QPS BT||StdDev BT||QPS CUS||StdDev CUS||Pct diff |Fuzzy1|72.81|3.11|21.77|0.71|\{color:red}72%\{color}-\{color:red}67%\{color}| |Fuzzy2|66.77|3.77|20.41|0.67|\{color:red}72%\{color}-\{color:red}66%\{color}| |Respell|8.85|0.64|6.02|0.33|\{color:red}40%\{color}-\{color:red}22%\{color}| |PKLookup|130.83|3.96|121.66|12.37|\{color:red}18%\{color}-\{color:green}5%\{color}| |Wildcard|25.03|1.33|23.93|1.19|\{color:red}13%\{color}-\{color:green}6%\{color}| |HighTermMonthSort|19.03|2.55|18.40|1.56|\{color:red}21%\{color}-\{color:green}21%\{color}| |Prefix3|12.47|0.82|12.10|0.78|\{color:red}14%\{color}-\{color:green}10%\{color}| |LowTerm|182.95|14.94|177.97|18.67|\{color:red}19%\{color}-\{color:green}17%\{color}| |IntNRQ|5.21|0.54|5.09|0.56|\{color:red}21%\{color}-\{color:green}21%\{color}| |MedTerm|90.74|3.99|89.14|4.24|\{color:red}10%\{color}-\{color:green}7%\{color}| |HighTerm|42.54|1.95|41.86|2.00|\{color:red}10%\{color}-\{color:green}8%\{color}| |OrNotHighLow|532.96|16.16|526.86|24.40|\{color:red}8%\{color}-\{color:green}6%\{color}| |HighSloppyPhrase|12.00|0.39|11.90|0.48|\{color:red}7%\{color}-\{color:green}6%\{color}| |OrNotHighMed|53.64|1.08|53.37|1.22|\{color:red}4%\{color}-\{color:green}3%\{color}| |MedSloppyPhrase|31.83|0.59|31.67|0.78|\{color:red}4%\{color}-\{color:green}3%\{color}| |HighPhrase|32.24|0.85|32.09|0.81|\{color:red}5%\{color}-\{color:green}4%\{color}| |LowSloppyPhrase|29.51|0.43|29.40|0.58|\{color:red}3%\{color}-\{color:green}3%\{color}| |AndHighHigh|26.97|0.31|26.88|0.37|\{color:red}2%\{color}-\{color:green}2%\{color}| |MedPhrase|4.95|0.16|4.94|0.15|\{color:red}6%\{color}-\{color:green}6%\{color}| |AndHighMed|50.03|0.72|49.97|0.72|\{color:red}2%\{color}-\{color:green}2%\{color}| |OrNotHighHigh|18.85|0.76|18.85|0.82|\{color:red}8%\{color}-\{color:green}8%\{color}| |OrHighNotHigh|9.35|0.32|9.35|0.35|\{color:red}6%\{color}-\{color:green}7%\{color}| |OrHighLow|15.85|0.59|15.85|0.52|\{color:red}6%\{color}-\{color:green}7%\{color}| |OrHighNotLow|17.56|0.71|17.57|0.70|\{color:red}7%\{color}-\{color:green}8%\{color}| |AndHighLow|284.39|4.41|284.60|5.65|\{color:red}3%\{color}-\{color:green}3%\{color}| |LowPhrase|224.73|4.35|224.97|4.84|\{color:red}3%\{color}-\{color:green}4%\{color}| |OrHighNotMed|13.21|0.49|13.22|0.50|\{color:red}7%\{color}-\{color:green}7%\{color}| |OrHighMed|13.22|0.73|13.30|0.70|\{color:red}9%\{color}-\{color:green}12%\{color}| |OrHighHigh|7.56|0.43|7.62|0.41|\{color:red}9%\{color}-\{color:green}12%\{color}| |BrowseMonthTaxoFacets|7.96|1.92|8.06|1.78|\{color:red}36%\{color}-\{color:green}63%\{color}| |LowSpanNear|11.84|0.19|11.99|0.21|\{color:red}2%\{color}-\{color:green}4%\{color}| |HighTermDayOfYearSort|20.05|1.40|20.31|2.15|\{color:red}15%\{color}-\{color:green}20%\{color}| |BrowseDayOfYearTaxoFacets|7.96|1.91|8.07|1.85|\{color:red}37%\{color}-\{color:green}64%\{color}| |BrowseMonthSSDVFacets|7.95|1.90|8.07|1.87|\{color:red}37%\{color}-\{color:green}64%\{color}| |BrowseDayOfYearSSDVFacets|7.96|1.93|8.08|1.84|\{color:red}36%\{color}-\{color:green}64%\{color}| |MedSpanNear|10.50|0.18|10.67|0.21|\{color:red}2%\{color}-\{color:green}5%\{color}| |BrowseDateTaxoFacets|7.91|1.81|8.07|1.83|\{color:red}35%\{color}-\{color:green}62%\{color}| |HighSpanNear|8.68|0.19|8.88|0.19|\{color:red}2%\{color}-\{color:green}6%\{color}| > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 >
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809251#comment-16809251 ] Bruno Roustant commented on LUCENE-8753: {quote}I think this is similar to […] BlockTermsReader/Writer {quote} Indeed similar; it mainly differs from VariableGapTermsIndexWriter in the way it selects the best term to start a block. It is based on the minimal distinguishing prefix. The idea is to make the terms index FST more compact. That way, given a target max heap memory, we can have potentially more blocks, so smaller ones that are scanned faster. This requirement to consume less heap was strong with lucene 7.1, now maybe less with the recent off-heap FST. {quote}Are you also doing something different to encode/decode postings? {quote} No, the postings are written with the regular PostingsWriterBase. {quote}Can you post results on the full wikimediumall? {quote} Good point. Will do tomorrow. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809218#comment-16809218 ] Michael McCandless commented on LUCENE-8753: I think this is similar to the terms dictionary format Lucene used to have before {{BlockTree}}, still in Lucene's sources as {{BlockTermsReader/Writer}}. Terms are assigned to fixed sized blocks and only the minimum unique prefix needs to be enrolled in the terms index FST. But being able to do binary search within a block is unique! That's very cool. It's curious you see gains e.g. for {{AndHighLow}} – are you also doing something different to encode/decode postings (not just terms dictionary)? The 500K docs is a little small – can you post results on the full {{wikimediumall}} set of documents? > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809188#comment-16809188 ] Michael McCandless commented on LUCENE-8753: {quote}I think PKLookup should be disregarded until it's fixed: [https://github.com/mikemccand/luceneutil/issues/35] (feel free to comment there if people have opinions) {quote} Note that the title on that issue was misleading (backwards from the reality) – I just corrected it. I don't think we should disregard {{PKLookup}} results: it's reporting the performance when looking up actual IDs that do exist in the index. That is an interesting result, but it is odd you are seeing varying/inconsistent results. Note that if you add {{-jira}} into the luceneutil benchmark command-line it will print results using the markup that Jira displays as a table, making it easier for everyone to read. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809124#comment-16809124 ] Mike Sokolov commented on LUCENE-8753: -- The behavior I'm referring to isn't a problem with the benchmark _per se_, it's an intrinsic feature of trying to optimize blocks of things that work better when you have a bunch of similar things right next to each other sharing a common prefix. The PKLookup test is very sensitive to such optimizations (or failures to optimize) because it indexes a consecutive block of terms: a00, a01, a02, etc. and at the same time, this nice consecutive property can be disturbed by index fragmentation, especially when only each term is indexed for only a single document. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808911#comment-16808911 ] David Smiley commented on LUCENE-8753: -- I think PKLookup should be disregarded until it's fixed: [https://github.com/mikemccand/luceneutil/issues/35] (feel free to comment there if people have opinions) > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808860#comment-16808860 ] Mike Sokolov commented on LUCENE-8753: -- I've been working on some other FST-related changes, and running these benchmarks a bunch. PKLookup performance varies wildly (for me) depending on just how the terms are distributed among the segments. The PK's are a sequence (with no gaps) of base36 identifiers, and I see different performance characteristics in small index vs larger (wikimediumall). In general force merge gets me consistent results, but then may not be predictive of reality... > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808852#comment-16808852 ] Bruno Roustant commented on LUCENE-8753: {quote}Is it due to the fact that it doesn't have the ability to fail lookups early like BlockTree? {quote} This is one cause. While BlockTree builds a kind of prefix-trie and may stop if the prefix is not matched, UniformSplit doesn't, so it loads a block. That said I remarked that PKLookup performance varies a lot. It is sometimes in favor of UniformSplit. Actually I don't know how the benchmark generates the test set. It clearly has an influence on the metric. > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808813#comment-16808813 ] Adrien Grand commented on LUCENE-8753: -- I haven't read the pdf or the patch yet. The LowTerm speedup makes me think that term lookups are faster, but then I'm surprised PKLookup is slower. Is it due to the fact that it doesn't have the ability to fail lookups early like BlockTree? > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808799#comment-16808799 ] Bruno Roustant commented on LUCENE-8753: Here's the Luceneutil benchmark with the wikimedium500k data set using Java 8. This is a bit dated using Lucene 7.1; it'd be nice to update to master. Report after iter 19: TaskQPS blocktree StdDevQPS uniformsplit StdDev Pct diff Fuzzy1 508.47 (3.8%) 221.37 (0.9%) {color:#59afe1}-56.5%{color} ( -58% - -53%) Fuzzy2 171.73 (6.4%) 80.62 (1.4%) {color:#59afe1}-53.1%{color} ( -57% - -48%) PKLookup 182.47 (2.4%) 149.62 (2.5%) {color:#59afe1}-18.0%{color} ( -22% - -13%) Wildcard 1788.74 (5.9%) 1729.37 (4.5%) {color:#59afe1}-3.3%{color} ( -12% - 7%) IntNRQ 1561.48 (2.1%) 1564.33 (1.9%) {color:#59afe1}0.2%{color} ( -3% - 4%) Prefix3 1759.69 (5.0%) 1829.74 (4.8%) {color:#59afe1}4.0%{color} ( -5% - 14%) HighTermDayOfYearSort 586.06 (5.4%) 622.34 (8.2%) {color:#59afe1}6.2%{color} ( -6% - 20%) MedPhrase 1204.85 (5.5%) 1282.89 (7.7%) {color:#59afe1}6.5%{color} ( -6% - 20%) HighSpanNear 590.88 (4.1%) 629.64 (6.1%) {color:#59afe1}6.6%{color} ( -3% - 17%) OrHighMed 1101.48 (4.5%) 1220.75 (6.2%) {color:#59afe1}10.8%{color} ( 0% - 22%) HighTermMonthSort 2617.10 (2.6%) 2916.34 (4.6%) {color:#59afe1}11.4%{color} ( 4% - 19%) HighPhrase 961.04 (5.5%) 1073.62 (6.0%) {color:#59afe1}11.7%{color} ( 0% - 24%) MedSloppyPhrase 604.56 (13.3%) 680.31 (13.7%) {color:#59afe1}12.5%{color} ( -12% - 45%) LowSloppyPhrase 954.87 (8.1%) 1075.67 (5.4%) {color:#59afe1}12.7%{color} ( 0% - 28%) MedSpanNear 737.14 (5.8%) 830.68 (8.3%) {color:#59afe1}12.7%{color} ( -1% - 28%) OrHighHigh 811.57 (5.7%) 915.01 (6.2%) {color:#59afe1}12.7%{color} ( 0% - 26%) AndHighMed 1157.45 (5.3%) 1317.78 (5.1%) {color:#59afe1}13.9%{color} ( 3% - 25%) AndHighHigh 1095.29 (5.7%) 1254.16 (4.9%) {color:#59afe1}14.5%{color} ( 3% - 26%) HighSloppyPhrase 880.42 (8.2%) 1009.72 (7.0%) {color:#59afe1}14.7%{color} ( 0% - 32%) LowPhrase 1245.33 (6.0%) 1473.57 (4.4%) {color:#59afe1}18.3%{color} ( 7% - 30%) Respell 81.10 (12.7%) 99.43 (10.3%) {color:#59afe1}22.6%{color} ( 0% - 52%) HighTerm 3733.81 (6.1%) 4599.96 (6.8%) {color:#59afe1}23.2%{color} ( 9% - 38%) OrHighLow 1960.13 (6.2%) 2415.81 (6.0%) {color:#59afe1}23.2%{color} ( 10% - 37%) MedTerm 4411.60 (4.9%) 5450.56 (5.8%) {color:#59afe1}23.6%{color} ( 12% - 35%) LowSpanNear 1944.27 (5.3%) 2416.29 (4.5%) {color:#59afe1}24.3%{color} ( 13% - 36%) AndHighLow 1978.10 (7.6%) 2500.74 (5.8%) {color:#59afe1}26.4%{color} ( 12% - 43%) LowTerm 4949.24 (4.8%) 6589.86 (5.3%) {color:#59afe1}33.1%{color} ( 22% - 45%) > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Priority: Major > Attachments: Uniform Split Technique.pdf > > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more compact > and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment. > > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. > Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree. > > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat extensibility to create a > different flavor for our own use-case. > > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David