Re: additional term meta data
Hi Martin:

I don't think my PR has this test.

Thanks
-John

On Fri, Jan 22, 2021 at 7:01 AM Martin Gainty wrote:
> close to finishing testing, but I need help finding this testcase:
> RamUsageTester
>
> any ideas?
>
> Thanks John!
> martin
Re: additional term meta data
close to finishing testing, but I need help finding this testcase:
RamUsageTester

any ideas?

Thanks John!
martin

From: Martin Gainty
Sent: Wednesday, January 6, 2021 6:28 AM
To: dev@lucene.apache.org
Subject: Re: additional term meta data

how to access first and last?
which version will you be merging
Re: additional term meta data
Hi Simon:

This might be specific to us; it makes sense not to make such core changes if they are not needed.

Here is our use case anyway:

We first sort the index in time order, so doc ids can be used as a proxy for time. In the VoIP world, we are using Lucene to stitch call flows, which is similar to the APM/tracing use case. To get the range of a transaction efficiently, the first and last doc ids help because we do not need to traverse the postings list.

It would be ideal for us not to have to modify Lucene, and it would be great to understand how getting the AttributeSource helps in this case. Let me spend some time learning about it.

Thank you for the suggestion!

-John

On Fri, Jan 8, 2021 at 11:19 PM Simon Willnauer wrote:
> John, can you explain what the usecase for such a new API is? I don't
> see a user of the API in your code. Is there a query you can optimize
> with this or what is the reasoning behind this change? I personally
> think it's quite invasive to add this information and there must be a
> good reason to add this to the TermsEnum? I also don't think we should
> have an option on the field for this if we add it but if we don't do
> that it's quite a heavy change so I am on the fence if we should even
> consider this?
> I wonder if you can use the TermsEnum#getAttributeSource() API instead
> and add this as a dedicated attribute which is present if the info is
> stored. That way you can build your own PostingsFormat that does store
> this information?
>
> simon
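The use case described in this message can be pictured with a small sketch. It is only a toy model under stated assumptions: documents are indexed in time order, so a larger doc id means a later event, and each call id is a term. `CallFlowRangeSketch`, `DocRange`, `index`, and `rangeOf` are hypothetical names for illustration; none of them come from Lucene or from the PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: none of these types are Lucene classes. */
final class CallFlowRangeSketch {

    /** Hypothetical first/last doc id pair kept per term (here: per call id). */
    static final class DocRange {
        final int first;
        final int last;

        DocRange(int first, int last) {
            this.first = first;
            this.last = last;
        }

        /** True if the doc id falls inside the transaction's doc id span. */
        boolean contains(int docId) {
            return docId >= first && docId <= last;
        }
    }

    private final Map<String, DocRange> metaByCallId = new HashMap<>();

    /** Assumes doc ids arrive in increasing order, i.e. the index is time-sorted. */
    void index(String callId, int docId) {
        metaByCallId.merge(
            callId,
            new DocRange(docId, docId),
            (old, cur) -> new DocRange(old.first, cur.last)); // keep first, advance last
    }

    /** Constant-time lookup of the transaction span, with no postings traversal. */
    DocRange rangeOf(String callId) {
        return metaByCallId.get(callId);
    }

    public static void main(String[] args) {
        CallFlowRangeSketch sketch = new CallFlowRangeSketch();
        sketch.index("call-a", 0);
        sketch.index("call-b", 1);
        sketch.index("call-a", 5);
        sketch.index("call-b", 7);
        sketch.index("call-a", 9);

        DocRange range = sketch.rangeOf("call-a");
        System.out.println("call-a spans docs " + range.first + ".." + range.last);
    }
}
```

Because doc ids stand in for time, the span returned by `rangeOf` bounds the whole transaction, and other lookups can be restricted to that window.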
Re: additional term meta data
John, can you explain what the use case for such a new API is? I don't see a user of the API in your code. Is there a query you can optimize with this, or what is the reasoning behind this change?

I personally think it's quite invasive to add this information, and there must be a good reason to add it to the TermsEnum. I also don't think we should have an option on the field for this if we add it, but if we don't do that it's quite a heavy change, so I am on the fence about whether we should even consider it.

I wonder if you can use the TermsEnum#getAttributeSource() API instead and add this as a dedicated attribute which is present if the info is stored. That way you can build your own PostingsFormat that does store this information.

simon

On Wed, Jan 6, 2021 at 8:06 PM John Wang wrote:
>
> Thank you, Martin!
>
> You can apply the patch to the 8.7 build by just ignoring the changes to
> Lucene90xxx. Appreciate the help and guidance!
>
> -John
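The "dedicated attribute which is present if the info is stored" idea can be sketched as follows. This is a toy stand-in, not Lucene's AttributeSource: `AttributeSketch`, `Attributes`, `FirstLastDocAttribute`, `withFirstLast`, and `describe` are hypothetical names, and how a real PostingsFormat would populate such an attribute is left open here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model only: a stand-in for the attribute idea, not Lucene's AttributeSource. */
final class AttributeSketch {

    /** Hypothetical attribute carrying the extra per-term values. */
    interface FirstLastDocAttribute {
        int firstDoc();
        int lastDoc();
    }

    /** Minimal type-keyed attribute container. */
    static final class Attributes {
        private final Map<Class<?>, Object> byType = new HashMap<>();

        <T> void put(Class<T> type, T value) {
            byType.put(type, value);
        }

        /** Empty when no attribute of this type was attached. */
        <T> Optional<T> get(Class<T> type) {
            return Optional.ofNullable(type.cast(byType.get(type)));
        }
    }

    /** What a custom format could attach for a term when it stored the values. */
    static Attributes withFirstLast(int first, int last) {
        Attributes attrs = new Attributes();
        attrs.put(FirstLastDocAttribute.class, new FirstLastDocAttribute() {
            @Override public int firstDoc() { return first; }
            @Override public int lastDoc() { return last; }
        });
        return attrs;
    }

    /** Consumer side: use the attribute if present, otherwise fall back. */
    static String describe(Attributes attrs) {
        return attrs.get(FirstLastDocAttribute.class)
            .map(a -> "range=[" + a.firstDoc() + "," + a.lastDoc() + "]")
            .orElse("not stored; scan the postings instead");
    }

    public static void main(String[] args) {
        System.out.println(describe(withFirstLast(3, 108))); // format that stores the info
        System.out.println(describe(new Attributes()));      // default format, nothing attached
    }
}
```

The point of the design is that callers probe for the attribute and degrade gracefully, so the core enumeration API does not need a new method and formats that do not store the values are unaffected.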
Re: additional term meta data
Thank you, Martin!

You can apply the patch to the 8.7 build by just ignoring the changes to Lucene90xxx. Appreciate the help and guidance!

-John

On Wed, Jan 6, 2021 at 10:36 AM Martin Gainty wrote:
> appears you are targeting 9.0 for your code
> lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90FieldInfosFormat.java
> <https://github.com/dashbase/lucene-solr/pull/1/files#diff-224246aa19a54dd91fc495a6bbf7d75b26dbeaa3aceab058214d68fcbb38d24c>
> (Lucene90FieldInfosFormat.java is not contained in either 8.4 or 8.7
> distros)
>
> someone had the bright idea to nuke ant 8.x build.xml without consulting
> anyone
> not a fan of ant but the execution model of gradle is woefully inflexible
> in comparison to maven
>
> i will try with 90 distro to get the
> codecs/lucene90/Lucene90FieldInfosFormat and recompile and hopefully your
> TestLucene84PostingsFormat will run w/o fail or error
>
> Thx
> martin-
Re: additional term meta data
appears you are targeting 9.0 for your code
lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90FieldInfosFormat.java
<https://github.com/dashbase/lucene-solr/pull/1/files#diff-224246aa19a54dd91fc495a6bbf7d75b26dbeaa3aceab058214d68fcbb38d24c>
(Lucene90FieldInfosFormat.java is not contained in either 8.4 or 8.7 distros)

someone had the bright idea to nuke the ant 8.x build.xml without consulting anyone
not a fan of ant, but the execution model of gradle is woefully inflexible in comparison to maven

i will try with the 90 distro to get codecs/lucene90/Lucene90FieldInfosFormat and recompile, and hopefully your TestLucene84PostingsFormat will run w/o fail or error

Thx
martin-

From: John Wang
Sent: Wednesday, January 6, 2021 10:15 AM
To: dev@lucene.apache.org
Subject: Re: additional term meta data

Hey Martin:

There is a test case in the PR we created on our own fork: https://github.com/dashbase/lucene-solr/pull/1, which also contains some example code on how to access in the PR description.

Here is the link to the beginning of the tests: https://github.com/dashbase/lucene-solr/blob/posting-last-docid/lucene/core/src/test/org/apache/lucene/codecs/lucene84/TestLucene84PostingsFormat.java#L142

I am not sure which version this should be applied to, currently, it was based on master as of a few days ago. We intend to patch 8.7 for our own environment.

Any advice or feedback is much appreciated.

Thank you!

-John
Re: additional term meta data
Hey Martin:

There is a test case in the PR we created on our own fork: https://github.com/dashbase/lucene-solr/pull/1. The PR description also contains some example code showing how to access the first and last doc ids.

Here is the link to the beginning of the tests: https://github.com/dashbase/lucene-solr/blob/posting-last-docid/lucene/core/src/test/org/apache/lucene/codecs/lucene84/TestLucene84PostingsFormat.java#L142

I am not sure which version this should be applied to; currently, it is based on master as of a few days ago. We intend to patch 8.7 for our own environment.

Any advice or feedback is much appreciated.

Thank you!

-John

On Wed, Jan 6, 2021 at 3:28 AM Martin Gainty wrote:
> how to access first and last?
> which version will you be merging
Re: additional term meta data
how to access first and last?
which version will you be merging

From: John Wang
Sent: Tuesday, January 5, 2021 8:19 PM
To: dev@lucene.apache.org
Subject: additional term meta data

Hi folks:

We would like to propose a feature to add additional per-term metadata to the term dictionary.

Currently, the TermsEnum API returns docFreq as its only metadata. We needed a way to quickly get the first and last doc id in the postings without having to scan through the entire postings list.

We have created a PR on our own fork and we would like to contribute this back to the community. Please let us know if this is something that's useful and/or fits Lucene's roadmap; we would be happy to submit a patch.

https://github.com/dashbase/lucene-solr/pull/1

Thank you

-John
Re: additional term meta data
how to access first and last doc-id?
which lucene version will you be targeting for your merge?

Request: please submit a testcase to show proper operation

Thanks John!
martin-

From: John Wang
Sent: Tuesday, January 5, 2021 8:19 PM
To: dev@lucene.apache.org
Subject: additional term meta data

Hi folks:

We would like to propose a feature to add additional per-term metadata to the term dictionary.

Currently, the TermsEnum API returns docFreq as its only metadata. We needed a way to quickly get the first and last doc id in the postings without having to scan through the entire postings list.

We have created a PR on our own fork and we would like to contribute this back to the community. Please let us know if this is something that's useful and/or fits Lucene's roadmap; we would be happy to submit a patch.

https://github.com/dashbase/lucene-solr/pull/1

Thank you

-John
additional term meta data
Hi folks:

We would like to propose a feature to add additional per-term metadata to the term dictionary.

Currently, the TermsEnum API returns docFreq as its only metadata. We needed a way to quickly get the first and last doc id in the postings without having to scan through the entire postings list.

We have created a PR on our own fork and we would like to contribute this back to the community. Please let us know if this is something that's useful and/or fits Lucene's roadmap; we would be happy to submit a patch.

https://github.com/dashbase/lucene-solr/pull/1

Thank you

-John
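To make the motivation concrete, here is a minimal sketch contrasting a full scan with stored per-term metadata. It is a toy model, not the code in the PR: it assumes doc ids are stored as gaps from the previous doc id (a common inverted-index encoding), and `TermMetaSketch`, `TermMeta`, `scanFirstLast`, and `buildMeta` are hypothetical names that do not exist in Lucene.

```java
import java.util.Arrays;

/** Toy model only: none of these types are Lucene classes. */
final class TermMetaSketch {

    /** Hypothetical per-term metadata: docFreq plus the proposed first/last doc ids. */
    static final class TermMeta {
        final int docFreq;
        final int firstDoc;
        final int lastDoc;

        TermMeta(int docFreq, int firstDoc, int lastDoc) {
            this.docFreq = docFreq;
            this.firstDoc = firstDoc;
            this.lastDoc = lastDoc;
        }
    }

    /** Baseline: decode every gap to learn the first and last doc id, O(docFreq) per term. */
    static int[] scanFirstLast(int[] gaps) {
        int first = -1;
        int doc = 0;
        for (int i = 0; i < gaps.length; i++) {
            doc += gaps[i];              // each entry is the distance from the previous doc id
            if (i == 0) {
                first = doc;
            }
        }
        int last = gaps.length == 0 ? -1 : doc;
        return new int[] {first, last};
    }

    /** Proposed idea: record the values once at write time and keep them next to docFreq. */
    static TermMeta buildMeta(int[] gaps) {
        int[] firstLast = scanFirstLast(gaps);
        return new TermMeta(gaps.length, firstLast[0], firstLast[1]);
    }

    public static void main(String[] args) {
        int[] gaps = {3, 14, 25, 66};    // encodes doc ids 3, 17, 42, 108
        TermMeta meta = buildMeta(gaps); // cost paid once, when the term is written
        // Later lookups read three ints instead of decoding the postings again.
        System.out.println(meta.docFreq + " docs, first=" + meta.firstDoc + ", last=" + meta.lastDoc);
        System.out.println(Arrays.toString(scanFirstLast(gaps)));
    }
}
```

The trade-off in this sketch is a few extra integers per term in exchange for skipping the decode loop at query time.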