Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On Fri, Apr 06, 2018 at 08:52:15AM +0300, Panu Matilainen wrote: > On 04/05/2018 03:42 PM, Vladimir D. Seleznev wrote: > > On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote: > >> On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote: > >>> RPMTAG_IDENTITY is calculating as digest of part of package header that > >>> does not contain irrelevant to package build tag entries. > >>> > >>> Mathematically RPMTAG_IDENTITY value is a result of function of two > >>> variable: a package header and an rpm utility, thus this value can > >>> differ for same package and different version of rpm. > >>> > >> > >> Before proceeding with further work on this, we need to define what is > >> it that we're trying to identify. The above definition is very > >> ambiguous, and it's impossible to properly review + discuss the patch > >> when my idea of package identity might be entirely different from > >> somebody elses idea, that'll only cause unnecessary work and frustration. > > > > Agree, that commit message isn't clear. > > > >> Starting with, what is a "package"? Are we talking about the source > >> package, or binary packages? > > > > Originally it was about binary packages, but is there really difference? > > Source packages are building as well as binary, and something can be > > changed after rebuild. > > Source *packages* are built too, yes, but there's a vast difference > between reproducability of src.rpm and binary rpm. > > However while reviewing the patch yesterday, I realized I've been > increasingly thinking about *source* identity (note the lack of > "package"), which is something quite different: you'd calculate a digest > over the unparsed spec + all the sources and patches etc the spec refers > to [*] and save it in the header of binaries and sources on build. This > would let you identify all the packages that have been built from the > same source, ie whether the package was built eg on Fedora or RHEL (it's > fairly common to share specs between them) or whatever it'd have the > same source id. > > [*] obviously you need to parse the spec to get those references and > it's possible to create specs where this differs between arches, but > sane specs use same sources + patches between archs etc > > > > >> If it's binaries, then we're always ultimately talking about a *build*, > >> and a line needs to be drawn somewhere. > > > > OK. > > > >> There are any number of ways to draw such a line, so it needs to be > >> explicitly stated. One example of such line could be something like > >> "package id must match between a package built on different instances > >> of the same operating system, version and architecture". That clearly > >> is NOT the line that this version of the patch tries to draw, but then > >> it's not at all clear to me what that line is supposed to be. > > > > I think, there should be a line with other side idea: if package > > identity is matched between package build on the same build environment, > > then the build is reproducible. > > > > The possible new version of commit massage is below: > > > > Add RPMTAG_IDENTITY calculation as tag extension > > > > RPMTAG_IDENTITY is calculating as digest of values of significant > > package header tag entries and represents package build characteristics. > > The main purpose of package identity is reproducible build verification: > > if package identity is matched between package build on same build > > environment, then the package build is reproducible for this > > environment. > > Right, reproducability is one such line and that'd be a much better > description. > > I do think that RPMTAG_IDENTITY is overly broad name for such a narrow > purpose though - note how it led me to think about the source level > identity instead. Something towards "build id" maybe, but we don't want > to mix it up with debuginfo buildid. No need to get hung over it right > now though, just something to think about. RPMTAG_IDENTITY originally was supposed to be called RPMTAG_BUILDID, but we decided that it was not a good tag name. Perhaps, it can be named RPMTAG_PBUILDID, but I think currently it is not main task. -- With best regards, Vladimir D. Seleznev ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On Thu, Apr 05, 2018 at 05:18:14PM -0400, Jeff Johnson wrote: > > > > On Apr 5, 2018, at 4:41 AM, Panu Matilainenwrote: > > > >> On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote: > >> RPMTAG_IDENTITY is calculating as digest of part of package header that > >> does not contain irrelevant to package build tag entries. > >> Mathematically RPMTAG_IDENTITY value is a result of function of two > >> variable: a package header and an rpm utility, thus this value can > >> differ for same package and different version of rpm. > > > > (aside) > Can we move this discussion to the github issue? E-mail is > increasingly painful for discussions ... I will provide some general > ideas there. https://github.com/rpm-software-management/rpm/issues/426 > > Before proceeding with further work on this, we need to define what > > is it that we're trying to identify. The above definition is very > > ambiguous, and it's impossible to properly review + discuss the > > patch when my idea of package identity might be entirely different > > from somebody elses idea, that'll only cause unnecessary work and > > frustration. > > > > Yup. However, IDENTITY as a proof-of-reproducibility is sufficient for > discussion, though there are many details about what the plaintext > should be remain to be decided. > > > Starting with, what is a "package"? Are we talking about the source > > package, or binary packages? > > > > Both binary/source, just different identities (unless one wants to use > source IDENTITY to tie binary packages to a sufficiently similar class > of "reproducible" source rpm's, in which case a dynamic IDENTITY will > also have to be added into headers). > > > If it's binaries, then we're always ultimately talking about a > > *build*, and a line needs to be drawn somewhere. There are any > > number of ways to draw such a line, so it needs to be explicitly > > stated. One example of such line could be something like "package id > > must match between a package built on different instances of the > > same operating system, version and architecture". That clearly is > > NOT the line that this version of the patch tries to draw, but then > > it's not at all clear to me what that line is supposed to be. > > > > I'll add other thoughts at the github issue for IDENTITY. -- With best regards, Vladimir D. Seleznev ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
Sent from my iPad > On Apr 6, 2018, at 1:52 AM, Panu Matilainenwrote: > >> On 04/05/2018 03:42 PM, Vladimir D. Seleznev wrote: >>> On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote: On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote: RPMTAG_IDENTITY is calculating as digest of part of package header that does not contain irrelevant to package build tag entries. Mathematically RPMTAG_IDENTITY value is a result of function of two variable: a package header and an rpm utility, thus this value can differ for same package and different version of rpm. >>> >>> Before proceeding with further work on this, we need to define what is >>> it that we're trying to identify. The above definition is very >>> ambiguous, and it's impossible to properly review + discuss the patch >>> when my idea of package identity might be entirely different from >>> somebody elses idea, that'll only cause unnecessary work and frustration. >> Agree, that commit message isn't clear. >>> Starting with, what is a "package"? Are we talking about the source >>> package, or binary packages? >> Originally it was about binary packages, but is there really difference? >> Source packages are building as well as binary, and something can be >> changed after rebuild. > > Source *packages* are built too, yes, but there's a vast difference between > reproducability of src.rpm and binary rpm. > > However while reviewing the patch yesterday, I realized I've been > increasingly thinking about *source* identity (note the lack of "package"), > which is something quite different: you'd calculate a digest over the > unparsed spec + all the sources and patches etc the spec refers to [*] and > save it in the header of binaries and sources on build. This would let you > identify all the packages that have been built from the same source, ie > whether the package was built eg on Fedora or RHEL (it's fairly common to > share specs between them) or whatever it'd have the same source id. > > [*] obviously you need to parse the spec to get those references and it's > possible to create specs where this differs between arches, but sane specs > use same sources + patches between archs etc > >>> If it's binaries, then we're always ultimately talking about a *build*, >>> and a line needs to be drawn somewhere. >> OK. >>> There are any number of ways to draw such a line, so it needs to be >>> explicitly stated. One example of such line could be something like >>> "package id must match between a package built on different instances >>> of the same operating system, version and architecture". That clearly >>> is NOT the line that this version of the patch tries to draw, but then >>> it's not at all clear to me what that line is supposed to be. >> I think, there should be a line with other side idea: if package >> identity is matched between package build on the same build environment, >> then the build is reproducible. >> The possible new version of commit massage is below: >> Add RPMTAG_IDENTITY calculation as tag extension >> RPMTAG_IDENTITY is calculating as digest of values of significant >> package header tag entries and represents package build characteristics. >> The main purpose of package identity is reproducible build verification: >> if package identity is matched between package build on same build >> environment, then the package build is reproducible for this >> environment. > > Right, reproducability is one such line and that'd be a much better > description. > > I do think that RPMTAG_IDENTITY is overly broad name for such a narrow > purpose though - note how it led me to think about the source level identity > instead. Something towards "build id" maybe, but we don't want to mix it up > with debuginfo buildid. No need to get hung over it right now though, just > something to think about. > To handle multiple types of reproducibility, IDENTITY needs to be computed across an array of hashes of elements, not the elements themselves. One reason for the added abstraction layer in an array of hashes of elements is to diagnose proof-of-reproducibility failures: when IDENTITY fails, one can then identify which element failed. For the (overly simple) case of tags in a header, the array of hashes is of each tag's data, and IDENTITY is then the digest of the array. For diagnostic purposes, the tag name should also be appended to the tag data hash. By using a hash on an array of hashes, the mechanism can be chained/extended. Consider, say, a TRANSACTION_IDENTITY composed as an array of package IDENTITY hashes composed of tag data elements in the (overly simple) example we have been considering. See what NixOS (a package manager based on functional programming concepts) does to append human readable strings to hashes for an easily readable format for the array elements in the IDENTITY value. hth 73 de Jeff >- Panu - >
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On 04/05/2018 03:42 PM, Vladimir D. Seleznev wrote: On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote: On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote: RPMTAG_IDENTITY is calculating as digest of part of package header that does not contain irrelevant to package build tag entries. Mathematically RPMTAG_IDENTITY value is a result of function of two variable: a package header and an rpm utility, thus this value can differ for same package and different version of rpm. Before proceeding with further work on this, we need to define what is it that we're trying to identify. The above definition is very ambiguous, and it's impossible to properly review + discuss the patch when my idea of package identity might be entirely different from somebody elses idea, that'll only cause unnecessary work and frustration. Agree, that commit message isn't clear. Starting with, what is a "package"? Are we talking about the source package, or binary packages? Originally it was about binary packages, but is there really difference? Source packages are building as well as binary, and something can be changed after rebuild. Source *packages* are built too, yes, but there's a vast difference between reproducability of src.rpm and binary rpm. However while reviewing the patch yesterday, I realized I've been increasingly thinking about *source* identity (note the lack of "package"), which is something quite different: you'd calculate a digest over the unparsed spec + all the sources and patches etc the spec refers to [*] and save it in the header of binaries and sources on build. This would let you identify all the packages that have been built from the same source, ie whether the package was built eg on Fedora or RHEL (it's fairly common to share specs between them) or whatever it'd have the same source id. [*] obviously you need to parse the spec to get those references and it's possible to create specs where this differs between arches, but sane specs use same sources + patches between archs etc If it's binaries, then we're always ultimately talking about a *build*, and a line needs to be drawn somewhere. OK. There are any number of ways to draw such a line, so it needs to be explicitly stated. One example of such line could be something like "package id must match between a package built on different instances of the same operating system, version and architecture". That clearly is NOT the line that this version of the patch tries to draw, but then it's not at all clear to me what that line is supposed to be. I think, there should be a line with other side idea: if package identity is matched between package build on the same build environment, then the build is reproducible. The possible new version of commit massage is below: Add RPMTAG_IDENTITY calculation as tag extension RPMTAG_IDENTITY is calculating as digest of values of significant package header tag entries and represents package build characteristics. The main purpose of package identity is reproducible build verification: if package identity is matched between package build on same build environment, then the package build is reproducible for this environment. Right, reproducability is one such line and that'd be a much better description. I do think that RPMTAG_IDENTITY is overly broad name for such a narrow purpose though - note how it led me to think about the source level identity instead. Something towards "build id" maybe, but we don't want to mix it up with debuginfo buildid. No need to get hung over it right now though, just something to think about. - Panu - ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
> On Apr 5, 2018, at 4:41 AM, Panu Matilainenwrote: > >> On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote: >> RPMTAG_IDENTITY is calculating as digest of part of package header that >> does not contain irrelevant to package build tag entries. >> Mathematically RPMTAG_IDENTITY value is a result of function of two >> variable: a package header and an rpm utility, thus this value can >> differ for same package and different version of rpm. > (aside) Can we move this discussion to the github issue? E-mail is increasingly painful for discussions ... I will provide some general ideas there. > Before proceeding with further work on this, we need to define what is it > that we're trying to identify. The above definition is very ambiguous, and > it's impossible to properly review + discuss the patch when my idea of > package identity might be entirely different from somebody elses idea, > that'll only cause unnecessary work and frustration. > Yup. However, IDENTITY as a proof-of-reproducibility is sufficient for discussion, though there are many details about what the plaintext should be remain to be decided. > Starting with, what is a "package"? Are we talking about the source package, > or binary packages? > Both binary/source, just different identities (unless one wants to use source IDENTITY to tie binary packages to a sufficiently similar class of "reproducible" source rpm's, in which case a dynamic IDENTITY will also have to be added into headers). > If it's binaries, then we're always ultimately talking about a *build*, and a > line needs to be drawn somewhere. There are any number of ways to draw such a > line, so it needs to be explicitly stated. One example of such line could be > something like "package id must match between a package built on different > instances of the same operating system, version and architecture". That > clearly is NOT the line that this version of the patch tries to draw, but > then it's not at all clear to me what that line is supposed to be. > I'll add other thoughts at the github issue for IDENTITY. 73 de Jeff >- Panu - > ___ > Rpm-maint mailing list > Rpm-maint@lists.rpm.org > http://lists.rpm.org/mailman/listinfo/rpm-maint ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
> On Apr 5, 2018, at 2:12 AM, Panu Matilainenwrote: > >> On 04/04/2018 05:32 PM, Vladimir D. Seleznev wrote: >>> On Wed, Apr 04, 2018 at 09:18:35AM -0400, Jeff Johnson wrote: >>> >>> On Apr 4, 2018, at 5:47 AM, Panu Matilainen wrote: On 04/04/2018 11:01 AM, Panu Matilainen wrote: [...] >> +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags) >> +{ >> +int rc = 1; >> +unsigned int bsize; >> +void * hb; >> +DIGEST_CTX ctx; >> +char * identity = NULL; >> + >> +struct rpmtd_s ttd; >> +Header ih = headerNew(); >> +HeaderIterator hi = headerInitIterator(h); >> + >> +while (headerNext(hi, ) && rc) { >> +if (rpmtdCount() > 0) { >> +if (!identityFilter(ttd.tag)) >> +rc = headerPut(ih, , HEADERPUT_DEFAULT); >> +} >> +rpmtdFreeData(); >> +} >> +headerFreeIterator(hi); >> + >> +if (!rc) >> +return 0; >> + >> +hb = headerExport(ih, ); Forgot to bring this up on the first round: another question is whether it actually makes sense to go through all this trouble of copying the header, then exporting it and calculating the digest from that. It'd be considerably cheaper to calculate a digest on the go while iterating over the data (from the immutable region, see my other email). A newly imported header is guaranteed to be sorted (by tags) so it's consistent. >>> >>> There is no particular reason why the IDENTITY digest should need/use >>> a header blob. >>> >>> Any faithful transformation (e.g. using sprintf or hex strings) on the >>> data for the set of tags can be used for an IDENTITY digest. The >>> header blob implicitly includes padding, which can/will change >>> depending on the definition ordering even when the tag data is >>> identical/reproducible. >> Ok. >>> The filtering should also be positive/inclusive rather than >>> negative/exclusive. While a positive list of tags to include is harder >>> to enumerate than the shorter list of tags to exclude, the IDENTITY >>> tag set will then be closed and well known. >> I still don't understand why filtering should be inclusive rather than >> exclusive? The only reason I see is arbitrary tags but the code >> currently does not include tags beyond RPMTAG_FIRSTFREE_TAG. > > With negative filtering you'll never quite know which tags were included. > Older packages have things that newer packages don't, and coming from > different sources who knows what patches the building rpm was carrying. And > you'll be playing "what we forgot" for quite some time this way. After > sleeping over it, for example: > > +case RPMTAG_CHANGELOG: > > This is not an actual tag in packages, it's only used by the spec parser > internally. To filter out the changelog from identity you'll need these three: > >RPMTAG_CHANGELOGTIME= 1080, /* i[] */ >RPMTAG_CHANGELOGNAME= 1081, /* s[] */ >RPMTAG_CHANGELOGTEXT= 1082, /* s[] */ > > Other tags you'd need to filter out but current patch misses: > - RPMTAG_RPMVERSION (reflects the version used to build rpm, which is not a > part of a packages identity) > - RPMTAG_OPTFLAGS (reflects building rpm configuration, not the package id) > - RPMTAG_PAYLOADFORMAT, RPMTAG_PAYLOADCOMPRESSOR, RPMTAG_PAYLOADFLAGS (these > reflect rpm configuration, not package id) > - RPMTAG_OS, RPMTAG_PLATFORM (if filtering arch then surely these too) > - RPMTAG_RHNPLATFORM (older packages can have this) > - RPMTAG_VENDOR, RPMTAG_PACKAGER (typically set from configuration) > - RPMTAG_FILEDIGESTS, RPMTAG_FILEDIGESTALGO (these depend on rpm > configuration, and also compiler versions, build time encoded in files etc) > > ...and still I didn't go through each and every tag, which is what will be > needed either way. Only with blacklisting, you'll never *really* know. With > whitelisting, identity instantly becomes well-defined, even if it turns out > it's missing something it should have. > > A random remark of the day: > RPMTAG_RELEASE you clearly want to include, but then for example in > Fedora/RHEL ecosystem part of it is very commonly partially defined by %dist > macro which comes from rpm config. So the package id would change depending > on which distro it was built on, even if it otherwise did not change. ID as a > build-time digest of the unparsed spec (and all the associated sources) would > avoid those issues... > >>> The simplest way to mark the positive/inclusive IDENTITY tag set is to >>> change the awk script that generates the tag table to add a marker >>> (like the array return marker) to identify which tags to include. >>> The members (and ordering) of the IDENTITY tag set might also need to >>> be configurable without recompiling. >> May be it would be easier to add marker that tells that
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On Thu, Apr 05, 2018 at 03:42:15PM +0300, Vladimir D. Seleznev wrote: > On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote: > > On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote: > > > RPMTAG_IDENTITY is calculating as digest of part of package header that > > > does not contain irrelevant to package build tag entries. > > > > > > Mathematically RPMTAG_IDENTITY value is a result of function of two > > > variable: a package header and an rpm utility, thus this value can > > > differ for same package and different version of rpm. > > > > > > > Before proceeding with further work on this, we need to define what is > > it that we're trying to identify. The above definition is very > > ambiguous, and it's impossible to properly review + discuss the patch > > when my idea of package identity might be entirely different from > > somebody elses idea, that'll only cause unnecessary work and frustration. > > Agree, that commit message isn't clear. I agree. > > Starting with, what is a "package"? Are we talking about the source > > package, or binary packages? > > Originally it was about binary packages, but is there really difference? > Source packages are building as well as binary, and something can be > changed after rebuild. > > > If it's binaries, then we're always ultimately talking about a *build*, > > and a line needs to be drawn somewhere. > > OK. > > > There are any number of ways to draw such a line, so it needs to be > > explicitly stated. One example of such line could be something like > > "package id must match between a package built on different instances > > of the same operating system, version and architecture". That clearly > > is NOT the line that this version of the patch tries to draw, but then > > it's not at all clear to me what that line is supposed to be. > > I think, there should be a line with other side idea: if package > identity is matched between package build on the same build environment, > then the build is reproducible. > > The possible new version of commit massage is below: > > Add RPMTAG_IDENTITY calculation as tag extension > > RPMTAG_IDENTITY is calculating as digest of values of significant > package header tag entries and represents package build characteristics. > The main purpose of package identity is reproducible build verification: > if package identity is matched between package build on same build > environment, then the package build is reproducible for this > environment. -- With best regards, Vladimir D. Seleznev ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote: > On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote: > > RPMTAG_IDENTITY is calculating as digest of part of package header that > > does not contain irrelevant to package build tag entries. > > > > Mathematically RPMTAG_IDENTITY value is a result of function of two > > variable: a package header and an rpm utility, thus this value can > > differ for same package and different version of rpm. > > > > Before proceeding with further work on this, we need to define what is > it that we're trying to identify. The above definition is very > ambiguous, and it's impossible to properly review + discuss the patch > when my idea of package identity might be entirely different from > somebody elses idea, that'll only cause unnecessary work and frustration. Agree, that commit message isn't clear. > Starting with, what is a "package"? Are we talking about the source > package, or binary packages? Originally it was about binary packages, but is there really difference? Source packages are building as well as binary, and something can be changed after rebuild. > If it's binaries, then we're always ultimately talking about a *build*, > and a line needs to be drawn somewhere. OK. > There are any number of ways to draw such a line, so it needs to be > explicitly stated. One example of such line could be something like > "package id must match between a package built on different instances > of the same operating system, version and architecture". That clearly > is NOT the line that this version of the patch tries to draw, but then > it's not at all clear to me what that line is supposed to be. I think, there should be a line with other side idea: if package identity is matched between package build on the same build environment, then the build is reproducible. The possible new version of commit massage is below: Add RPMTAG_IDENTITY calculation as tag extension RPMTAG_IDENTITY is calculating as digest of values of significant package header tag entries and represents package build characteristics. The main purpose of package identity is reproducible build verification: if package identity is matched between package build on same build environment, then the package build is reproducible for this environment. -- With best regards, Vladimir D. Seleznev ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote: RPMTAG_IDENTITY is calculating as digest of part of package header that does not contain irrelevant to package build tag entries. Mathematically RPMTAG_IDENTITY value is a result of function of two variable: a package header and an rpm utility, thus this value can differ for same package and different version of rpm. Before proceeding with further work on this, we need to define what is it that we're trying to identify. The above definition is very ambiguous, and it's impossible to properly review + discuss the patch when my idea of package identity might be entirely different from somebody elses idea, that'll only cause unnecessary work and frustration. Starting with, what is a "package"? Are we talking about the source package, or binary packages? If it's binaries, then we're always ultimately talking about a *build*, and a line needs to be drawn somewhere. There are any number of ways to draw such a line, so it needs to be explicitly stated. One example of such line could be something like "package id must match between a package built on different instances of the same operating system, version and architecture". That clearly is NOT the line that this version of the patch tries to draw, but then it's not at all clear to me what that line is supposed to be. - Panu - ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On 04/04/2018 05:32 PM, Vladimir D. Seleznev wrote: On Wed, Apr 04, 2018 at 09:18:35AM -0400, Jeff Johnson wrote: On Apr 4, 2018, at 5:47 AM, Panu Matilainenwrote: On 04/04/2018 11:01 AM, Panu Matilainen wrote: [...] +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags) +{ +int rc = 1; +unsigned int bsize; +void * hb; +DIGEST_CTX ctx; +char * identity = NULL; + +struct rpmtd_s ttd; +Header ih = headerNew(); +HeaderIterator hi = headerInitIterator(h); + +while (headerNext(hi, ) && rc) { +if (rpmtdCount() > 0) { +if (!identityFilter(ttd.tag)) +rc = headerPut(ih, , HEADERPUT_DEFAULT); +} +rpmtdFreeData(); +} +headerFreeIterator(hi); + +if (!rc) +return 0; + +hb = headerExport(ih, ); Forgot to bring this up on the first round: another question is whether it actually makes sense to go through all this trouble of copying the header, then exporting it and calculating the digest from that. It'd be considerably cheaper to calculate a digest on the go while iterating over the data (from the immutable region, see my other email). A newly imported header is guaranteed to be sorted (by tags) so it's consistent. There is no particular reason why the IDENTITY digest should need/use a header blob. Any faithful transformation (e.g. using sprintf or hex strings) on the data for the set of tags can be used for an IDENTITY digest. The header blob implicitly includes padding, which can/will change depending on the definition ordering even when the tag data is identical/reproducible. Ok. The filtering should also be positive/inclusive rather than negative/exclusive. While a positive list of tags to include is harder to enumerate than the shorter list of tags to exclude, the IDENTITY tag set will then be closed and well known. I still don't understand why filtering should be inclusive rather than exclusive? The only reason I see is arbitrary tags but the code currently does not include tags beyond RPMTAG_FIRSTFREE_TAG. With negative filtering you'll never quite know which tags were included. Older packages have things that newer packages don't, and coming from different sources who knows what patches the building rpm was carrying. And you'll be playing "what we forgot" for quite some time this way. After sleeping over it, for example: +case RPMTAG_CHANGELOG: This is not an actual tag in packages, it's only used by the spec parser internally. To filter out the changelog from identity you'll need these three: RPMTAG_CHANGELOGTIME= 1080, /* i[] */ RPMTAG_CHANGELOGNAME= 1081, /* s[] */ RPMTAG_CHANGELOGTEXT= 1082, /* s[] */ Other tags you'd need to filter out but current patch misses: - RPMTAG_RPMVERSION (reflects the version used to build rpm, which is not a part of a packages identity) - RPMTAG_OPTFLAGS (reflects building rpm configuration, not the package id) - RPMTAG_PAYLOADFORMAT, RPMTAG_PAYLOADCOMPRESSOR, RPMTAG_PAYLOADFLAGS (these reflect rpm configuration, not package id) - RPMTAG_OS, RPMTAG_PLATFORM (if filtering arch then surely these too) - RPMTAG_RHNPLATFORM (older packages can have this) - RPMTAG_VENDOR, RPMTAG_PACKAGER (typically set from configuration) - RPMTAG_FILEDIGESTS, RPMTAG_FILEDIGESTALGO (these depend on rpm configuration, and also compiler versions, build time encoded in files etc) ...and still I didn't go through each and every tag, which is what will be needed either way. Only with blacklisting, you'll never *really* know. With whitelisting, identity instantly becomes well-defined, even if it turns out it's missing something it should have. A random remark of the day: RPMTAG_RELEASE you clearly want to include, but then for example in Fedora/RHEL ecosystem part of it is very commonly partially defined by %dist macro which comes from rpm config. So the package id would change depending on which distro it was built on, even if it otherwise did not change. ID as a build-time digest of the unparsed spec (and all the associated sources) would avoid those issues... The simplest way to mark the positive/inclusive IDENTITY tag set is to change the awk script that generates the tag table to add a marker (like the array return marker) to identify which tags to include. The members (and ordering) of the IDENTITY tag set might also need to be configurable without recompiling. May be it would be easier to add marker that tells that marked tag should be filtered from IDENTITY by calculation? But what is best way to do that? Should I add a new field to struct headerTagTableEntry_s? Or I miss something? AIUI that's exactly what Jeff was suggesting. But adding a new field to headerTagTableEntry_s would require also adding an API to access it which seems a bit weird I think. An alternative would be constructing the identityFilter() function into a .C file with eg an awk script that's then included into
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On Wed, Apr 04, 2018 at 09:18:35AM -0400, Jeff Johnson wrote: > > > > On Apr 4, 2018, at 5:47 AM, Panu Matilainenwrote: > > > > On 04/04/2018 11:01 AM, Panu Matilainen wrote: > > [...] > >>> +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags) > >>> +{ > >>> +int rc = 1; > >>> +unsigned int bsize; > >>> +void * hb; > >>> +DIGEST_CTX ctx; > >>> +char * identity = NULL; > >>> + > >>> +struct rpmtd_s ttd; > >>> +Header ih = headerNew(); > >>> +HeaderIterator hi = headerInitIterator(h); > >>> + > >>> +while (headerNext(hi, ) && rc) { > >>> +if (rpmtdCount() > 0) { > >>> +if (!identityFilter(ttd.tag)) > >>> +rc = headerPut(ih, , HEADERPUT_DEFAULT); > >>> +} > >>> +rpmtdFreeData(); > >>> +} > >>> +headerFreeIterator(hi); > >>> + > >>> +if (!rc) > >>> +return 0; > >>> + > >>> +hb = headerExport(ih, ); > > > > Forgot to bring this up on the first round: another question is > > whether it actually makes sense to go through all this trouble of > > copying the header, then exporting it and calculating the digest > > from that. > > > > It'd be considerably cheaper to calculate a digest on the go while > > iterating over the data (from the immutable region, see my other > > email). A newly imported header is guaranteed to be sorted (by tags) > > so it's consistent. > > > > There is no particular reason why the IDENTITY digest should need/use > a header blob. > > Any faithful transformation (e.g. using sprintf or hex strings) on the > data for the set of tags can be used for an IDENTITY digest. The > header blob implicitly includes padding, which can/will change > depending on the definition ordering even when the tag data is > identical/reproducible. Ok. > The filtering should also be positive/inclusive rather than > negative/exclusive. While a positive list of tags to include is harder > to enumerate than the shorter list of tags to exclude, the IDENTITY > tag set will then be closed and well known. I still don't understand why filtering should be inclusive rather than exclusive? The only reason I see is arbitrary tags but the code currently does not include tags beyond RPMTAG_FIRSTFREE_TAG. > The simplest way to mark the positive/inclusive IDENTITY tag set is to > change the awk script that generates the tag table to add a marker > (like the array return marker) to identify which tags to include. > The members (and ordering) of the IDENTITY tag set might also need to > be configurable without recompiling. May be it would be easier to add marker that tells that marked tag should be filtered from IDENTITY by calculation? But what is best way to do that? Should I add a new field to struct headerTagTableEntry_s? Or I miss something? > But overall, dynamically generating the IDENTITY tag set withe a tag > extension can be deployed (and back ported and maintained) far more > easily than changing header code. > > Nice job! -- With best regards, Vladimir D. Seleznev ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
> On Apr 4, 2018, at 5:47 AM, Panu Matilainenwrote: > > On 04/04/2018 11:01 AM, Panu Matilainen wrote: > [...] >>> +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags) >>> +{ >>> +int rc = 1; >>> +unsigned int bsize; >>> +void * hb; >>> +DIGEST_CTX ctx; >>> +char * identity = NULL; >>> + >>> +struct rpmtd_s ttd; >>> +Header ih = headerNew(); >>> +HeaderIterator hi = headerInitIterator(h); >>> + >>> +while (headerNext(hi, ) && rc) { >>> +if (rpmtdCount() > 0) { >>> +if (!identityFilter(ttd.tag)) >>> +rc = headerPut(ih, , HEADERPUT_DEFAULT); >>> +} >>> +rpmtdFreeData(); >>> +} >>> +headerFreeIterator(hi); >>> + >>> +if (!rc) >>> +return 0; >>> + >>> +hb = headerExport(ih, ); > > Forgot to bring this up on the first round: another question is whether it > actually makes sense to go through all this trouble of copying the header, > then exporting it and calculating the digest from that. > > It'd be considerably cheaper to calculate a digest on the go while iterating > over the data (from the immutable region, see my other email). A newly > imported header is guaranteed to be sorted (by tags) so it's consistent. > There is no particular reason why the IDENTITY digest should need/use a header blob. Any faithful transformation (e.g. using sprintf or hex strings) on the data for the set of tags can be used for an IDENTITY digest. The header blob implicitly includes padding, which can/will change depending on the definition ordering even when the tag data is identical/reproducible. The filtering should also be positive/inclusive rather than negative/exclusive. While a positive list of tags to include is harder to enumerate than the shorter list of tags to exclude, the IDENTITY tag set will then be closed and well known. The simplest way to mark the positive/inclusive IDENTITY tag set is to change the awk script that generates the tag table to add a marker (like the array return marker) to identify which tags to include. The members (and ordering) of the IDENTITY tag set might also need to be configurable without recompiling. But overall, dynamically generating the IDENTITY tag set withe a tag extension can be deployed (and back ported and maintained) far more easily than changing header code. Nice job! 73 de Jeff >- Panu - > ___ > Rpm-maint mailing list > Rpm-maint@lists.rpm.org > http://lists.rpm.org/mailman/listinfo/rpm-maint ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint
Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension
On 04/04/2018 11:01 AM, Panu Matilainen wrote: [...] +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags) +{ + int rc = 1; + unsigned int bsize; + void * hb; + DIGEST_CTX ctx; + char * identity = NULL; + + struct rpmtd_s ttd; + Header ih = headerNew(); + HeaderIterator hi = headerInitIterator(h); + + while (headerNext(hi, ) && rc) { + if (rpmtdCount() > 0) { + if (!identityFilter(ttd.tag)) + rc = headerPut(ih, , HEADERPUT_DEFAULT); + } + rpmtdFreeData(); + } + headerFreeIterator(hi); + + if (!rc) + return 0; + + hb = headerExport(ih, ); Forgot to bring this up on the first round: another question is whether it actually makes sense to go through all this trouble of copying the header, then exporting it and calculating the digest from that. It'd be considerably cheaper to calculate a digest on the go while iterating over the data (from the immutable region, see my other email). A newly imported header is guaranteed to be sorted (by tags) so it's consistent. - Panu - ___ Rpm-maint mailing list Rpm-maint@lists.rpm.org http://lists.rpm.org/mailman/listinfo/rpm-maint