Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-09 Thread Vladimir D. Seleznev
On Fri, Apr 06, 2018 at 08:52:15AM +0300, Panu Matilainen wrote:
> On 04/05/2018 03:42 PM, Vladimir D. Seleznev wrote:
> > On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote:
> >> On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:
> >>> RPMTAG_IDENTITY is calculating as digest of part of package header that
> >>> does not contain irrelevant to package build tag entries.
> >>>
> >>> Mathematically RPMTAG_IDENTITY value is a result of function of two
> >>> variable: a package header and an rpm utility, thus this value can
> >>> differ for same package and different version of rpm.
> >>>
> >>
> >> Before proceeding with further work on this, we need to define what is
> >> it that we're trying to identify. The above definition is very
> >> ambiguous, and it's impossible to properly review + discuss the patch
> >> when my idea of package identity might be entirely different from
> >> somebody elses idea, that'll only cause unnecessary work and frustration.
> > 
> > Agree, that commit message isn't clear.
> > 
> >> Starting with, what is a "package"? Are we talking about the source
> >> package, or binary packages?
> > 
> > Originally it was about binary packages, but is there really difference?
> > Source packages are building as well as binary, and something can be
> > changed after rebuild.
> 
> Source *packages* are built too, yes, but there's a vast difference 
> between reproducability of src.rpm and binary rpm.
> 
> However while reviewing the patch yesterday, I realized I've been 
> increasingly thinking about *source* identity (note the lack of 
> "package"), which is something quite different: you'd calculate a digest 
> over the unparsed spec + all the sources and patches etc the spec refers 
> to [*] and save it in the header of binaries and sources on build. This 
> would let you identify all the packages that have been built from the 
> same source, ie whether the package was built eg on Fedora or RHEL (it's 
> fairly common to share specs between them) or whatever it'd have the 
> same source id.
> 
> [*] obviously you need to parse the spec to get those references and 
> it's possible to create specs where this differs between arches, but 
> sane specs use same sources + patches between archs etc
> 
> > 
> >> If it's binaries, then we're always ultimately talking about a *build*,
> >> and a line needs to be drawn somewhere.
> > 
> > OK.
> > 
> >> There are any number of ways to draw such a line, so it needs to be
> >> explicitly stated. One example of such line could be something like
> >> "package id must match between a package built on different instances
> >> of the same operating system, version and architecture". That clearly
> >> is NOT the line that this version of the patch tries to draw, but then
> >> it's not at all clear to me what that line is supposed to be.
> > 
> > I think, there should be a line with other side idea: if package
> > identity is matched between package build on the same build environment,
> > then the build is reproducible.
> > 
> > The possible new version of commit massage is below:
> > 
> > Add RPMTAG_IDENTITY calculation as tag extension
> > 
> > RPMTAG_IDENTITY is calculating as digest of values of significant
> > package header tag entries and represents package build characteristics.
> > The main purpose of package identity is reproducible build verification:
> > if package identity is matched between package build on same build
> > environment, then the package build is reproducible for this
> > environment.
> 
> Right, reproducability is one such line and that'd be a much better 
> description.
> 
> I do think that RPMTAG_IDENTITY is overly broad name for such a narrow 
> purpose though - note how it led me to think about the source level 
> identity instead. Something towards "build id" maybe, but we don't want 
> to mix it up with debuginfo buildid. No need to get hung over it right 
> now though, just something to think about.

RPMTAG_IDENTITY originally was supposed to be called RPMTAG_BUILDID, but
we decided that it was not a good tag name. Perhaps, it can be named
RPMTAG_PBUILDID, but I think currently it is not main task.

-- 
   With best regards,
   Vladimir D. Seleznev
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-09 Thread Vladimir D. Seleznev
On Thu, Apr 05, 2018 at 05:18:14PM -0400, Jeff Johnson wrote:
> 
> 
> > On Apr 5, 2018, at 4:41 AM, Panu Matilainen  wrote:
> > 
> >> On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:
> >> RPMTAG_IDENTITY is calculating as digest of part of package header that
> >> does not contain irrelevant to package build tag entries.
> >> Mathematically RPMTAG_IDENTITY value is a result of function of two
> >> variable: a package header and an rpm utility, thus this value can
> >> differ for same package and different version of rpm.
> > 
> 
> (aside)
> Can we move this discussion to the github issue? E-mail is
> increasingly painful for discussions ... I will provide some general
> ideas there.

https://github.com/rpm-software-management/rpm/issues/426

> > Before proceeding with further work on this, we need to define what
> > is it that we're trying to identify. The above definition is very
> > ambiguous, and it's impossible to properly review + discuss the
> > patch when my idea of package identity might be entirely different
> > from somebody elses idea, that'll only cause unnecessary work and
> > frustration.
> > 
> 
> Yup. However, IDENTITY as a proof-of-reproducibility is sufficient for
> discussion, though  there are many details about what the plaintext
> should be remain to be decided.
> 
> > Starting with, what is a "package"? Are we talking about the source
> > package, or binary packages?
> > 
> 
> Both binary/source, just different identities (unless one wants to use
> source IDENTITY to tie binary packages to a sufficiently similar class
> of "reproducible" source rpm's, in which case a dynamic IDENTITY will
> also have to be added into headers).
> 
> > If it's binaries, then we're always ultimately talking about a
> > *build*, and a line needs to be drawn somewhere. There are any
> > number of ways to draw such a line, so it needs to be explicitly
> > stated. One example of such line could be something like "package id
> > must match between a package built on different instances of the
> > same operating system, version and architecture". That clearly is
> > NOT the line that this version of the patch tries to draw, but then
> > it's not at all clear to me what that line is supposed to be.
> > 
> 
> I'll add other thoughts at the github issue for IDENTITY.

-- 
   With best regards,
   Vladimir D. Seleznev
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-06 Thread Jeff Johnson


Sent from my iPad

> On Apr 6, 2018, at 1:52 AM, Panu Matilainen  wrote:
> 
>> On 04/05/2018 03:42 PM, Vladimir D. Seleznev wrote:
>>> On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote:
 On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:
 RPMTAG_IDENTITY is calculating as digest of part of package header that
 does not contain irrelevant to package build tag entries.
 
 Mathematically RPMTAG_IDENTITY value is a result of function of two
 variable: a package header and an rpm utility, thus this value can
 differ for same package and different version of rpm.
>>> 
>>> Before proceeding with further work on this, we need to define what is
>>> it that we're trying to identify. The above definition is very
>>> ambiguous, and it's impossible to properly review + discuss the patch
>>> when my idea of package identity might be entirely different from
>>> somebody elses idea, that'll only cause unnecessary work and frustration.
>> Agree, that commit message isn't clear.
>>> Starting with, what is a "package"? Are we talking about the source
>>> package, or binary packages?
>> Originally it was about binary packages, but is there really difference?
>> Source packages are building as well as binary, and something can be
>> changed after rebuild.
> 
> Source *packages* are built too, yes, but there's a vast difference between 
> reproducability of src.rpm and binary rpm.
> 
> However while reviewing the patch yesterday, I realized I've been 
> increasingly thinking about *source* identity (note the lack of "package"), 
> which is something quite different: you'd calculate a digest over the 
> unparsed spec + all the sources and patches etc the spec refers to [*] and 
> save it in the header of binaries and sources on build. This would let you 
> identify all the packages that have been built from the same source, ie 
> whether the package was built eg on Fedora or RHEL (it's fairly common to 
> share specs between them) or whatever it'd have the same source id.
> 
> [*] obviously you need to parse the spec to get those references and it's 
> possible to create specs where this differs between arches, but sane specs 
> use same sources + patches between archs etc
> 
>>> If it's binaries, then we're always ultimately talking about a *build*,
>>> and a line needs to be drawn somewhere.
>> OK.
>>> There are any number of ways to draw such a line, so it needs to be
>>> explicitly stated. One example of such line could be something like
>>> "package id must match between a package built on different instances
>>> of the same operating system, version and architecture". That clearly
>>> is NOT the line that this version of the patch tries to draw, but then
>>> it's not at all clear to me what that line is supposed to be.
>> I think, there should be a line with other side idea: if package
>> identity is matched between package build on the same build environment,
>> then the build is reproducible.
>> The possible new version of commit massage is below:
>> Add RPMTAG_IDENTITY calculation as tag extension
>> RPMTAG_IDENTITY is calculating as digest of values of significant
>> package header tag entries and represents package build characteristics.
>> The main purpose of package identity is reproducible build verification:
>> if package identity is matched between package build on same build
>> environment, then the package build is reproducible for this
>> environment.
> 
> Right, reproducability is one such line and that'd be a much better 
> description.
> 
> I do think that RPMTAG_IDENTITY is overly broad name for such a narrow 
> purpose though - note how it led me to think about the source level identity 
> instead. Something towards "build id" maybe, but we don't want to mix it up 
> with debuginfo buildid. No need to get hung over it right now though, just 
> something to think about.
> 

To handle multiple types of reproducibility, IDENTITY needs to be computed 
across an array of hashes of elements, not the elements themselves.

One reason for the added abstraction layer in an array of hashes of elements is 
to diagnose proof-of-reproducibility failures: when IDENTITY fails, one can 
then identify which element failed.

For the (overly simple) case of tags in a header, the array of hashes is of 
each tag's data, and IDENTITY is then the digest of the array. For diagnostic 
purposes, the tag name should also be appended to the tag data hash.

By using a hash on an array of hashes, the mechanism can be chained/extended.

Consider, say, a TRANSACTION_IDENTITY composed as an array of package IDENTITY 
hashes composed of tag data elements in the (overly simple) example we have 
been considering.

See what NixOS (a package manager based on functional programming concepts) 
does to append human readable strings to hashes for an easily readable format 
for the array elements in the IDENTITY value.

hth

73 de Jeff
>- Panu -
> 

Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-05 Thread Panu Matilainen

On 04/05/2018 03:42 PM, Vladimir D. Seleznev wrote:

On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote:

On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:

RPMTAG_IDENTITY is calculating as digest of part of package header that
does not contain irrelevant to package build tag entries.

Mathematically RPMTAG_IDENTITY value is a result of function of two
variable: a package header and an rpm utility, thus this value can
differ for same package and different version of rpm.



Before proceeding with further work on this, we need to define what is
it that we're trying to identify. The above definition is very
ambiguous, and it's impossible to properly review + discuss the patch
when my idea of package identity might be entirely different from
somebody elses idea, that'll only cause unnecessary work and frustration.


Agree, that commit message isn't clear.


Starting with, what is a "package"? Are we talking about the source
package, or binary packages?


Originally it was about binary packages, but is there really difference?
Source packages are building as well as binary, and something can be
changed after rebuild.


Source *packages* are built too, yes, but there's a vast difference 
between reproducability of src.rpm and binary rpm.


However while reviewing the patch yesterday, I realized I've been 
increasingly thinking about *source* identity (note the lack of 
"package"), which is something quite different: you'd calculate a digest 
over the unparsed spec + all the sources and patches etc the spec refers 
to [*] and save it in the header of binaries and sources on build. This 
would let you identify all the packages that have been built from the 
same source, ie whether the package was built eg on Fedora or RHEL (it's 
fairly common to share specs between them) or whatever it'd have the 
same source id.


[*] obviously you need to parse the spec to get those references and 
it's possible to create specs where this differs between arches, but 
sane specs use same sources + patches between archs etc





If it's binaries, then we're always ultimately talking about a *build*,
and a line needs to be drawn somewhere.


OK.


There are any number of ways to draw such a line, so it needs to be
explicitly stated. One example of such line could be something like
"package id must match between a package built on different instances
of the same operating system, version and architecture". That clearly
is NOT the line that this version of the patch tries to draw, but then
it's not at all clear to me what that line is supposed to be.


I think, there should be a line with other side idea: if package
identity is matched between package build on the same build environment,
then the build is reproducible.

The possible new version of commit massage is below:

Add RPMTAG_IDENTITY calculation as tag extension

RPMTAG_IDENTITY is calculating as digest of values of significant
package header tag entries and represents package build characteristics.
The main purpose of package identity is reproducible build verification:
if package identity is matched between package build on same build
environment, then the package build is reproducible for this
environment.


Right, reproducability is one such line and that'd be a much better 
description.


I do think that RPMTAG_IDENTITY is overly broad name for such a narrow 
purpose though - note how it led me to think about the source level 
identity instead. Something towards "build id" maybe, but we don't want 
to mix it up with debuginfo buildid. No need to get hung over it right 
now though, just something to think about.


- Panu -
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-05 Thread Jeff Johnson


> On Apr 5, 2018, at 4:41 AM, Panu Matilainen  wrote:
> 
>> On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:
>> RPMTAG_IDENTITY is calculating as digest of part of package header that
>> does not contain irrelevant to package build tag entries.
>> Mathematically RPMTAG_IDENTITY value is a result of function of two
>> variable: a package header and an rpm utility, thus this value can
>> differ for same package and different version of rpm.
> 

(aside)
Can we move this discussion to the github issue? E-mail is increasingly painful 
for discussions ... I will provide some general ideas there.

> Before proceeding with further work on this, we need to define what is it 
> that we're trying to identify. The above definition is very ambiguous, and 
> it's impossible to properly review + discuss the patch when my idea of 
> package identity might be entirely different from somebody elses idea, 
> that'll only cause unnecessary work and frustration.
> 

Yup. However, IDENTITY as a proof-of-reproducibility is sufficient for 
discussion, though  there are many details about what the plaintext should be 
remain to be decided.

> Starting with, what is a "package"? Are we talking about the source package, 
> or binary packages?
> 

Both binary/source, just different identities (unless one wants to use source 
IDENTITY to tie binary packages to a sufficiently similar class of 
"reproducible" source rpm's, in which case a dynamic IDENTITY will also have to 
be added into headers).

> If it's binaries, then we're always ultimately talking about a *build*, and a 
> line needs to be drawn somewhere. There are any number of ways to draw such a 
> line, so it needs to be explicitly stated. One example of such line could be 
> something like "package id must match between a package built on different 
> instances of the same operating system, version and architecture". That 
> clearly is NOT the line that this version of the patch tries to draw, but 
> then it's not at all clear to me what that line is supposed to be.
> 

I'll add other thoughts at the github issue for IDENTITY.

73 de Jeff
>- Panu -
> ___
> Rpm-maint mailing list
> Rpm-maint@lists.rpm.org
> http://lists.rpm.org/mailman/listinfo/rpm-maint
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-05 Thread Jeff Johnson


> On Apr 5, 2018, at 2:12 AM, Panu Matilainen  wrote:
> 
>> On 04/04/2018 05:32 PM, Vladimir D. Seleznev wrote:
>>> On Wed, Apr 04, 2018 at 09:18:35AM -0400, Jeff Johnson wrote:
>>> 
>>> 
 On Apr 4, 2018, at 5:47 AM, Panu Matilainen  wrote:
 
 On 04/04/2018 11:01 AM, Panu Matilainen wrote:
 [...]
>> +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags)
>> +{
>> +int rc = 1;
>> +unsigned int bsize;
>> +void * hb;
>> +DIGEST_CTX ctx;
>> +char * identity = NULL;
>> +
>> +struct rpmtd_s ttd;
>> +Header ih = headerNew();
>> +HeaderIterator hi = headerInitIterator(h);
>> +
>> +while (headerNext(hi, ) && rc) {
>> +if (rpmtdCount() > 0) {
>> +if (!identityFilter(ttd.tag))
>> +rc = headerPut(ih, , HEADERPUT_DEFAULT);
>> +}
>> +rpmtdFreeData();
>> +}
>> +headerFreeIterator(hi);
>> +
>> +if (!rc)
>> +return 0;
>> +
>> +hb = headerExport(ih, );
 
 Forgot to bring this up on the first round: another question is
 whether it actually makes sense to go through all this trouble of
 copying the header, then exporting it and calculating the digest
 from that.
 
 It'd be considerably cheaper to calculate a digest on the go while
 iterating over the data (from the immutable region, see my other
 email). A newly imported header is guaranteed to be sorted (by tags)
 so it's consistent.
>>> 
>>> There is no particular reason why the IDENTITY digest should need/use
>>> a header blob.
>>> 
>>> Any faithful transformation (e.g. using sprintf or hex strings) on the
>>> data for the set of tags can be used for an IDENTITY digest. The
>>> header blob implicitly includes padding, which can/will change
>>> depending on the definition ordering even when the tag data is
>>> identical/reproducible.
>> Ok.
>>> The filtering should also be positive/inclusive rather than
>>> negative/exclusive. While a positive list of tags to include is harder
>>> to enumerate than the shorter list of tags to exclude, the IDENTITY
>>> tag set will then be closed and well known.
>> I still don't understand why filtering should be inclusive rather than
>> exclusive? The only reason I see is arbitrary tags but the code
>> currently does not include tags beyond RPMTAG_FIRSTFREE_TAG.
> 
> With negative filtering you'll never quite know which tags were included. 
> Older packages have things that newer packages don't, and coming from 
> different sources who knows what patches the building rpm was carrying. And 
> you'll be playing "what we forgot" for quite some time this way. After 
> sleeping over it, for example:
> 
> +case RPMTAG_CHANGELOG:
> 
> This is not an actual tag in packages, it's only used by the spec parser 
> internally. To filter out the changelog from identity you'll need these three:
> 
>RPMTAG_CHANGELOGTIME= 1080, /* i[] */
>RPMTAG_CHANGELOGNAME= 1081, /* s[] */
>RPMTAG_CHANGELOGTEXT= 1082, /* s[] */
> 
> Other tags you'd need to filter out but current patch misses:
> - RPMTAG_RPMVERSION (reflects the version used to build rpm, which is not a 
> part of a packages identity)
> - RPMTAG_OPTFLAGS (reflects building rpm configuration, not the package id)
> - RPMTAG_PAYLOADFORMAT, RPMTAG_PAYLOADCOMPRESSOR, RPMTAG_PAYLOADFLAGS (these 
> reflect rpm configuration, not package id)
> - RPMTAG_OS, RPMTAG_PLATFORM (if filtering arch then surely these too)
> - RPMTAG_RHNPLATFORM (older packages can have this)
> - RPMTAG_VENDOR, RPMTAG_PACKAGER (typically set from configuration)
> - RPMTAG_FILEDIGESTS, RPMTAG_FILEDIGESTALGO (these depend on rpm 
> configuration, and also compiler versions, build time encoded in files etc)
> 
> ...and still I didn't go through each and every tag, which is what will be 
> needed either way. Only with blacklisting, you'll never *really* know. With 
> whitelisting, identity instantly becomes well-defined, even if it turns out 
> it's missing something it should have.
> 
> A random remark of the day:
> RPMTAG_RELEASE you clearly want to include, but then for example in 
> Fedora/RHEL ecosystem part of it is very commonly partially defined by %dist 
> macro which comes from rpm config. So the package id would change depending 
> on which distro it was built on, even if it otherwise did not change. ID as a 
> build-time digest of the unparsed spec (and all the associated sources) would 
> avoid those issues...
> 
>>> The simplest way to mark the positive/inclusive IDENTITY tag set is to
>>> change the awk script that generates the tag table to add a marker
>>> (like the array return marker) to identify which tags to include.
>>> The members (and ordering) of the IDENTITY tag set might also need to
>>> be configurable without recompiling.
>> May be it would be easier to add marker that tells that 

Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-05 Thread Vladimir D. Seleznev
On Thu, Apr 05, 2018 at 03:42:15PM +0300, Vladimir D. Seleznev wrote:
> On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote:
> > On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:
> > > RPMTAG_IDENTITY is calculating as digest of part of package header that
> > > does not contain irrelevant to package build tag entries.
> > > 
> > > Mathematically RPMTAG_IDENTITY value is a result of function of two
> > > variable: a package header and an rpm utility, thus this value can
> > > differ for same package and different version of rpm.
> > > 
> > 
> > Before proceeding with further work on this, we need to define what is 
> > it that we're trying to identify. The above definition is very 
> > ambiguous, and it's impossible to properly review + discuss the patch 
> > when my idea of package identity might be entirely different from 
> > somebody elses idea, that'll only cause unnecessary work and frustration.
> 
> Agree, that commit message isn't clear.

I agree.

> > Starting with, what is a "package"? Are we talking about the source 
> > package, or binary packages?
> 
> Originally it was about binary packages, but is there really difference?
> Source packages are building as well as binary, and something can be
> changed after rebuild.
> 
> > If it's binaries, then we're always ultimately talking about a *build*, 
> > and a line needs to be drawn somewhere.
> 
> OK.
> 
> > There are any number of ways to draw such a line, so it needs to be
> > explicitly stated. One example of such line could be something like
> > "package id must match between a package built on different instances
> > of the same operating system, version and architecture". That clearly
> > is NOT the line that this version of the patch tries to draw, but then
> > it's not at all clear to me what that line is supposed to be.
> 
> I think, there should be a line with other side idea: if package
> identity is matched between package build on the same build environment,
> then the build is reproducible.
> 
> The possible new version of commit massage is below:
> 
> Add RPMTAG_IDENTITY calculation as tag extension
> 
> RPMTAG_IDENTITY is calculating as digest of values of significant
> package header tag entries and represents package build characteristics.
> The main purpose of package identity is reproducible build verification:
> if package identity is matched between package build on same build
> environment, then the package build is reproducible for this
> environment.

-- 
   With best regards,
   Vladimir D. Seleznev
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-05 Thread Vladimir D. Seleznev
On Thu, Apr 05, 2018 at 11:41:33AM +0300, Panu Matilainen wrote:
> On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:
> > RPMTAG_IDENTITY is calculating as digest of part of package header that
> > does not contain irrelevant to package build tag entries.
> > 
> > Mathematically RPMTAG_IDENTITY value is a result of function of two
> > variable: a package header and an rpm utility, thus this value can
> > differ for same package and different version of rpm.
> > 
> 
> Before proceeding with further work on this, we need to define what is 
> it that we're trying to identify. The above definition is very 
> ambiguous, and it's impossible to properly review + discuss the patch 
> when my idea of package identity might be entirely different from 
> somebody elses idea, that'll only cause unnecessary work and frustration.

Agree, that commit message isn't clear.

> Starting with, what is a "package"? Are we talking about the source 
> package, or binary packages?

Originally it was about binary packages, but is there really difference?
Source packages are building as well as binary, and something can be
changed after rebuild.

> If it's binaries, then we're always ultimately talking about a *build*, 
> and a line needs to be drawn somewhere.

OK.

> There are any number of ways to draw such a line, so it needs to be
> explicitly stated. One example of such line could be something like
> "package id must match between a package built on different instances
> of the same operating system, version and architecture". That clearly
> is NOT the line that this version of the patch tries to draw, but then
> it's not at all clear to me what that line is supposed to be.

I think, there should be a line with other side idea: if package
identity is matched between package build on the same build environment,
then the build is reproducible.

The possible new version of commit massage is below:

Add RPMTAG_IDENTITY calculation as tag extension

RPMTAG_IDENTITY is calculating as digest of values of significant
package header tag entries and represents package build characteristics.
The main purpose of package identity is reproducible build verification:
if package identity is matched between package build on same build
environment, then the package build is reproducible for this
environment.

-- 
   With best regards,
   Vladimir D. Seleznev
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-05 Thread Panu Matilainen

On 04/03/2018 10:31 PM, Vladimir D. Seleznev wrote:

RPMTAG_IDENTITY is calculating as digest of part of package header that
does not contain irrelevant to package build tag entries.

Mathematically RPMTAG_IDENTITY value is a result of function of two
variable: a package header and an rpm utility, thus this value can
differ for same package and different version of rpm.



Before proceeding with further work on this, we need to define what is 
it that we're trying to identify. The above definition is very 
ambiguous, and it's impossible to properly review + discuss the patch 
when my idea of package identity might be entirely different from 
somebody elses idea, that'll only cause unnecessary work and frustration.


Starting with, what is a "package"? Are we talking about the source 
package, or binary packages?


If it's binaries, then we're always ultimately talking about a *build*, 
and a line needs to be drawn somewhere. There are any number of ways to 
draw such a line, so it needs to be explicitly stated. One example of 
such line could be something like "package id must match between a 
package built on different instances of the same operating system, 
version and architecture". That clearly is NOT the line that this 
version of the patch tries to draw, but then it's not at all clear to me 
what that line is supposed to be.


- Panu -
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-05 Thread Panu Matilainen

On 04/04/2018 05:32 PM, Vladimir D. Seleznev wrote:

On Wed, Apr 04, 2018 at 09:18:35AM -0400, Jeff Johnson wrote:




On Apr 4, 2018, at 5:47 AM, Panu Matilainen  wrote:

On 04/04/2018 11:01 AM, Panu Matilainen wrote:
[...]

+static int identityTag(Header h, rpmtd td, headerGetFlags hgflags)
+{
+int rc = 1;
+unsigned int bsize;
+void * hb;
+DIGEST_CTX ctx;
+char * identity = NULL;
+
+struct rpmtd_s ttd;
+Header ih = headerNew();
+HeaderIterator hi = headerInitIterator(h);
+
+while (headerNext(hi, ) && rc) {
+if (rpmtdCount() > 0) {
+if (!identityFilter(ttd.tag))
+rc = headerPut(ih, , HEADERPUT_DEFAULT);
+}
+rpmtdFreeData();
+}
+headerFreeIterator(hi);
+
+if (!rc)
+return 0;
+
+hb = headerExport(ih, );


Forgot to bring this up on the first round: another question is
whether it actually makes sense to go through all this trouble of
copying the header, then exporting it and calculating the digest
from that.

It'd be considerably cheaper to calculate a digest on the go while
iterating over the data (from the immutable region, see my other
email). A newly imported header is guaranteed to be sorted (by tags)
so it's consistent.



There is no particular reason why the IDENTITY digest should need/use
a header blob.

Any faithful transformation (e.g. using sprintf or hex strings) on the
data for the set of tags can be used for an IDENTITY digest. The
header blob implicitly includes padding, which can/will change
depending on the definition ordering even when the tag data is
identical/reproducible.


Ok.


The filtering should also be positive/inclusive rather than
negative/exclusive. While a positive list of tags to include is harder
to enumerate than the shorter list of tags to exclude, the IDENTITY
tag set will then be closed and well known.


I still don't understand why filtering should be inclusive rather than
exclusive? The only reason I see is arbitrary tags but the code
currently does not include tags beyond RPMTAG_FIRSTFREE_TAG.


With negative filtering you'll never quite know which tags were 
included. Older packages have things that newer packages don't, and 
coming from different sources who knows what patches the building rpm 
was carrying. And you'll be playing "what we forgot" for quite some time 
this way. After sleeping over it, for example:


+case RPMTAG_CHANGELOG:

This is not an actual tag in packages, it's only used by the spec parser 
internally. To filter out the changelog from identity you'll need these 
three:


RPMTAG_CHANGELOGTIME= 1080, /* i[] */
RPMTAG_CHANGELOGNAME= 1081, /* s[] */
RPMTAG_CHANGELOGTEXT= 1082, /* s[] */

Other tags you'd need to filter out but current patch misses:
- RPMTAG_RPMVERSION (reflects the version used to build rpm, which is 
not a part of a packages identity)

- RPMTAG_OPTFLAGS (reflects building rpm configuration, not the package id)
- RPMTAG_PAYLOADFORMAT, RPMTAG_PAYLOADCOMPRESSOR, RPMTAG_PAYLOADFLAGS 
(these reflect rpm configuration, not package id)

- RPMTAG_OS, RPMTAG_PLATFORM (if filtering arch then surely these too)
- RPMTAG_RHNPLATFORM (older packages can have this)
- RPMTAG_VENDOR, RPMTAG_PACKAGER (typically set from configuration)
- RPMTAG_FILEDIGESTS, RPMTAG_FILEDIGESTALGO (these depend on rpm 
configuration, and also compiler versions, build time encoded in files etc)


...and still I didn't go through each and every tag, which is what will 
be needed either way. Only with blacklisting, you'll never *really* 
know. With whitelisting, identity instantly becomes well-defined, even 
if it turns out it's missing something it should have.


A random remark of the day:
RPMTAG_RELEASE you clearly want to include, but then for example in 
Fedora/RHEL ecosystem part of it is very commonly partially defined by 
%dist macro which comes from rpm config. So the package id would change 
depending on which distro it was built on, even if it otherwise did not 
change. ID as a build-time digest of the unparsed spec (and all the 
associated sources) would avoid those issues...





The simplest way to mark the positive/inclusive IDENTITY tag set is to
change the awk script that generates the tag table to add a marker
(like the array return marker) to identify which tags to include.
The members (and ordering) of the IDENTITY tag set might also need to
be configurable without recompiling.


May be it would be easier to add marker that tells that marked tag
should be filtered from IDENTITY by calculation? But what is best way to
do that? Should I add a new field to struct headerTagTableEntry_s? Or I
miss something?


AIUI that's exactly what Jeff was suggesting. But adding a new field to 
headerTagTableEntry_s would require also adding an API to access it 
which seems a bit weird I think. An alternative would be constructing 
the identityFilter() function into a .C file with eg an awk script 
that's then included into 

Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-04 Thread Vladimir D. Seleznev
On Wed, Apr 04, 2018 at 09:18:35AM -0400, Jeff Johnson wrote:
> 
> 
> > On Apr 4, 2018, at 5:47 AM, Panu Matilainen  wrote:
> > 
> > On 04/04/2018 11:01 AM, Panu Matilainen wrote:
> > [...]
> >>> +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags)
> >>> +{
> >>> +int rc = 1;
> >>> +unsigned int bsize;
> >>> +void * hb;
> >>> +DIGEST_CTX ctx;
> >>> +char * identity = NULL;
> >>> +
> >>> +struct rpmtd_s ttd;
> >>> +Header ih = headerNew();
> >>> +HeaderIterator hi = headerInitIterator(h);
> >>> +
> >>> +while (headerNext(hi, ) && rc) {
> >>> +if (rpmtdCount() > 0) {
> >>> +if (!identityFilter(ttd.tag))
> >>> +rc = headerPut(ih, , HEADERPUT_DEFAULT);
> >>> +}
> >>> +rpmtdFreeData();
> >>> +}
> >>> +headerFreeIterator(hi);
> >>> +
> >>> +if (!rc)
> >>> +return 0;
> >>> +
> >>> +hb = headerExport(ih, );
> > 
> > Forgot to bring this up on the first round: another question is
> > whether it actually makes sense to go through all this trouble of
> > copying the header, then exporting it and calculating the digest
> > from that.
> > 
> > It'd be considerably cheaper to calculate a digest on the go while
> > iterating over the data (from the immutable region, see my other
> > email). A newly imported header is guaranteed to be sorted (by tags)
> > so it's consistent.
> > 
> 
> There is no particular reason why the IDENTITY digest should need/use
> a header blob.
> 
> Any faithful transformation (e.g. using sprintf or hex strings) on the
> data for the set of tags can be used for an IDENTITY digest. The
> header blob implicitly includes padding, which can/will change
> depending on the definition ordering even when the tag data is
> identical/reproducible.

Ok.

> The filtering should also be positive/inclusive rather than
> negative/exclusive. While a positive list of tags to include is harder
> to enumerate than the shorter list of tags to exclude, the IDENTITY
> tag set will then be closed and well known.

I still don't understand why filtering should be inclusive rather than
exclusive? The only reason I see is arbitrary tags but the code
currently does not include tags beyond RPMTAG_FIRSTFREE_TAG.

> The simplest way to mark the positive/inclusive IDENTITY tag set is to
> change the awk script that generates the tag table to add a marker
> (like the array return marker) to identify which tags to include. 
> The members (and ordering) of the IDENTITY tag set might also need to
> be configurable without recompiling.

May be it would be easier to add marker that tells that marked tag
should be filtered from IDENTITY by calculation? But what is best way to
do that? Should I add a new field to struct headerTagTableEntry_s? Or I
miss something?

> But overall, dynamically generating the IDENTITY tag set withe a tag
> extension can be deployed (and back ported and maintained) far more
> easily than changing header code.
> 
> Nice job!

-- 
   With best regards,
   Vladimir D. Seleznev
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-04 Thread Jeff Johnson


> On Apr 4, 2018, at 5:47 AM, Panu Matilainen  wrote:
> 
> On 04/04/2018 11:01 AM, Panu Matilainen wrote:
> [...]
>>> +static int identityTag(Header h, rpmtd td, headerGetFlags hgflags)
>>> +{
>>> +int rc = 1;
>>> +unsigned int bsize;
>>> +void * hb;
>>> +DIGEST_CTX ctx;
>>> +char * identity = NULL;
>>> +
>>> +struct rpmtd_s ttd;
>>> +Header ih = headerNew();
>>> +HeaderIterator hi = headerInitIterator(h);
>>> +
>>> +while (headerNext(hi, ) && rc) {
>>> +if (rpmtdCount() > 0) {
>>> +if (!identityFilter(ttd.tag))
>>> +rc = headerPut(ih, , HEADERPUT_DEFAULT);
>>> +}
>>> +rpmtdFreeData();
>>> +}
>>> +headerFreeIterator(hi);
>>> +
>>> +if (!rc)
>>> +return 0;
>>> +
>>> +hb = headerExport(ih, );
> 
> Forgot to bring this up on the first round: another question is whether it 
> actually makes sense to go through all this trouble of copying the header, 
> then exporting it and calculating the digest from that.
> 
> It'd be considerably cheaper to calculate a digest on the go while iterating 
> over the data (from the immutable region, see my other email). A newly 
> imported header is guaranteed to be sorted (by tags) so it's consistent.
> 

There is no particular reason why the IDENTITY digest should need/use a header 
blob.

Any faithful transformation (e.g. using sprintf or hex strings) on the data for 
the set of tags can be used for an IDENTITY digest. The header blob implicitly 
includes padding, which can/will change depending on the definition ordering 
even when the tag data is identical/reproducible.

The filtering should also be positive/inclusive rather than negative/exclusive. 
While a positive list of tags to include is harder to enumerate than the 
shorter list of tags to exclude, the IDENTITY tag set will then be closed and 
well known.

The simplest way to mark the positive/inclusive IDENTITY tag set is to change 
the awk script that generates the tag table to add a marker (like the array 
return marker) to identify which tags to include. 

The members (and ordering) of the IDENTITY tag set might also need to be 
configurable without recompiling.

But overall, dynamically generating the IDENTITY tag set withe a tag extension 
can be deployed (and back ported and maintained) far more easily than changing 
header code.

Nice job!

73 de Jeff




>- Panu -
> ___
> Rpm-maint mailing list
> Rpm-maint@lists.rpm.org
> http://lists.rpm.org/mailman/listinfo/rpm-maint
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint


Re: [Rpm-maint] [PATCH] Add RPMTAG_IDENTITY calculation as tag extension

2018-04-04 Thread Panu Matilainen

On 04/04/2018 11:01 AM, Panu Matilainen wrote:
[...]

+static int identityTag(Header h, rpmtd td, headerGetFlags hgflags)
+{
+    int rc = 1;
+    unsigned int bsize;
+    void * hb;
+    DIGEST_CTX ctx;
+    char * identity = NULL;
+
+    struct rpmtd_s ttd;
+    Header ih = headerNew();
+    HeaderIterator hi = headerInitIterator(h);
+
+    while (headerNext(hi, ) && rc) {
+    if (rpmtdCount() > 0) {
+    if (!identityFilter(ttd.tag))
+    rc = headerPut(ih, , HEADERPUT_DEFAULT);
+    }
+    rpmtdFreeData();
+    }
+    headerFreeIterator(hi);
+
+    if (!rc)
+    return 0;
+
+    hb = headerExport(ih, );


Forgot to bring this up on the first round: another question is whether 
it actually makes sense to go through all this trouble of copying the 
header, then exporting it and calculating the digest from that.


It'd be considerably cheaper to calculate a digest on the go while 
iterating over the data (from the immutable region, see my other email). 
A newly imported header is guaranteed to be sorted (by tags) so it's 
consistent.


- Panu -
___
Rpm-maint mailing list
Rpm-maint@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-maint