Re: Two questions about build-path reproducibility in Debian

2024-03-05 Thread Eric Myhre


On 3/4/24 22:25, David A. Wheeler via rb-general wrote:

On Mar 4, 2024, at 3:37 PM, Holger Levsen  wrote:

On Mon, Mar 04, 2024 at 11:52:07AM -0800, John Gilmore wrote:

Why would these become "wishlist" bugs as opposed to actual reproducibility bugs
that deserve fixing, just because one server at Debian no longer invokes this
bug because it always uses the same build directory?

because it's "not one server at Debian" but what many ecosystems do: build in a
deterministic path (e.g. /$pkg/$version or whatever) or record the path as part
of the build environment, to have it deterministic as well.

in the distant past, before namespacing became popular, using a random path
was a solution to allow parallel builds of the same software & version.

and yes, this is a shortcut and a tradeoff, similar to demanding to build
in a certain locale. also, it takes reproducibility from around 80-85% of all
packages to >95%, IOW with this shortcut we can have meaningful reproducibility
*many years* sooner than without.

and I'd really rather see Debian 100% reproducible in 2030 than in 2038,
and some subsets today, or much sooner.

I agree with Holger (and Vagrant).

It'd be *nice* if a build was reproducible regardless of the directory used to
build it. But today, if you're building an executable for others, it's common to
build using a container/chroot or similar that makes it easy to implement "must
compile with these paths", while *fixing* this is often a lot of work.

I suggest focusing on ensuring everyone knows what the executable files
contain, first. If people can add more flexibility to their build process, all
the better, but that added flexibility comes at a cost of time and effort that
is NOT as important.

--- David A. Wheeler



Yet another +1 "hear, hear!" to this.

Flexibility is desirable.  Determinism even without maximal flexibility 
should still get the main thrust, and it is _not_ sufficiently solved 
yet in many situations and many pieces of software.
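To make the path problem concrete: GCC and Clang offer -ffile-prefix-map /
-fdebug-prefix-map options that rewrite any embedded path under the build
directory to a fixed placeholder. A minimal Python sketch of that mapping
semantics (the helper function is mine, purely illustrative, not a compiler
API):

```python
# Sketch of -ffile-prefix-map-style normalization: paths under the
# (randomized) build directory are rewritten to a fixed placeholder,
# so builds done in different directories embed identical strings.

def map_prefix(path: str, build_dir: str, placeholder: str = "/build") -> str:
    """Rewrite `path` so it no longer leaks the real build directory."""
    if path == build_dir or path.startswith(build_dir + "/"):
        return placeholder + path[len(build_dir):]
    return path

# Two hypothetical builds of the same source in randomized directories:
a = map_prefix("/tmp/build-abc123/src/main.c", "/tmp/build-abc123")
b = map_prefix("/tmp/build-xyz789/src/main.c", "/tmp/build-xyz789")
assert a == b == "/build/src/main.c"
```

Building in a pinned path (or recording the path) sidesteps even needing this;
the mapping approach is for when you want path-independence proper.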

Re: Introducing: Semantically reproducible builds

2023-05-27 Thread Eric Myhre

I could see myself supporting this.

It seems appropriate for the weaker term to require more words (thereby
teeing up the opportunity to point out the distinction, which will
remain important to do as part of urging further progress).  And this
proposal does fit that criterion!


Cheers!


On 26.05.2023 22:06, David A. Wheeler wrote:

Reproducible builds are great for showing that a package really was built
from some given source, but sometimes they're hard to do.

If your primary goal is to determine where the major risks are from subverted 
builds,
I think a useful backoff is something called a "semantically reproducible 
build".
(This term was decided on in a discussion with some other people & now I can't 
remember
who came up with the term.)

Below is a definition of the term & some rationale for it.
Note that this is expressly *not* the same as a fully reproducible build, though
any reproducible build is *also* a semantically reproducible build.

My hope is that if someone wants a reproducible build, they'll use that term.
However, if they want to talk about this backoff approach, they'll have a
clearly *different* but *related* term they can use, eliminating a source of
confusion.

 David A. Wheeler

== Details ==

As explained in the documentation for the oss-reproducible tool,
which is part of OSSGadget:

"A project build is *semantically reproducible*
if its build results can be either recreated exactly (a bit-for-bit
reproducible build), or if the differences between the release package and a
rebuilt package are not expected to produce functional differences in normal
cases. For example, the rebuilt package might have different date/time stamps,
or one might include files like .gitignore that are not in the other and would
not change the execution of a program under normal circumstances."

A semantically reproducible build has very low risk of being a subverted build
as long as it's *verified* to be semantically reproducible.
Put another way, verifying that a package has a semantically reproducible build
counters the risk where the putative source code isn't malicious, but
where someone has tampered with the build or distribution process,
resulting in a built package that *is* malicious.
It's quite common for builds to produce different date/time stamps, or
to add or remove "extra" files that would have no impact if the original
source code was not malicious.

It's much easier (and lower cost) for software
developers to create a semantically reproducible build than to always
create a fully reproducible build.
Fully reproducible builds are still the gold standard for verifying
that a build has not been tampered with.
However, creating fully reproducible builds often requires that package
creators change their build process, sometimes in substantive ways.
In many cases a semantically reproducible build requires no changes,
and even if changes are required, there are typically fewer of them.

OSSGadget includes a tool that can determine if a given package is
semantically reproducible.
It's still helpful to work to make a package a fully reproducible build.
A fully reproducible build is a somewhat stronger claim, and
you don't need a complex tool to determine if the package is fully
reproducible.
Even given that, it's easier to first create a package that's
semantically reproducible, and then work on the issues remaining
to make it a fully reproducible build.

In short, making packages at least semantically reproducible
(and verifying this) is a great countermeasure against subverted builds.
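As a toy illustration of what such a verification does (this is not
OSSGadget's actual rule set; the allowlist below is made up for the example):

```python
# Toy illustration of verifying semantic reproducibility: two packages are
# treated as equivalent if their file manifests match after discarding files
# that shouldn't affect execution.  The BENIGN allowlist is invented for
# this sketch, not taken from any real tool.

BENIGN = {".gitignore", ".gitattributes"}   # assumed-harmless filenames

def strip_benign(pkg: dict) -> dict:
    """pkg maps file path -> content digest; drop known-benign files."""
    return {p: d for p, d in pkg.items()
            if p.rsplit("/", 1)[-1] not in BENIGN}

def semantically_equal(pkg_a: dict, pkg_b: dict) -> bool:
    return strip_benign(pkg_a) == strip_benign(pkg_b)

release = {"bin/app": "sha256:aa11", "README": "sha256:bb22",
           ".gitignore": "sha256:cc33"}
rebuilt = {"bin/app": "sha256:aa11", "README": "sha256:bb22"}
assert semantically_equal(release, rebuilt)    # differ only by .gitignore
assert not semantically_equal(release, {"bin/app": "sha256:ff00"})
```

The hard (and security-critical) part in practice is deciding what belongs on
that allowlist; a too-generous list quietly weakens the claim.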

I had earlier talked with some people about this idea, but they noted that
there would be a lot of problems if the term "reproducible build" was changed to
be something like this. After consideration, I've decided they're right.
Bit-for-bit equality is a *powerful* countermeasure against even very clever 
attacks.
However, there's value in taking steps to get closer to bit-for-bit equality,
and if your goal is to measure risk, it's useful to have a term for this 
intermediate stage.
So, let's create a new (but obviously similar) term for it.




Re: Future of reprotest and alternatives (sbuild wrapper)?

2023-03-01 Thread Eric Myhre

On 28.02.2023 04:15, Mattia Rizzolo wrote:

On Mon, Feb 27, 2023 at 06:11:16PM -0800, Vagrant Cascadian wrote:

It also contains forks of some autopkgtest code, last updated in 2017,
if I am reading the git logs correctly. It is apparently no longer
working with current versions of qemu with the qemu backends:

   https://bugs.debian.org/1001250

I think it was forked largely to remove Debian-isms in the autopkgtest
code, which looks to be only packaged on Debian derivatives:

   https://repology.org/project/autopkgtest/versions

IMHO, basically that's the main problem reprotest has.

I know I talked with some autopkgtest maintainers in the past, and they
told me that they would really like to also get rid of this code they
have.

What we are interested in is the "backend" part of autopkgtest, the one
that has all of that abstraction layer to run whatever command in
different environments (qemu, lxc, null, chroot, etc).  They would also
like to get rid of this, delegating maintenance of those things.

Therefore, I believe the optimal solution would be for somebody to take
charge of that part, and make a pretty library out of it.  Now, as for
how this could happen, ...



ISTM that there are actually two jobs that must always be done: one is 
the isolation initiation... but the second is that some filesystems must 
also have appeared already.  (A totally empty filesystem on "boot" is an 
interesting state to be able to reach, but also not a very practical 
place to move much further from.)


It's also been my practical experience that trying to separate those two 
can be a bit tricksy.  Or at least, quite tricksy to do and make 
_efficient_, if that's any sort of goal.  And yet it's important to do 
so, because if one has to reimplement the filesystem construction for 
each kind of executor (qemu, lxc, chroot, etc) then that's a lot of ugly 
code to maintain.


We try to do both of these things in the warpforge project, and I think 
successfully, but a key boundary we chose is that we assume bind and/or 
overlayfs mounts are always a viable option -- or something that can act 
in a roughly equivalent way.  That seems to generally be supportable 
across many containment paradigms.



Does autopkgtest make a similar choice somewhere?  Or does it have some 
different philosophy or fundamental API that makes this less of an issue 
than I think of it as?


Re: Profile Guided Optimization (PGO)

2022-06-21 Thread Eric Myhre
I vaguely recall having a conversation about PGO with an engineer from 
Huawei (iirc) at an RB summit several years ago.  I think we came to 
that same idea -- of AOT determination and simply storing it -- fairly 
quickly; and came to no further ideas after prolonged thought.


And though it's tangential to many of our typical interests (e.g. 
security, etc) in reproducible builds, ISTM that treating PGO info as 
"source" and version controlling it should be treated as rather 
obviously correct anyway.  Surely anyone chasing performance 
optimization is also making sure they do so consistently, and thus doing 
some sort of prolonged tracking and graphing, and such a person would 
have very little leverage scored out of all that effort if they couldn't 
actually point to what changed when, right?



On 6/21/22 19:11, David A. Wheeler wrote:

Profile Guided Optimization (PGO) can make it challenging to generate 
reproducible builds.
Some options are listed here:
https://github.com/bmwiedemann/theunreproduciblepackage/tree/master/pgo

It'd be great if other options existed.

One idea: What about having the profile built ahead-of-time, and recorded in 
the source tree?
That probably has to be recorded separately for different architectures.
It might not be too bad to include it in the "source code" as long as it cannot 
affect
the actual semantics (only the performance). That is obviously annoying, but 
it's
also annoying to lose a nontrivial amount of performance?

Thoughts? Does anyone have any better ideas for making PGO reproducible?

Thanks.

--- David A. Wheeler






On 6/21/22 19:56, Orians, Jeremiah (DTMB) wrote:

It'll solve the reproducibility problem but introduce a bootstrapping problem
for how those files were generated in the first place.


FWIW, this seems sorta fine to me.

As long as a build path exists which _doesn't_ require the PGO AOT 
snapshot info to build successfully.


But that seems unlikely to rot, doesn't it?
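One way to make "treat the profile as source" concrete is to pin the committed
profile by digest, like any other input. A sketch (the file contents and
layout here are invented for illustration):

```python
# Sketch: treat a PGO profile as versioned source by pinning its digest,
# so the profile is a tracked, reviewable input rather than ambient build
# output.  The profile bytes below are invented for the example.
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def check_profile(profile: bytes, pinned: str) -> bool:
    """Refuse a profile that doesn't match the digest committed alongside it."""
    return digest(profile) == pinned

profile = b"branch 0x4012f0 taken 9731 times\n"   # pretend profile data
pin = digest(profile)                              # stored in the source tree
assert check_profile(profile, pin)
assert not check_profile(profile + b"tampered", pin)
```

With the pin in place, the profile is just more source: reproducibility of the
PGO-enabled build reduces to the usual "same inputs, same output" story.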


Re: Reproducible Builds Verification Format

2020-05-12 Thread Eric Myhre
Some of these dreams and the outlines of these concepts have been around 
quite a bit longer than this year, even.  I think some differential 
diagnosis about what makes this draft different, and why it makes the 
choices it does, would be useful.


Some things I'd like to see identified and explicitly discussed more 
frequently in this concept space:


- What's the "primary key"?  In other words, how can I meaningfully 
expect to identify this one attestation record, or this one build 
instruction document?


- What are the "secondary keys" I could plausibly expect to select on if 
I have a zillion of these, and want to find those that should or should 
not align in results?


- What parts of this info do we expect to be useful, and why?  (What 
user story caused a certain piece of info to seem relevant and 
actionable enough to include?)


- What things we *could* imagine someone proposing putting in this info 
which we might reject because we don't believe it would be useful, and why?


The motivations of "a generic way to compare results" are good.  But 
good intentions can only carry us so far.  These four things are some of 
the first considerations I have when looking at a format proposal.  
Without some thought about the "keys", I don't know how it will deliver 
on "comparability" at scale.  Without some meta-documentation of not 
just the data that goes _in_, but also the kind of data that _doesn't_, 
I worry that the spec will become a kitchen sink, sopping up more data 
with time regardless of its relevance, and correspondingly becoming less 
and less useful over time.


I don't know if these are the only four questions to ask, nor will I 
claim they are perfect, but they're some of the first things that come 
to my mind as heuristics, and I share them in the hope that they can be 
a useful whetstone for someone else's thoughts.
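To show what I mean by a "primary key", here is one possible (hypothetical,
not from the rbvf draft) choice: derive the record's identity from the source
digest plus the build-environment digest, so attestations from independent
rebuilders of the same build align under one key.

```python
# Hypothetical sketch of a primary key for an attestation record: identity
# is derived from (source digest, build-environment digest).  Field names
# are mine, not taken from the rbvf draft.
import hashlib
import json

def attestation_key(record: dict) -> str:
    primary = {"source": record["source_sha256"],
               "env": record["environment_sha256"]}
    return hashlib.sha256(
        json.dumps(primary, sort_keys=True).encode()).hexdigest()

alice = {"source_sha256": "abc", "environment_sha256": "def",
         "rebuilder": "alice"}
bob = {"source_sha256": "abc", "environment_sha256": "def",
       "rebuilder": "bob"}
# Records that *should* be comparable land under the same key:
assert attestation_key(alice) == attestation_key(bob)
```

Whatever the actual choice, making it explicit is what lets "comparability"
work at scale: everything not in the key is secondary metadata.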




As an incidental aside, I think what's currently listed in that github 
link as "origin_uri" may be mistaken in its conception of "URI".  The 
examples are such things as "http://ftp.us.debian.org/" and
"https://download.docker.com/", and I'm sure these are _locations_, not
_identifiers_ -- URLs, not URIs.


And I would question (begging forgiveness from anyone who knows my
refrain already) whether "locations" as any sort of primary key are a sturdy
idea to try to build upon.  They're terribly centralized, and provide
very little insurance against mutability events which can make all other
documents that refer to them become instantly useless.
Content-addressing may have some potential to address this, git (at 
least in concept) has shown us the way...
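For instance, git's blob identifiers are derived purely from the content
itself (plus a small header), so they cannot silently dangle or mutate the
way a URL can. This is the real computation git uses:

```python
# git's content-addressing for blobs: the identifier is the SHA-1 of a
# "blob <length>\0" header followed by the content bytes.
import hashlib

def git_blob_id(content: bytes) -> str:
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Matches `echo hello | git hash-object --stdin`
assert git_blob_id(b"hello\n") == "ce013625030ba8dba906f756967f9e9ca394464a"
```

A rebuilder format keyed on such identifiers stays meaningful even if every
mirror in the original record goes away.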




Cheers to all hopeful rebuilders :)


On 5/12/20 11:00 PM, Paul Spooren wrote:

Hi all,

at the RB Summit 2019 in Marrakesh there were some intense discussions about
*rebuilders* and a *verification format*. While first discussed only with
participants of the summit, it should now be shared with a broader audience!

A quick introduction to the topic of *rebuilders*: Open source projects usually
offer compiled packages, which is great in case I don't want to compile every
installed application. However, it raises the question of whether distributed
packages are what they claim to be. This is where *reproducible builds* and
*rebuilders* join the stage. The *rebuilders* try to recreate offered binaries,
following the upstream build process as closely as necessary.

To make the results accessible, storable and usable by tools, they
should all follow the same schema; hello, *reproducible builds verification
format* (rbvf). The format tries to be as generic as possible to cover all open
source projects offering precompiled packages. It stores the rebuilder
results of what is reproducible and what is not.

Rebuilders should publish those files publicly and sign them. Tools then collect
those files and process them for users and developers.

Ideally multiple institutions spin up their own rebuilders so users can trust
those rebuilders and only install packages verified by them.

The format is just a draft, please join in and share your thoughts. I'm happy
to extend, explain and discuss all the details. Please find it here[0].

As a proof of concept, there is already a *collector* which compares upstream
provided packages of Archlinux and OpenWrt with the results of rebuilders.
Please see the frontend here[1].

If you already perform any rebuilds of your project, please contact me on how
to integrate the results in the collector!

Best,
Paul


[0]: https://github.com/aparcar/reproducible-builds-verification-format
[1]: https://rebuild.aparcar.org/





[rb-general] Reproducible builds for... steampunk mind-transfer gadgetry (in webcomics)?

2019-10-03 Thread Eric Myhre
I get a giggle now and again when the concept of reproducible builds 
appears in some other cultural context.  Such as this week, in a webcomic!


(Context: steampunk universe; mad geniuses everywhere, generally 
building gadgets and then thinking about the consequences later; and at 
this moment in the plot arc, machines for altering minds... which 
they're about to use, but are afraid might've been modified by another 
mad genius in an unknown way!  Whee!)


http://www.girlgeniusonline.com/comic.php?date=20191002

Thank goodness this is only a problem in the comic universe!

(No serious content here -- but if anyone else is keeping a portfolio of 
"reproducible builds user stories"... ;) hehe)

___
rb-general@lists.reproducible-builds.org mailing list

To change your subscription options, visit 
https://lists.reproducible-builds.org/listinfo/rb-general.

To unsubscribe, send an email to 
rb-general-unsubscr...@lists.reproducible-builds.org.

Re: [rb-general] advice on stashing compiler options in a binary

2019-03-20 Thread Eric Myhre

On 3/20/19 12:19 PM, Orians, Jeremiah (DTMB) wrote:

In today's world, where you can easily create full containers or at least
chroot sandboxes, those are pretty easy to recreate.

Or a simpler option, fully static binaries like those M2-Planet creates.
https://github.com/oriansj/M2-Planet
Where there is no input that could possibly create non-deterministic
output.
Build directories, paths, timestamps, library paths, host instruction set or 
any other of that nonsense, just doesn't matter.



This seems like a good moment to mention the term "path-agnostic".

Static linked binaries are path-agnostic; but not all path-agnostic
things need be static linked.  For those who consider static linking
verboten for whatever reason... there are still options to escape the
"nonsense".


(If you have the patience for videos, I gave a talk which defines this 
term and has other ways to get there: 
https://media.ccc.de/v/ASG2018-204-path-agnostic_binaries_co-installable_libraries_and_how_to_have_nice_things 
)


Re: [rb-general] __DATE__ and other toolchain patches

2019-01-14 Thread Eric Myhre

On 1/14/19 12:42 PM, Mattia Rizzolo wrote:

I personally still prefer to _also_ push for single fixes, dropping any
source of possible unreproducibilities.

After all, the build date is totally meaningless in pretty much all
cases I can think of, so getting rid of it completely is only good, and
there is no reason to go out of our way to actively say "hey, don't
bother with that" in MRs removing __DATE__/__TIME__.



(At the risk of sending an email which could've been a Slack emoji 
reply... I don't know any other way to signal positive consensus on 
email...)


+1 intense nodding
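(For anyone landing here later: when a build genuinely wants an embedded
timestamp, the Reproducible Builds convention is to honor the
SOURCE_DATE_EPOCH environment variable rather than the wall clock. A minimal
sketch of that convention in Python:)

```python
# The conventional replacement for __DATE__/__TIME__: honor the
# SOURCE_DATE_EPOCH environment variable (per the Reproducible Builds
# specification) and fall back to the clock only when it is unset.
import time

def build_timestamp(env: dict) -> int:
    sde = env.get("SOURCE_DATE_EPOCH")
    return int(sde) if sde is not None else int(time.time())

assert build_timestamp({"SOURCE_DATE_EPOCH": "1577836800"}) == 1577836800
```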

Re: [rb-general] transitive collision resistance [was: rb formalism]

2018-12-21 Thread Eric Myhre
Folks, if there's something to say about hashes that can be answered by
a quick trip to Wikipedia or your other favorite fount of public
knowledge, please consider doing so... this discussion, though
liveliness is good, is starting to seem like a significant divergence
from the core purposes of this mailing list.


For what it's worth,

https://en.wikipedia.org/wiki/Cryptographic_hash

is a lovely page, as is

https://en.wikipedia.org/wiki/Merkle_tree

which talks about the long and well-studied history of how hashes compose.

And if you didn't like the parts of the earlier thread about rb 
formalisms that mentioned "h", then just mentally elide it.  By and 
large, it didn't matter: it's an efficiency boost, but if you'd rather 
see any uses of "h" as "this should be plausible as a primary key in 
some table", that's just fine.
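(For completeness, the composition the Merkle-tree page describes fits in a
few lines; the duplicate-last-node padding rule below is one common choice,
not the only one:)

```python
# How hashes compose (Merkle-style): each interior hash covers its
# children's hashes, so equal roots imply equal leaves, absent collisions.
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaves: list) -> str:
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

# Changing any single leaf changes the root:
assert merkle_root([b"a", b"b", b"c"]) != merkle_root([b"a", b"b", b"x"])
```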


Re: [rb-general] Reproducible Java builds with Maven

2018-11-26 Thread Eric Myhre

On 26.11.2018 03:00, Bernhard M. Wiedemann wrote:

Hi Hervé,

thanks for raising this topic.

On 26/11/2018 09.08, Hervé Boutemy wrote:

Anybody interested in working together?

With openSUSE we are doing all builds offline to ensure that we can
repeat builds later (without worry about offline or hacked servers), but
for maven this often meant we had to download 300 MB of someone else's
binaries to use in the build.


I love all the reproducibility issues of jars enumerated in this wiki page.

However... another +1 to this issue raised by Bernhard and Julien. One 
of the biggest practical hurdles in working with Maven comes before any 
of that: there's no clear separation of "download time" vs "resolve 
time" vs "build time".


Maven seems to intermix downloads and execution operations fairly freely 
(e.g. plugin download, now plugin eval, now dep download -- download and 
execution are interleaved).  This makes it very, very difficult to 
ensure all the needed dependencies can be identified and downloaded (and 
saved locally) in advance.


Some distributions and build environments prefer to completely disable 
the network during builds in order to make certain that there aren't 
uncaptured information sources or dependencies being downloaded at build 
time -- in order to make rigorously sure we satisfy our core definition 
of reproducible: "given the same source code, [and] build environment".  
I'd love to work on making Maven as compatible with this goal as possible.


Even some features for more explicit/pre-build-phase dependency 
enumeration would be a big help in this area. I chatted with some other 
Maven enthusiastic folk at our last summit, and while we found ways to 
instruct Maven to yield a list of resolved dependencies, this still 
didn't cover a lot of critical ground: the output was human-readable, 
but not very easily machine-parsable; and if I recall correctly it 
covered dependencies but not plugins, making it somewhat incomplete.  An 
API for these operations would be incredibly useful.  (And then ideally, 
perhaps we'd like a way to take our resolved list of dependencies and 
automatically write out a new pom file with either those fixed versions 
or a fixed reference to everything needed to perform an identical 
resolution process offline in the future; but that's a next step.  
Sounds like Guix has a tool for that; it'd be nice if such a tool was in 
mainline Maven itself.)
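To illustrate the parsing pain: here's a hedged sketch of turning
`mvn dependency:list` output into machine-readable records. The
groupId:artifactId:type:version:scope line shape is how I recall Maven
printing it; some coordinates carry an extra classifier field, and the exact
format may vary by Maven version, so verify against yours.

```python
# Hedged sketch: scrape `mvn dependency:list` output into records.
# Assumes the groupId:artifactId:type:version:scope line format; lines
# that don't match (log chatter, headers) are simply skipped.

def parse_dependency_list(output: str) -> list:
    deps = []
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("[INFO]"):
            line = line[len("[INFO]"):].strip()
        parts = line.split(":")
        if len(parts) == 5:                 # ignore non-coordinate lines
            group, artifact, typ, version, scope = parts
            deps.append({"group": group, "artifact": artifact,
                         "type": typ, "version": version, "scope": scope})
    return deps

sample = ("[INFO] The following files have been resolved:\n"
          "[INFO]    org.apache.maven:maven-core:jar:3.8.1:compile\n")
assert parse_dependency_list(sample) == [
    {"group": "org.apache.maven", "artifact": "maven-core",
     "type": "jar", "version": "3.8.1", "scope": "compile"}]
```

Which is exactly the point: scraping log output like this is fragile, and a
real machine-readable API (covering plugins too) would make it unnecessary.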


Of course if I'm misspeaking and there are more features for dependency 
enumeration and separating download/resolve/build phases -- I love being 
wrong :) -- then this whole email can instead be: I'd love to round up 
some documentation about these features and add it to these wiki pages 
about reproducibility :)


---

https://github.com/signalapp/gradle-witness might be interesting in 
relation to this topic.  It is a Gradle plugin to add hash checks to 
downloads.


It ran into a few issues that seem likely to arise again:

- It's very opt-in; you can't apply it to a project without modifying 
the pom^H^H^H build.gradle file, and this limits its usefulness to folk 
from the distro perspective


- As the readme mentions, it has something of a bootstrapping problem 
(it can't fetch *itself* by hash...)


- IIUC, it doesn't work for Maven/Gradle plugins, only for the project 
dependencies... which means it's not a complete coverage of the build 
environment.


- It only applies the checks to dependencies listed in the 
configuration; if transitive resolution somehow adds a new 
dependency, it goes unchecked (and this does come up: for example, 
if building on a different architecture, the dependency resolution 
may yield different results *even when* all versions are pinned), 
and so again, it's not complete coverage.


In general, the lesson here seems to be that when trying to get a 
complete view of the sources and build environment, tools built into the 
core can really shine a lot brighter; when trying to do it via 
plugins, then things like (ironically) plugins seem to end up very 
difficult to handle.


---

Cheers!  Very excited for the gathering of effort.

