Re: mitigating non-determinism

2024-06-18 Thread John Gilmore
"Bernhard M. Wiedemann via rb-general" wrote:
> ASLR:
> Influences from address-space-layout-randomization(ASLR) can be avoided 
> with setarch -R COMMAND or globally with echo 0 > 
> /proc/sys/kernel/randomize_va_space . This also helps with some cases of 
> uninitialized memory.

Anytime we find programs using uninitialized memory, we should debug
them, not change the build environment to make them seem OK.

(I found a couple of bugs like this in the GNU assembler back in the
1990s, that produced different instruction sequences based on reading an
uninitialized variable.  This hadn't been noticed before testing
reproducibility, because all the sequences were valid instructions.  I
think the bug was in picking long or short offsets, perhaps in jump
instructions.)
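
These days this class of bug can be flushed out directly instead of
masked.  A sketch, assuming clang with MemorySanitizer or valgrind is
available (the file names are hypothetical):

  $ clang -g -fsanitize=memory -o gas-test gas-test.c
  $ ./gas-test input.s          # MSan aborts on the first uninitialized read
  $ valgrind --track-origins=yes ./gas-test input.s
                                # or: report where the garbage bytes came from

Either run pinpoints the read long before it surfaces as an
irreproducible instruction sequence.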

John



Re: Reproducible Builds in May 2024: Missing paper link

2024-06-11 Thread John Gilmore
Chris Lamb  wrote:
> Secondly, Ludovic Courtès, Timothy Sample, Simon Tournier and Stefano
> Zacchiroli have collaborated to publish a paper on "Source Code
> Archiving to the Rescue of Reproducible Deployment" [42]. Their paper
> was motivated because:
> 
> > The ability to verify research results and to experiment with
> > methodologies are core tenets of science. As research results are
> > increasingly the outcome of computational processes, software plays
> > a central role. GNU Guix [43] is a software deployment tool that
> > supports reproducible software deployment, making it a foundation
> > for computational research workflows. To achieve reproducibility, we
> > must first ensure the source code of software packages Guix deploys
> > remains available.
> 
> A PDF of this article [44] is also available.
> 
>  [42] https://hal.science/hal-04586520
>  [43] https://guix.gnu.org/
>  [44] https://hal.science/hal-04582287/document

Those links 42 and 44 do not lead to the cited paper.  They lead to
the first paper discussed (which apparently appears twice in hal.science).

John


[no subject]

2024-05-17 Thread John Gilmore
FreeBSD 14.1-BETA2 Now Available
https://lists.freebsd.org/archives/freebsd-stable/2024-May/002133.html

"A summary of changes since BETA1 includes: ...
Kernels are now built reproducibly."

Yay!

John



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-04-02 Thread John Gilmore
James Addison wrote that local storage can contain errors.  I agree.

> My guess is that we could get into near-unsolvable philosophical territory
> along this path, but I think it's worth being skeptical of the notions that
> local-storage is always trustworthy and that the network should always be
> avoided.

For me, the distinction is that the local storage is under the direct
control of the person trying to rebuild, while the network and the
servers elsewhere in the network are not.  If local storage is
unreliable, you can fix or replace it, and continue with your work.

I am looking for reproducibility that is completely doable by the person
trying to do it, at any time after they obtain a limited number of
key items by any means: the bootable binary of the OS release, and what
the GPL calls the "Corresponding Source".

And, I am very happy to be seeing lots of incremental progress along the way!

John

PS: I have a local archive of the source ISO images and the binary ISO
images of many Ubuntu, Fedora, Debian, BSD, etc releases.  It all fits
easily on a single hard disk drive, and that drive has many backups from
different times.  The images all have checksums that were checked when I
obtained the images.  The checksums are in the backups, so I can see if
my copies were tampered with or merely suffered from storage degradation
over time.
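
The periodic re-check is a one-liner per directory; a sketch, assuming
each release directory keeps the SHA256SUMS manifest it shipped with:

  $ cd archive/ubuntu-22.04/
  $ sha256sum -c SHA256SUMS    # flags any image that no longer matches

Comparing a failing image against the copies in the backups is how I'd
tell tampering from storage degradation.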

And I can easily copy the whole thing and send you a copy, if you want
one; or put it on the Internet (some of the releases are available from
me now via BitTorrent).  If those distros were reproducible, I could
verify that each of those binary releases had not been tampered with.
Or YOU could, without my help, after you got a copy from me or from
anyone.  And if you suspected a Ken Thompson-style attack on the
binaries, you could use those releases
locally at your site, as the source material for an arbitrarily intense
diverse double-compilation check.  Without my help, and without the help
of anyone else on the Internet.

In short, making a local archive of reproducible binaries and their
corresponding sources readily enables all the verifications that we are
trying to make common in the world.



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-29 Thread John Gilmore
kpcyrd  wrote:
> 1) There's currently no way to tell if a package can be built offline 
> (without trying yourself).

Packages that can't be built offline are not reproducible, by
definition.  They depend on outside events and circumstances
in order for a third party to reproduce them successfully.

So, fixing that in each package would be a prerequisite to making a
reproducible Arch distro (in my opinion).

I don't understand why a "source tree" would store a checksum of a
source tarball or source file, rather than storing the actual source
tarball or source file.  You can't compile a checksum.

kpcyrd  wrote:
> Specifically Gentoo and OpenBSD Ports have solutions for this that I 
> really like, they store a generated list of URLs along with a 
> cryptographic checksum in a separate file, which includes crates 
> referenced in e.g. a project's Cargo.lock.

I don't know what a crate or a Cargo.lock is, but rather than fix the
problem at its source (include the source files), you propose to add
another complex circumvention alongside the existing package building
infrastructure?  What is the advantage of that over merely doing the
"cargo fetch" early rather than late and putting all the resulting
source files into the Arch source package?
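
From the cargo documentation (I have not used these commands myself),
"fetch early" would look something like this at source-package creation
time:

  $ cargo fetch             # download every crate listed in Cargo.lock
  $ cargo vendor vendor/    # copy those crates' sources into ./vendor
  # then include ./vendor in the Arch source package

After that, rebuilding needs no network at all.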

> 3) All of this doesn't take BUILDINFO files into account

The BUILDINFO files are part of the source distribution needed
to reproduce the binary distribution.  So they would go on the
source ISO image.

> I did some digging and downloaded the buildinfo files for each package 
> that is present in the archlinux-2024.03.01 iso

Thank you for doing that digging!

>   Using plenty of different gcc versions looks 
> annoying, but is only an issue for bootstrapping, not for reproducible 
> builds (as long as everything is fully documented).

I agree that it's annoying.  It compounds the complexity of reproducing
the build.  Does Arch get some benefit from doing so?

Ideally, a binary release ISO would be built with a single set of
compiler tools.  Why is Arch using a dozen compiler versions?  Just to
avoid rebuilding binary packages once the binary release's engineers
decide what compiler is going to be this release's gold-standard
compiler?  (E.g. The one that gets installed when the user runs pacman
to install gcc.)  Or do the release-engineers never actually standardize
on a compiler -- perhaps new ones get thrown onto some server whenever
someone likes, and suddenly all the users who install a compiler just
start using that one?

It currently seems that there is no guarantee that on day X, if you
install gcc on Arch (from the Internet) and on the same day you pull in
the source code of pacman package Y, it will even build with the day-X
version of gcc.  Is that true?

John



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-28 Thread John Gilmore
John Gilmore  wrote:
> It seems to me that the next step in making the Arch release ISOs
> reproducible is to have the Arch release engineering team create a
> source-code release ISO that matches each binary release ISO.  Then you
> (or anyone) could test the reproducibility of the release by having
> merely those two ISO images and a bare amd64 computer (without even an
> Internet connection).

kpcyrd  wrote:
> I think this falls under "bootstrappable builds", a bare amd64 computer 
> still needs something to boot into (a CD with only source code won't do 
> the trick).

Bootstrappable builds are a different thing.  Worthwhile, but not
what I was asking for.  I just wanted provable reproducibility from two
ISO images and nothing more.

I was asking that a bare amd64 be able to boot from an Arch Linux
*binary* ISO image.  And then be fed a matching Arch Linux *source* ISO
image.  And that the scripts in the source image would be able to
reproduce the binary image from its source code, running the binaries
(like the kernel, shell, and compiler) from the binary ISO image to do
the rebuilds (without Internet access).
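
In outline, with hypothetical script and file names:

  # after booting the binary ISO:
  $ mount /dev/sr1 /mnt/source                   # the matching source ISO
  $ /mnt/source/rebuild-world /build             # recompile everything, offline
  $ sha256sum -c /mnt/source/binary-iso.sha256   # compare with the booted binary ISO

Everything to the right of those prompts would come from the two ISOs
themselves.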

This should be much simpler than doing a bootstrap from bare metal
*without* a binary ISO image.

And if your source/binary ISO images can do that, it's not just an
academic exercise in reproducibility.  It can also produce a new binary
ISO that is built from that source ISO plus a few patches (e.g. for
fixing security issues).  Or, it can "recompile-the-world" after you (or
any user) makes a small change to a kernel, include file, library, or
compiler -- and show exactly how many programs compile to something
*different* as a result.  Basically, that pair of ISOs becomes a seed
that can carry forward, or fork, the whole distribution.  For anybody
who receives them.  That is the promise of free software, but the
complexity of modern distros plus the convenience of ubiquitous
Internet have inadvertently tended to undermine that promise.  Until
the reproducible builds effort!

If someday an Electromagnetic Pulse weapon destroys all the running
computers, we'd like to bootstrap the whole industry up again, without
breadboarding 8-bit micros and manually toggling in programs.  Instead,
a chip foundry can take these two ISOs and a bare laptop out of a locked
fire-safe, reboot the (Arch Linux) world from them, and then use that
Linux machine to control the chip-making and chip-testing machines that
can make more high-function chips.  (This would depend on the
chip-makers keeping good offline fireproof backups of their own
application software -- but even if they had that, they can't reboot and
maintain the chip foundry without working source code for their
controller's OS.)

John



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-22 Thread John Gilmore
Congratulations on closing in toward Arch Linux reproducibility!!!

kpcyrd  wrote:
> Specifically what I mean - given a line like this:
> 
> FROM
> archlinux@sha256:2dbd72d1e5510e047db7f441bf9069e9c53391b87e04e5bee3f379cd03cec060
> 
> I want to reproduce the artifact(s) that are pulled in by this, with
> the packages our Arch Linux rebuilders have reproduced from source
> code. From what I understand this hash points to a json manifest that
> is not contained in the container image itself and was generated by
> the registry (should we archive them?), and this manifest then points
> to the sha256 of the tar containing the filesystem (I'm possibly
> missing an indirection here).
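
That indirection can be walked with a generic registry client; it is
registry plumbing rather than anything Arch-specific.  A sketch, assuming
skopeo is installed:

  $ skopeo inspect --raw \
    docker://docker.io/library/archlinux@sha256:2dbd72d1e5510e047db7f441bf9069e9c53391b87e04e5bee3f379cd03cec060
  # prints the registry-side JSON manifest; its .layers[].digest fields
  # are the sha256s of the filesystem tar(s).  For multi-arch images the
  # first answer is an index -- one more level of indirection.

Archiving that JSON next to the image would remove the dependency on
the registry.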

I have no experience with Arch -- am just reading what's on their
website.  From a quick glance at their docs, the Arch distribution
*only* distributes binary packages.  They only offer URLs for source
code, requiring that users depend on a working Internet connection and
what could be a large, arbitrary set of HTTPS servers that in theory
contain the matching source code.  See:

  https://wiki.archlinux.org/title/Arch_build_system

(I'm not sure how that even meets the requirements of the GPL for
binary distributors to make the matching source code available to
recipients of the binaries.)

It seems to me that the next step in making the Arch release ISOs
reproducible is to have the Arch release engineering team create a
source-code release ISO that matches each binary release ISO.  Then you
(or anyone) could test the reproducibility of the release by having
merely those two ISO images and a bare amd64 computer (without even an
Internet connection).  (Someone other than their releng team could do
this shortly after the binary release, hoping that none of the URLs
becomes inaccessible in the meantime.  But the right time to gather the
full source code for reproducibility is when they themselves pull in the
source code to BUILD those binary packages that they will put in their
release ISO.)

Making users reproduce an ISO full of binary packages by downloading the
sources from all over the Internet seems highly prone to fail -- in the
first few months, let alone five or ten years later.

Even Arch's binary releases are only available from Arch for three
(monthly) release cycles.  Then you're on your own if you want to find a
copy of what they released, like the one that was current last
Christmas.  See:

  https://archlinux.org/releng/releases/

Arch may do great release engineering (I hope they do!), but it's
apparently not *archival* release engineering.

John


Re: Two questions about build-path reproducibility in Debian

2024-03-05 Thread John Gilmore
Thanks, everyone, for your contributions to this discussion.

A quick note:
Vagrant Cascadian  wrote:
> It would be pretty impractical, at least for Debian tests, to test
> without SOURCE_DATE_EPOCH, as dpkg will set SOURCE_DATE_EPOCH from
> debian/changelog for quite a few years now.

Making a small patch to the local dpkg to alter or remove the value of
SOURCE_DATE_EPOCH, then trying to reproduce all the packages from source
using that version of dpkg, would tell you which of them (newly) fail to
reproduce because they depend on SOURCE_DATE_EPOCH.
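
You might not even need to patch dpkg: as I understand it,
dpkg-buildpackage only sets SOURCE_DATE_EPOCH when the variable is not
already present in the environment.  A sketch of the per-package
experiment, with a hypothetical package name:

  $ SOURCE_DATE_EPOCH=0 dpkg-buildpackage -us -uc -b
  $ cp ../foo_1.2-3_amd64.deb /tmp/epoch0.deb
  $ SOURCE_DATE_EPOCH=1 dpkg-buildpackage -us -uc -b
  $ cmp /tmp/epoch0.deb ../foo_1.2-3_amd64.deb   # any difference marks the dependency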

> Sounds like an interesting project for someone with significant spare
> time and computing resources to take on!

It looks to me like the whole Ubuntu source code (that gets into the
standard release) fits in about 25 GB.  The Debian 12.0.0 release
sources fit in 83 GB (19 DVD images).  Both of these are under 1% of a
10TB disk drive that runs about $200.  A recent Ryzen mini-desktop,
with a 0.5TB SSD that could cache it all, costs about $300.  Is this
significant computing resources?  For another $40 we could add a better
heat sink and a USB fan.  How many days would recompiling a whole
release take on this $540 worth of hardware?

(I agree that the "spare" time to set it up and configure the build
would be the hard part.  This is why I advocate for writing and
releasing, directly in the source release DVDs, the tools that would
automate the recompilation and binary comparison.  The end user should
be able to boot the matching binary release DVD, download or copy in the
source DVD images, and type "reproduce-release".)

John



Re: Two questions about build-path reproducibility in Debian

2024-03-05 Thread John Gilmore
>> But today, if you're building an executable for others, it's common to build 
>> using a
>> container/chroot or similar that makes it easy to implement "must compile 
>> with these paths",
>> while *fixing* this is often a lot of work.

I know that my opinion is not popular, but let me try again before we lay
this decision to rest.

By avoiding fixing directory dependencies, you can move the complexity
around, but in doing so you don't reduce it.

Our instructions for reproducing any package would have to identify what
container/chroot/namespace/whatever the end-user must set up to be able
to successfully reproduce a package.  Will these be the same for every
package, for every distro, and for every other environment in which we
want to inspire reproducibility?  Do we need to add those constraints to
the Linux Foundation's Filesystem Hierarchy Standard?  Do we need to add
them to the buildinfo files?

Ideally the tools that ordinary people traditionally use to rebuild a
package, such as dpkg-buildpackage or rpmbuild, will have been improved to
do the container/chroot setup automatically.  Otherwise, naive users
will have to figure out what a container is or why it is necessary for
them to grok this obscure environmental thing in order to tell if their
binary package was tampered with or not.  Will they always have to build
software as root, because chroot doesn't and can't work for ordinary users?

If we punt this, there will be an ongoing flow of "my package doesn't
build to the same binary, somebody must be 0wning me" emails from people
who do the obvious thing like type "make" and "cmp".  Do we want
successful reproducibility to depend on setting up servers and virtual
machines and web-servers and databases and build farms and CI-queues and
such?  Yes, to reproduce a whole distro, reproducibility has to WORK
there, but does it have to DEPEND on that complex infrastructure?

I'm an old Unix guy and so are millions of end-users and sysadmins.
Containers are a recent Linux thing.  Namespaces ditto.  I still have
never found a use for containers; I tried using Docker for something and
was bemused to discover that it could calculate all kinds of stuff, but
none of the output of the calculation could come back into my ordinary
Linux filesystem (without some kind of obscure per-invocation JCL-like
configuration setup), so I stopped trying to use it.  Another time, I
tried booting an on-disk, installed copy of Ubuntu inside a virtual
machine, so I could keep running an older service that's hard to port
forward, while migrating the rest of my machine to a newer Ubuntu
release.  VM/370 could do that decades ago, but I discovered that that
use-case is not well supported in the Linux vm tools and documentation,
so I gave up on that too.  There are more things in heaven and earth,
Horatio, than spending all of your time doing sysadmin.  These
newfangled tools are just not as well rounded as the stuff that's been
well understood in Unix since the 1970s or 1980s, like "directories".
If only seventeen experts in the world can figure out if a package has
been tampered with, we will have labored mightily but not done much to
improve computer security.

Also recall what pains the full-source bootstrap people are having to go
through after some imho foolish decisions were made about depending on
modern C++ features inside core tools like gcc and gdb.  Reproducible
builds should make the underlying software LESS dependent on the
particular configuration of the build environment; that's kind of the
point.
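
One concrete example of removing a path dependency rather than pinning
the path: current gcc can map the build directory out of the debug info
and __FILE__ strings it embeds, so the same source (copied into any two
directories) compiles identically from either:

  $ (cd /tmp/a && gcc -g -ffile-prefix-map=/tmp/a=. -c foo.c)
  $ (cd /tmp/b && gcc -g -ffile-prefix-map=/tmp/b=. -c foo.c)
  $ cmp /tmp/a/foo.o /tmp/b/foo.o   # identical -- and no container required

That is the kind of fix I am arguing for.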

>>>  ... it makes reproducibility from around 80-85% of all
>>> packages to >95%, IOW with this shortcut we can have meaningful
>>> reproducibility *many years* sooner, than without.

If we move the goal posts in order to claim victory, who are we fooling
but ourselves?  I'd rather that we knew and documented that 57% of
packages are absolutely reproducible, 23% require SOURCE_DATE_EPOCH, and
12% still require a standardized source code directory, than to claim
all 95% are "meaningfully reproducible" today.

John



Re: Two questions about build-path reproducibility in Debian

2024-03-04 Thread John Gilmore
Vagrant Cascadian wrote:
> > > to make it easier to debug other issues, although deprioritizing them
> > > makes sense, given buildd.debian.org now normalizes them.

James Addison via rb-general  wrote:
> Ok, thank you both.  A number of these bugs are currently recorded at severity
> level 'normal'; unless told not to, I'll spend some time to double-check their
> details and - assuming all looks OK - will bulk downgrade them to 'wishlist'
> severity a week or so from now.

I may be confused about this.  These bug reports say that a package
cannot be reproducibly built because its output binary depends on the
directory in which it was built?

Why would these become "wishlist" bugs, as opposed to actual
reproducibility bugs that deserve fixing, just because one server at
Debian no longer triggers the bug (since it always uses the same build
directory)?

If an end user can't download a source package (into any directory on
any machine) and build it into the exact same binary as the one that Debian
ships, this is not a "wishlist" idea for some future enhancement.  This
is a real issue that prevents the code from being reproducible.

How am I confused?

John



Re: Irregular status update about reproducible live-build ISO images

2024-02-29 Thread John Gilmore
Roland, thank you for your ongoing work and reporting to make Debian 
reproducible!

One question:
> * Last month a question was raised, whether the distributed sources
> are sufficient to rebuild the images. The answer is: probably yes, but
> I haven't tried.
> The chain is: source code --compiler--> executable files --debian
> packaging--> .deb archives --live-build--> live images
> I've focused on the last section of this chain; the installation of
> the .deb archives into the live images.

Thank you for focusing on the last part of the chain.  You are very, very close
there!  I am wondering if there is any low-hanging fruit anywhere else in the
chain, that you may have the expertise and time to address.

For example, how does the live-build process decide which binary .deb
archives are selected for inclusion in the live image?  Are these lists
or criteria stored in the source code archives?  If not, can they be put
into the source code archives?

Similarly, are there any other inputs to the live-build process?  Perhaps a
template of a binary ISO image?  Or a binary program, run during the
live-build process, that creates a prototype ISO image?  I note that when running
jigdo-lite to reproduce a live image, not only is there a set of .deb's that
are copied in, but also a .template file which has the portions of the image
that don't directly come from a .deb file.  Is there an equivalent template
in the live-build process, or where do these nonzero and non-.deb parts of the
resulting live-image come from?  Is there full source code for those?

Also, is there an easy way to start from the set of binary .deb files to
be included in an image, and from each one, produce a list of the source
files (.tar.gz's, Debian control files, patches, etc) that were used to
create it?  If so, you could create a master list of all the source files
that were used to create a particular live-image.  And an automated process
could compare that list of source files to the contents of the matching
"Sources" DVD image, to ensure that all of the required source files are
actually included in the "matching source" DVD.
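
A sketch of the per-.deb step, using only stock tools (corner cases such
as renamed source packages are glossed over):

  $ dpkg-deb -f some.deb Package Source Version   # which source produced this?
  $ apt-get source --download-only foo=1.2-3      # fetch its .dsc, tarball, patches

Looping that over the image's package list would yield the master list
to compare against the Sources DVD.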

When a rebuilt image differs in some small way from the original, what
tools do you use to determine what files the differences are in, and
why?  Are these tools to compare a live-image with a rebuilt-live-image
also in the Debian source tree and in the Debian source DVDs?

Being able to do any of these things, and correct any lapses now, before
the next official Debian release, will enable you or anyone to complete
the ultimate job of proving that a source DVD plus a live DVD can fully
reproduce the official live DVD, without access to any network
resources.  (And thus that a live DVD, a source DVD, plus a small set of
patches can verifiably produce a live DVD that includes only the changes
made in that set of patches, and no others.)

Thanks again!

John


Re: Please review the draft for December's report

2024-01-11 Thread John Gilmore
https://reproducible-builds.org/reports/2023-12/

  "Reproducible Builds in December 2023

   Welcome to the November 2023 report..."

It seems better to NOT reproduce the previous month's header quite so
accurately.  ;-/

John


Priority claim re bootstrapping

2023-11-12 Thread John Gilmore
I congratulate the Guix bootstrap team on their continuing progress on
reproducibility.  Yet, there is some controversy over one statement made
in their blog, claiming priority for building:

  a package graph of more than 22,000 nodes rooted in a 357-byte program

in the first paragraph of:

  https://guix.gnu.org/en/blog/2023/the-full-source-bootstrap-building-from-source-all-the-way-down/

Claims of priority in innovation have some importance to history.  You
can see the level of controversy created by a latecomer who claimed that
he "invented email" here:

  https://www.emailonacid.com/blog/article/industry-news/who-really-invented-email/
  https://en.wikipedia.org/wiki/Shiva_Ayyadurai#EMAIL_invention_controversy

And there's the key role that priority has in whether inventions are
patentable and by whom, which has led to huge financial implications;
such as:

  https://en.wikipedia.org/wiki/Apple_Inc._v._Samsung_Electronics_Co.

Email and smartphones have become pervasive in the decades since their
invention in small niches, making such claims actually important to the
world.  We hope and work toward our own area of invention,
reproducibility of software, becoming equally pervasive in the decades
to follow.

Therefore, in a community of technical experts, when a difference of
opinion comes up on the priority of a useful invention, it is useful for
the issue to be examined in more detail, among people who "were there at
the time".  Such an examination tends to bring out more facts, which
become very useful to later observers who weren't there and have to
figure out what may have happened from scanty documentation.

In many such cases, different smaller innovations were often made by a
variety of people or teams.  Perhaps only one or none of the claimants
can credibly support the statement that they made the "key" invention in
that area, but many may be able to share some level of public credit
for their work.

(As another example, when I claimed years ago on this list that Cygnus
made the GNU compiler tools reproducible in the 1990s, some people
called bullshit on me -- until I provided much more detail linked to the
published source trees and early releases involved, as well as copies of
internal company emails and marketing announcements on the topic.)

Personally I don't have a stake in which claims about "bootstrapping a
large collection of source code programs from a small binary seed" hold
up to detailed historical and technical examination or not.  It seems to
me that cross-bootstrapping from a more mature software architecture has
for decades been the norm, even in the creation of the original UNIX.
(As UNIX was then used to cross-bootstrap the GNU software at FSF and at
Cygnus many years later.)  Toggling in binary seed programs was unusual
even in the first (8-bit) microprocessors able to self-host in the early
1970s, such as in building Bill Gates's first 8080 BASIC interpreter,
Steve Wozniak's ROMs for the 6502-based Apple II, let alone the
68000-based SUN boards built at Stanford in the 1980s.  Didn't IBM
cross-build from the 7094 for the first IBM 360 tools even in the early
1960s?  But before dismissing the idea out of hand, let's see what facts
we can turn up, and what new innovations can be produced there to
improve supply chain integrity.

I do think the topic is a suitable one for the Reproducible Builds
community to discuss.  Politely conducted disputes should not be
dismissed as "nonsense" with a suggestion that the parties unsubscribe
from the list.  Inflating the emotional tone of the discussion is not
constructive toward the community discovering whatever contemporaneous
truths may be findable behind the various claims.

Thank you for listening.

John Gilmore


Re: Reproducibility terminology/definitions

2023-11-08 Thread John Gilmore
Pol Dellaiera  wrote:
> To that end, I'm currently drafting a formal definition of
> reproducibility that I hope to contribute. However, before I proceed
> further, I would like to know whether any of you have already worked
> on formulating such a definition.

Here are a few emails (from prior R-B discussions) that go into how
"reproducibility" might be formally defined and verified.  I'm sure
that many other inputs would also be useful; these are just two that
I could recall and easily find.

The final series of emails below describes how Cygnus made the GNU
cross-compiler tools "reproducible" in the early 1990s -- in the sense
that if you cross-compiled the same source code on any of 9 hosting
platforms, the tools would produce the exact bit-for-bit identical
binaries for the target platform.  (E.g. a Mac, a Windows machine, and a
SPARC Solaris machine were verified to cross-compile the "GNU make"
source code into an identical "gmake" binary, that was targeted to run
on a Motorola 68000 running SunOS.)  This reproducibility also included
being able to cross-build identical binaries for all the GNU compiler
tools themselves (as well as for all of our hundreds of compiler test
cases).  The several man-years of engineering required to stamp out all
the bugs that made the compilers irreproducible were a prerequisite to
today's efforts to make whole Linux distributions (that are compiled by
those GNU tools) reproducible.
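
In modern configure terms the check looked roughly like this; the
triplets are illustrative, and the 1990s spellings differed:

  # on a SPARC Solaris machine, and again on each of the other build hosts:
  $ ./configure --build=sparc-sun-solaris2 --host=m68k-sun-sunos4
  $ make
  # then compare the gmake binaries produced on any two build hosts:
  $ cmp gmake.built-on-solaris gmake.built-on-mac   # bit-for-bit identical

Any difference meant another host-dependence bug to stamp out.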

John

From: John Gilmore
Date: Mon, 28 Jan 2019 23:18:43 -0800
Subject: Re: [rb-general] Definition of "reproducible build"
To: General discussions about reproducible builds

Ludovic Courtès wrote:
> I agree that insisting on provenance is crucial.  Dockerfiles (and similar)
> are often viewed as “source”, but they really aren’t source: the actual
> source would come with the distros they refer to (Debian, pip, etc.)
> Those distros might in turn refer to external pre-built binaries, though,
> such as “bootstrap binaries” for compilers (Rust, OpenJDK, and so on.)

I propose a definition for whether a bootable OS distro is reproducible.
(If what you're building is not a whole distro that can self-compile,
this definition doesn't apply.)

Our initial goal would be to produce a bootable binary release (DVD or
USB stick) and a source release (ditto).  The source release would
include the script that allows the binary release to recompile the
source release to a new binary release that ends up bit-for-bit
identical.  Such a binary/source release pair would be called
"reproducible".

That's useful: If you have to fix a bug in it, you can make the mods you
need in the source tree, rebuild the world, and out will come a release
with just that one change in the binaries, verifiably identical except
where it matters.  And developers can use such a release to detect what
changes matter to whom, such as: when you alter a system include file,
which binaries change?

During development, the code would be built by some earlier release's
tools, built piecemeal, etc, like current build processes do.  Anytime
before release, the developers can test whether a draft source release
builds into a binary release that itself can build the sources into the
same binary release.  And fix any discrepancies, ideally long before
release.

This is similar to what GCC does to test itself, or what Cygnus did to
test the whole toolchain for cross-compiling.  But applied to the
entire OS release.

Such a paired source/binary release doesn't require a chain of
provenance of earlier binary software, particularly if people can
demonstrate bootstrapping it using several different earlier compiler
toolchains, still producing the same binaries.  You can bootstrap
it with itself.

The separate efforts to minimize the amount of binary code we have to
trust to do a rebuild are laudable and fascinating.  Keep going!  But we
shouldn't require whole distros to do that yet.  We haven't even
accomplished a basic paired binary/source reproducible release yet, for
any major release -- or have we?

John

PS: For extra points, the binary release should be able to cross-compile
its source release into a binary release for each other supported
platform, reproducibly.  And those other-platform binary releases should
cross-compile the source release back bit-for-bit into the same binary
release you started with.

Re: Introducing: Semantically reproducible builds

2023-05-29 Thread John Gilmore
David A. Wheeler  wrote:

> Please don't view the text above as opposing reproducible builds.
> I think reproducible builds are the gold standard for countering subverted 
> builds, and I will continue to encourage them.
> But when you can't get them (e.g., because you don't have time to patch every 
> program
> in the universe or the builders won't make changes to their build process),
> it's useful to look for some *workable* backoff alternatives. The backoffs 
> may not give
> you all you wanted, but they can at least help users focus on their biggest 
> risks first.

To the extent that the text causes the public to be confused about what
reproducibility means, that text *will* oppose reproducible builds.

Can you call packages that aren't reproducible because the maintainers
insist on keeping timestamps, temp file names, etc. in the binaries
(or whose maintainers simply don't care) "irreproducible" rather than
"semantically reproducible"?  That would be much clearer.

John



Re: Sphinx: localisation changes / reproducibility

2023-04-17 Thread John Gilmore
James Addison  wrote:
> When the goal is to build the software as it was available to the
> author at the time of code commit/check-in - and I think that that is
> a valid use case - then that makes sense.

I think of the goal as being less related to the author, and more
related to the creator of a widespread binary release (such as a Linux
distribution, or an app that goes into an app-store).

The goal is then that the recipient of that binary release can verify
that the source code they obtained from the same place is able to
rebuild that exact widespread binary release.  This proves that the
source code can be trusted for some purposes, such as being used to read
it to understand what the binary does.  Or to make small bug-fixes to it.
Or to become the base for further evolution of the project if the
maintainer is suddenly "hit by a bus" and stops making further releases.

James Addison  wrote:
> Inverting the question somewhat: if a single source-base is rebuilt
> using two different SOURCE_DATE_EPOCH values (let's say, 1970-01-01
> and 2023-04-18), then what are expected/valid differences in the
> resulting output?

In the ideal circumstances, the resulting output would be identical,
because the build process would have no dependencies on
SOURCE_DATE_EPOCH.  In these ideal circumstances, the code is "portable",
in the same sense that people understand "portable" code will build and
run the same on an ARM running MacOS as it does on an x86 running
Windows.  There are many ways to make code portable, but the most robust
of them is to *eliminate* dependencies.

A more fragile way would be to #ifdef your code to adjust for every
supported build or run environment.  That fragile way breaks as soon as
it needs to build or run in a new environment, whereas the robust way
has already made it likely to "just work" in a new environment that it
has never encountered before (or to have only one or two minor things
that need adjusting).  Note that if it built fine in a Linux system
version X, then a later Linux system version Y is a "new environment"
and might break the code.  The robust version is again less likely to
break, because it inherently, by design, cares less about the nitty
gritty details of its environment.

Much code in Linux does not reach that ideal (yet!).  Instead, builds of
non-ideal code use SOURCE_DATE_EPOCH as a crutch to limit their
dependencies on the local build environment, replacing those
dependencies with a dependency on SOURCE_DATE_EPOCH.

So, if you rebuild a non-ideal package with two different values of
SOURCE_DATE_EPOCH, you will get two different binaries that differ in
the areas of dependency.  For example, if the documentation embeds a
build-date in its page footer, you'd expect every page of the built
documentation would differ.  If the "--version" output of the program
embeds the build date, then the code that produces that output would
differ.  Etc.  In fact, "fuzzing" their code with different values
of SOURCE_DATE_EPOCH can help a maintainer identify where those
dependencies still remain.
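
A sketch of that fuzzing with the Sphinx documentation discussed in this
thread, assuming a doc/ source tree and a Sphinx version that honors
SOURCE_DATE_EPOCH:

  $ SOURCE_DATE_EPOCH=0          sphinx-build -b html doc out0
  $ SOURCE_DATE_EPOCH=1700000000 sphinx-build -b html doc out1
  $ diff -r out0 out1            # every hit is a remaining date dependency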

We try to talk package authors out of such dependencies, but ultimately
it's their package and they make the architectural decisions.  To some
of them it's incredibly important that the build date appears in the
man-page.  Reproducibility usually features lower among their priorities
than it does in ours.

John



Re: Sphinx: localisation changes / reproducibility

2023-04-15 Thread John Gilmore
James Addison via rb-general  wrote:
>  In general, we should be able to
> pick two times, "s" and "t", s <= t, where "s" is the
> source-package-retrieval time, and "t" is the build-time, and using
> those, any two people should be able to create exactly the same
> (bit-for-bit) documentation.  I think that SOURCE_DATE_EPOCH generally
> refers to "t".

I think that SOURCE_DATE_EPOCH generally refers to the check-IN time of
each of the source package(s) being rebuilt.  You can retrieve the
packages anytime later than that, and you can do the build at any time
later, and SOURCE_DATE_EPOCH should not change (and the built binaries
and docs should also not change).
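
That is also how the SOURCE_DATE_EPOCH specification's examples derive
it, as I recall; e.g., from a git checkout:

  $ export SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct)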

John



Re: Three bytes in a zip file

2023-04-06 Thread John Gilmore
Larry Doolittle wrote:
> $ diff <(ls --full-time -u fab-ea2bb52c-ld) <(ls --full-time -u fab-ea2bb52c-mb)
> 22c22
> < -rw-r--r-- 1 redacted redacted  644661 2023-04-04 18:10:00.0 -0700 marble-ipc-d-356.txt
> ---
> > -rw-r--r-- 1 redacted redacted  644661 2023-04-06 00:25:03.0 -0700 marble-ipc-d-356.txt

So I'm guessing that even before the zip file is re-created, the rebuild
process is leaking the rebuild timestamp into the last-modified metadata
of the generated marble-ipc-d-356.txt file?  That seems like it should
be handled by the build process explicitly setting its timestamp to
something related to the last-source-code-checkin time (with "touch
--date=XXX") rather than to current time.

Truncating the timestamps to DOS timestamps wouldn't work to eliminate
this difference anyway, since the date in the two files is two days
different; DOS timestamps are accurate to 2 seconds, as I recall.

John



Re: Does diffoscope compares disk partitions

2023-03-01 Thread John Gilmore
>> So, overall, I actually don't think that diffoscope has the requested
>> support, and it's not "just" a bug of failed identification.

I have been surprised at how much effort has gone into "diffoscope" as a
total fraction of the Reproducible Builds effort.  Perhaps it is a case
akin to the drunk looking for his keys under the streetlight where he
can see, rather than in the dark where he dropped them.  (It's easier to
hack diffoscope than to hack thousands of irreproducible packages.)  I
for one am happy that diffoscope DOESN'T have support for umpteen disk
partitioning schemes and file system formats.

John

PS:  Has anyone on the list considered writing an article for the
Journal of Irreproducible Results about our effort?


Re: Call for real-world scenarios prevented by RB practices

2022-03-24 Thread John Gilmore
On 22/03/2022 13.46, Chris Lamb wrote:
> Just wondering if anyone on this list is aware of any real-world
> instances where RB practices have made a difference and flagged
> something legitimately "bad"?

The GNU compilers are already tested for complete reproducibility.  We
at Cygnus Support built that infrastructure back in the 1990s, when we
made gcc into a cross-compiler (compiling on any architecture + OS,
targeting any other).  We built the DejaGnu test harness, and some
compiler/assembler/linker test suites, that rebuilt not just our own
tools, but also a test suite with hundreds or thousands of programs.  We
compared their binaries until they were bit-for-bit identical when built
on many different host machines of different architectures.

To make it work, we had to fix many bugs and misfeatures, including even
some high-level design bugs, like object file formats that demanded a
timestamp (we decided that 0 was a fine timestamp).  A few of those bugs
involved generating different but working instruction sequences -- I
recall fixing one that depended on an uninitialized local variable.

We never found any malicious code in the GNU tools during that process,
just poorly debugged code and unportable code.  I don't know whether
that's because nobody malevolent actually knew what a lever they would
have had by infesting our code, or whether we really weren't as
important as we thought we were :-/.  I was still manually making and
reading the diff between the previous release and each new release, to
make sure that no change that I didn't recognize would slip through.  It
was a pretty heady feeling to make a GNU tool release, send an email to
info-gnu, and have thousands of people running it in the next few days.
We took the responsibility seriously.

(Caveat: We weren't shipping binaries, except to Cygnus customers.
 Maliciously patched binaries are what RB is designed to prevent.)

John



Re: [rb-general] Debian buster, 54% reproducible in practice (Re: Core Debian reproducibility: 57% and rising!)

2019-03-02 Thread John Gilmore
> Though without solving #894441 we cannot reach much higher than 80%
> (because 93% is the current theoretic maximum, of which we need to
> subtract 12% binNMUs...)

Even without solving the general binNMU problem, can't you make more
packages reproducible by eliminating those packages' dependencies on
SOURCE_DATE_EPOCH?

(I always thought that removing the date dependencies from the source
code was better than patching over them with this environment variable.
The challenge is when individual maintainers refuse to make their
packages date-indepedent, yet distros still want the packages to be
reproducible.  If you can convince a maintainer, you aren't stuck on
the horns of this dilemma.)

John

Re: [rb-general] Definition of "reproducible build"

2019-02-14 Thread John Gilmore
> I like the idea, however what you are proposing is basically a new
> distro/fork, where you would remove all unreproducible packages, as
> every distro still has some unreproducible bits.

I suggest going the other way -- produce a distro that is "80%
reproducible" from its source code USB stick and its binary boot USB
stick.  You'd already have the global reproducibility structure and
scripts written and working, even before the last packages are
individually reproducible.  That global reproducibility tech would be
immediately adoptable by any distro.  The output of the reproduction
scripts would be a bootable binary that does boot and run!  It would
still have differences from the "release master" bootable binary, but
those differences would be irrelevant to the functioning of the binary,
and would be clearly visible with "diff -r".

(For one thing, this would cause the distros to actually produce a
"source code USB stick image".  Currently most of them don't.  They
instead require you to download thousands of separate source packages or
tarballs, and have no scripts readily visible for building those into a
bootable binary image.)

After accomplishing that, then the focus could go on the 20% (or 10% or
whatever) of packages that aren't yet reproducible.  And, people making
small distros could cut out such packages to make a 100% reproducible
distro, as Holger suggested.

John


Re: [rb-general] Definition of "reproducible build"

2019-01-28 Thread John Gilmore
Ludovic Courtès wrote:
> I agree that insisting on provenance is crucial.  Dockerfiles (and similar)
> are often viewed as “source”, but they really aren’t source: the actual
> source would come with the distros they refer to (Debian, pip, etc.)
> Those distros might in turn refer to external pre-built binaries, though,
> such as “bootstrap binaries” for compilers (Rust, OpenJDK, and so on.)

I propose a definition for whether a bootable OS distro is reproducible.
(If what you're building is not a whole distro that can self-compile,
this definition doesn't apply.)

Our initial goal would be to produce a bootable binary release (DVD or
USB stick) and a source release (ditto).  The source release would
include the script that allows the binary release to recompile the
source release to a new binary release that ends up bit-for-bit
identical.  Such a binary/source release pair would be called
"reproducible".

That's useful: If you have to fix a bug in it, you can make the mods you
need in the source tree, rebuild the world, and out will come a release
with just that one change in the binaries, verifiably identical except
where it matters.  And developers can use such a release to detect what
changes matter to whom, such as: when you alter a system include file,
which binaries change?

During development, the code would be built by some earlier release's
tools, built piecemeal, etc, like current build processes do.  Anytime
before release, the developers can test whether a draft source release
builds into a binary release that itself can build the sources into the
same binary release.  And fix any discrepancies, ideally long before
release.

This is similar to what GCC does to test itself, or what Cygnus did to
test the whole toolchain for cross-compiling.  But applied to the
entire OS release.

Such a paired source/binary release doesn't require a chain of
provenance of earlier binary software, particularly if people can
demonstrate bootstrapping it using several different earlier compiler
toolchains, still producing the same binaries.  You can bootstrap
it with itself.

The separate efforts to minimize the amount of binary code we have to
trust to do a rebuild are laudable and fascinating.  Keep going!  But we
shouldn't require whole distros to do that yet.  We haven't even
accomplished a basic paired binary/source reproducible release yet, for
any major release -- or have we?

John

PS: For extra points, the binary release should be able to cross-compile
its source release into a binary release for each other supported
platform, reproducibly.  And those other-platform binary releases should
cross-compile the source release back bit-for-bit into the same binary
release you started with.


Re: [rb-general] Style Guide Updates

2019-01-18 Thread John Gilmore
> We tend to write Markdown, not HTML, after all so having copy-pastable
> snippets is less compelling to me, priority-wise. This also goes for
> the non-Javascript "story" but this is less interesting as the
> situation is somewhat-satisfactory right now.

Here's a vote for making the reproducible builds site completely
functional without Javascript.

What an insane idea it is that people can't read or interact with a web
site without granting permission to a random third party to run arbitrary
code in their local machine!

I browse with Javascript off all the time (using Firefox plugin
NoScript), and with EFF's Privacy Badger.  Result: Many (used to be
"most") websites work, but I never see ads and I don't get tracked by
the bezillion companies that are spying on me to sell my eyeballs for
their own benefit.  When the sites are arranged so I can't even click a
simple link without enabling Javascript, I generally skip further
interactions with them.

The amazing thing is that so many modern "web designers" don't even know
how HTML works and just assume that everybody has and needs Javascript.
And ditto for the tool builders who make the tools that these designers
learn instead of HTML.

John