Bug#865769: Second data package including some machine-readable data

2017-06-25 Thread Guillem Jover
On Sun, 2017-06-25 at 16:13:39 -0700, Russ Allbery wrote:
> Guillem Jover  writes:
> > On Sat, 2017-06-24 at 09:57:33 -0700, Russ Allbery wrote:
> >> - The list of archive sections and their descriptions
> 
> > I think this belongs on each archive providing those, alongside the
> > other archive metadata. And I'd rather see the involved parties
> > defining an appropriate file to provide so that any downloader which
> > has to fetch the metadata anyway would use it instead of hardcoding it.
> 
> > Using a file from policy does not seem useful to me, because it would
> > mean software would need to depend on such policy provided package,
> > and if you are going to mix and match repos, you really need the
> > metadata from the archive you are pulling from.
> 
> > In addition the text in policy states that the canonical list is
> > maintained by the archive anyway. :)
> 
> I don't see how this would work.  The program would dynamically retrieve
> the list of sections every time it ran?  This seems like a bad idea, and
> even impossible in a lot of situations (off-line development work, for
> instance).

When I researched this at the time, there were two clear groups of
users for this information [U] (now summarized in [W]).

The first were package manager frontends and similar, which need to
fetch archive metadata anyway, and they do not need to do that all
the time, as they tend to cache it. For this group, using an out-of-band
file provided by a non-canonical package seems suboptimal when the
information can sit alongside the rest of the metadata to
download. For example, dselect is a prominent omission from that list;
for it I'd rather wait for proper metadata from the archives themselves
than introduce the hardcoding, or make it Debian-specific by having it
depend on a Debian Policy specific file. :)

  [U] 
  [W] 

For off-line tools such as linters, syntax highlighters, and similar, it
certainly seems like a problem to require fetching the data from the
archive. In some cases, though, relying on an external package that
might update the data outside of the control of the tool might be
undesirable, and it might be better to do what lintian is doing and
refresh it as part of the release process.

Then I suppose there's a third group composed of services. But those
I guess fall more or less under the package manager frontends case,
as they need to fetch metadata from the archive anyway(?).

> We maintain a list of archive sections in Policy anyway, so it's easy for
> us to provide this list in a machine-readable format as well.  (Well, we
> don't have the descriptions, but that's not hard to add and doesn't really
> add much additional maintenance work.)
> 
> I think it's fine that a debian-policy-data package only provide
> information for the Debian archive.  The same is also true of the virtual
> package names, of course; some other archive may have different virtual
> packages too.  Programs that want to work with various different package
> archives will need to know how to obtain this data from multiple sources.
> The intent is to provide a tiny package that others can easily depend on
> without much overhead.

Oh, I didn't mean to imply that Debian Policy should provide data or
support for other non-Debian archives.

My point is that perhaps it is not the best way to provide some of
this data in the first place, because:

  - it's not the canonical origin of the data,
  - having to fork the policy package just to amend the sections seems
    burdensome, when the latter change way less often than the former,
  - it might make code that has to support this data Debian-specific.

If we need an off-line replica of the data, it might perhaps make more
sense for the archive admins (ftp-masters in this case) to provide it,
in a similar way as we have a debian-archive-keyring. Of course they'd
need to agree to that first. :)

Barring that, having a single place to include this kind of information
in a uniform way, similar to what distro-info does, might be the
second-best option.

But even then, if the least bad solution is to have debian-policy
provide the data, what I was trying to have at least taken into
account is that it would be nice to specify a somewhat neutral
hierarchical structure in the filesystem, and ideally a common file
format, so that programs can just check for the vendor and
do the equivalent of something like:

   load /usr/share/distro-metadata/-archive-metadata.

instead of say, having debian-policy hardcoded therein or similar, so
you could just key on the vendor and be somewhat neutral.
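
A minimal sketch of that vendor-keyed lookup, in Python. Everything
specific here is an assumption made for illustration: the
"<vendor>-archive-metadata.json" naming, the choice of JSON, and reading
the vendor from the ID= line of /etc/os-release (a real tool might ask
dpkg-vendor instead). No package is known to ship such files.

```python
import json
from pathlib import Path

# Hypothetical base directory and naming scheme; neither is defined by
# any existing package. They only illustrate keying on the vendor.
METADATA_DIR = Path("/usr/share/distro-metadata")


def current_vendor(os_release=Path("/etc/os-release")):
    """Return the ID= value from an os-release file, or None if unknown."""
    try:
        text = Path(os_release).read_text(encoding="utf-8")
    except OSError:
        return None
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "ID":
            return value.strip().strip("\"'")
    return None


def metadata_path(vendor, base=METADATA_DIR):
    """Build the vendor-keyed file name (assumed naming, JSON assumed)."""
    return Path(base) / f"{vendor}-archive-metadata.json"


def load_archive_metadata(vendor=None, base=METADATA_DIR):
    """Load the metadata for a vendor; None if vendor or file is missing."""
    vendor = vendor or current_vendor()
    if vendor is None:
        return None
    try:
        with metadata_path(vendor, base).open(encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

The calling program never names debian-policy or any other provider
package; it only needs the vendor string and the shared directory.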

> >> - The list of valid Debian control field names (by type of control file)
> 
> > This one, I'm uncertain, but I'd tend to think it is partly in a similar
> > situation to the previous one.
> 
> > For example dpkg contains 

Bug#865769: Second data package including some machine-readable data

2017-06-25 Thread Russ Allbery
Guillem Jover  writes:
> On Sat, 2017-06-24 at 09:57:33 -0700, Russ Allbery wrote:

>> - The list of archive sections and their descriptions

> I think this belongs on each archive providing those, alongside the
> other archive metadata. And I'd rather see the involved parties
> defining an appropriate file to provide so that any downloader which
> has to fetch the metadata anyway would use it instead of hardcoding it.

> Using a file from policy does not seem useful to me, because it would
> mean software would need to depend on such policy provided package,
> and if you are going to mix and match repos, you really need the
> metadata from the archive you are pulling from.

> In addition the text in policy states that the canonical list is
> maintained by the archive anyway. :)

I don't see how this would work.  The program would dynamically retrieve
the list of sections every time it ran?  This seems like a bad idea, and
even impossible in a lot of situations (off-line development work, for
instance).

We maintain a list of archive sections in Policy anyway, so it's easy for
us to provide this list in a machine-readable format as well.  (Well, we
don't have the descriptions, but that's not hard to add and doesn't really
add much additional maintenance work.)

I think it's fine that a debian-policy-data package only provide
information for the Debian archive.  The same is also true of the virtual
package names, of course; some other archive may have different virtual
packages too.  Programs that want to work with various different package
archives will need to know how to obtain this data from multiple sources.
The intent is to provide a tiny package that others can easily depend on
without much overhead.

>> - The list of valid Debian control field names (by type of control file)

> This one, I'm uncertain, but I'd tend to think it is partly in a similar
> situation to the previous one.

> For example dpkg already contains such a list (probably more
> exhaustive) in Dpkg::Control::Fields, and I don't see making dpkg
> depend on an external list, because dpkg is being used beyond Debian.

This was just an idle thought of mine, and maybe it doesn't solve any real
problems.

> For the equivalent in policy I think I see where you are coming from,
> and I think it would be nice to have most of policy in a declarative
> format that could be used by linters, or some parsers, but if that means
> it's going to make those somewhat Debian-specific it might not take
> off.

I'm in general fine with the things provided by Debian Policy being
Debian-specific.  That, in my opinion, is the point of the package.  If
some other distribution wants something equivalent, they can certainly
fork Debian Policy or write their own separate document that supplements
Debian Policy, and maintain corresponding data files.

> The list of common licenses perhaps. Other things that come to mind
> could be perhaps a file with common regexes to validate things that
> policy specifies, say package names, version strings etc. Precisely
> because those can and do diverge from what dpkg accepts for example.

Yes, those would also be interesting.

> Valid pathnames, etc, and as I've mentioned above ideally all of policy
> would be available in a declarative format, but that'd be a pretty huge
> undertaking. But then it might make sense to do a quick poll and ask
> whether people would use any of this, because otherwise it seems perhaps
> a bit like a waste.

Indeed.  The virtual package name list has a specific use case already,
and people are suggesting using sed scripts to parse files from the
debian-policy package to generate it right now, so maybe we should just
start there and see if uses of the other data actually materialize.

Lintian is a large possible use case, but Lintian already has mechanisms
for gathering and maintaining this data internally, and Lintian may not
want to depend on a debian-policy-data package for various reasons (it
makes lintian.debian.org a bit harder).

> I don't think I have a direct use for any of the above anyway, but I
> also think I'd prefer YAML, because it is more human readable. But not a
> strong objection in any case.

I have a professional aversion to YAML because the security properties of
YAML are so awful.

I wish everyone would just use TOML, but unfortunately it's not at a 1.0
version yet and is not as widely supported by default as JSON is.

-- 
Russ Allbery (r...@debian.org)   



Bug#865769: Second data package including some machine-readable data

2017-06-25 Thread Guillem Jover
Hi!

On Sat, 2017-06-24 at 09:57:33 -0700, Russ Allbery wrote:
> Package: debian-policy
> Version: 4.0.0.2
> Severity: wishlist

> A discussion in #865720 got me thinking that there is some data maintained
> in Policy that would be useful to have in a machine-readable format.  The
> things that have occurred to me so far are:
> 
> - The list of registered virtual packages

This one definitely makes sense, because policy is the canonical place
where this is defined.

> - The list of archive sections and their descriptions

I think this belongs on each archive providing those, alongside the
other archive metadata. And I'd rather see the involved parties
defining an appropriate file to provide so that any downloader which
has to fetch the metadata anyway would use it instead of hardcoding it.

Using a file from policy does not seem useful to me, because it would
mean software would need to depend on such policy provided package,
and if you are going to mix and match repos, you really need the
metadata from the archive you are pulling from.

In addition the text in policy states that the canonical list is
maintained by the archive anyway. :)

> - The list of valid Debian control field names (by type of control file)

This one, I'm uncertain, but I'd tend to think it is partly in a similar
situation to the previous one.

For example dpkg already contains such a list (probably more
exhaustive) in Dpkg::Control::Fields, and I don't see making dpkg
depend on an external list, because dpkg is being used beyond Debian.

The "list" in dpkg currently has some problems though:

 - in a perl module; not that easily accessible to other languages.
 - tracks on which control file the fields are available, but cannot
   currently distinguish the differing semantics (field separator) for
   fields with the same name, f.ex. Files in .changes and .dsc.
 - lacks information whether a field is folded, simple, multiline, etc.

My plan is to remedy at least the last two points with a new perl
module hierarchy. I'm not sure if the first is worth "fixing" in dpkg
though?
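
To make the last two points concrete, here is a rough sketch in Python
(not dpkg's actual or planned Perl interface) of a table keyed by control
file type as well as field name, recording the field kind and, for
tabular fields, the meaning of each column. The names FieldKind,
FieldSpec, FIELDS, lookup and parse_files_line are all hypothetical. The
column lists follow Policy's description of Files in .dsc and .changes.

```python
from dataclasses import dataclass
from enum import Enum


class FieldKind(Enum):
    """The three field syntaxes Policy distinguishes."""
    SIMPLE = "simple"
    FOLDED = "folded"
    MULTILINE = "multiline"


@dataclass(frozen=True)
class FieldSpec:
    name: str
    kind: FieldKind
    # Meaning of the whitespace-separated columns on each continuation
    # line, for fields that carry tabular data; empty otherwise.
    columns: tuple = ()


# Keyed by (control file type, lower-cased field name), so that the same
# field name can carry different semantics in different files.
FIELDS = {
    ("dsc", "source"): FieldSpec("Source", FieldKind.SIMPLE),
    ("dsc", "files"): FieldSpec(
        "Files", FieldKind.MULTILINE, ("md5sum", "size", "filename")),
    ("changes", "files"): FieldSpec(
        "Files", FieldKind.MULTILINE,
        ("md5sum", "size", "section", "priority", "filename")),
}


def lookup(file_type, field_name):
    """Return the FieldSpec for a field in a given file type, or None."""
    return FIELDS.get((file_type, field_name.lower()))


def parse_files_line(file_type, line):
    """Split one Files continuation line according to the file type."""
    spec = lookup(file_type, "Files")
    values = line.split()
    if spec is None or len(values) != len(spec.columns):
        raise ValueError(f"unexpected Files line for {file_type}: {line!r}")
    return dict(zip(spec.columns, values))
```

A table like this is plain data, so it could equally be serialized to a
file and read from languages other than Perl.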

For the equivalent in policy I think I see where you are coming from,
and I think it would be nice to have most of policy in a declarative
format that could be used by linters, or some parsers, but if that
means it's going to make those somewhat Debian-specific it might not
take off. I guess to avoid that the path and names to get to that
information would need to be somewhat neutral and allow for other
derivatives with their own policies. :)

> These are things that either we already maintain or that have no other
> obvious place to live.  This data could then be consumed by packages like
> lintian (although that's a bit tricky for lintian.debian.org),
> libconfig-model-dpkg-perl, etc.

The list of common licenses perhaps. Other things that come to mind
could be perhaps a file with common regexes to validate things that
policy specifies, say package names, version strings etc. Precisely
because those can and do diverge from what dpkg accepts for example.
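
As an illustration of what one such regex could look like, here is a
small Python sketch for package names only. It encodes the Policy wording
that names consist of lower case letters, digits, plus, minus and
periods, are at least two characters long, and start with an alphanumeric
character. The identifiers are made up for the example, and it does not
describe what dpkg itself accepts.

```python
import re

# First character: lower case letter or digit. Then one or more of the
# allowed characters, which enforces the two-character minimum.
PACKAGE_NAME_RE = re.compile(r"[a-z0-9][a-z0-9+.-]+")


def is_policy_package_name(name):
    """Return True if the whole string matches the Policy naming rule."""
    return PACKAGE_NAME_RE.fullmatch(name) is not None
```

Shipping the pattern as data rather than code would let linters in
different languages share one definition.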

Valid pathnames, etc, and as I've mentioned above ideally all of
policy would be available in a declarative format, but that'd be a
pretty huge undertaking. But then it might make sense to do a quick
poll and ask whether people would use any of this, because otherwise
it seems perhaps a bit like a waste.

> The idea would be to provide these in some machine-readable form (probably
> JSON unless someone has objections) in files under /usr/share/debian-policy
> or some similar path (so that software can consume them) in a separate
> binary package built from the debian-policy package (debian-policy-data,
> perhaps) so that other packages can depend on that package without pulling
> in the larger human-focused Policy documentation.

I don't think I have a direct use for any of the above anyway, but I
also think I'd prefer YAML, because it is more human readable. But not
a strong objection in any case.

Thanks,
Guillem



Bug#865769: Second data package including some machine-readable data

2017-06-24 Thread Russ Allbery
Package: debian-policy
Version: 4.0.0.2
Severity: wishlist

A discussion in #865720 got me thinking that there is some data maintained
in Policy that would be useful to have in a machine-readable format.  The
things that have occurred to me so far are:

- The list of registered virtual packages
- The list of archive sections and their descriptions
- The list of valid Debian control field names (by type of control file)

These are things that either we already maintain or that have no other
obvious place to live.  This data could then be consumed by packages like
lintian (although that's a bit tricky for lintian.debian.org),
libconfig-model-dpkg-perl, etc.

The idea would be to provide these in some machine-readable form (probably
JSON unless someone has objections) in files under /usr/share/debian-policy
or some similar path (so that software can consume them) in a separate
binary package built from the debian-policy package (debian-policy-data,
perhaps) so that other packages can depend on that package without pulling
in the larger human-focused Policy documentation.
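
A sketch of how a consumer might read such a file, in Python. Only the
/usr/share/debian-policy directory comes from the proposal above; the
file name "virtual-packages.json" and the JSON layout are assumptions
made up for this example, since no format has been decided.

```python
import json
from pathlib import Path

DATA_DIR = Path("/usr/share/debian-policy")      # path from the proposal
VIRTUAL_PACKAGES_FILE = "virtual-packages.json"  # hypothetical file name


def load_virtual_packages(data_dir=DATA_DIR):
    """Return the set of registered virtual package names.

    Assumed layout (not a real schema):
    {"virtual-packages": [{"name": "...", "description": "..."}, ...]}
    """
    path = Path(data_dir) / VIRTUAL_PACKAGES_FILE
    with path.open(encoding="utf-8") as handle:
        document = json.load(handle)
    return {entry["name"] for entry in document["virtual-packages"]}


def not_registered(names, registered):
    """Return, sorted, the given names that are absent from the list."""
    return sorted(set(names) - registered)
```

A linter could call not_registered() on the names a package provides and
then decide for itself which of the results deserve a warning.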

If anyone has ideas for other things that should be included, or any
concerns, please speak up!

-- System Information:
Debian Release: 9.0
  APT prefers unstable
  APT policy: (990, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 4.9.0-3-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)

debian-policy depends on no packages.

debian-policy recommends no packages.

Versions of packages debian-policy suggests:
pn  doc-base  

-- no debconf information