Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-20 Thread Raphael Hertzog
Hi,

On Wed, 14 Apr 2021, Lucas Nussbaum wrote:
> I think that in Debian, we would aim for a better separation between:
> 
> A/ QA tools development, focused on getting the good tools to analyze
> packages (output: tools, as Debian packages)
> 
> B/ infrastructure that mass-processes the archive using QA tools. (output:
> current status of each package in Debian, analyzed with the latest
> version of a given tool, as a parsable file)
> 
> C/ infrastructure that gathers the current status from all instances of
> (B) and exposes it per-package, per-maintainer, per-team, etc.
> 
> (C) could even be split into:
>   (C.1) infrastructure that gathers the status and stores it into a
>   common DB;
>   (C.2) infrastructure that uses (C.1) to provide useful
>   per-package/per-maintainer frontends (views).

Fully agreed on this. tracker.debian.org is clearly in the scope
of (C) and had started to move into (B). Once I realized this, I decided
that it would be better to have a separate project; that is how I ended
up designing "debusine".

See 
https://salsa.debian.org/freexian-team/debusine/-/blob/master/docs/devel/why.rst

As I announced a few days ago, I will invest Freexian's money
in this project, so you're welcome to watch the project (in GitLab speak,
i.e. enable notifications) so that you can contribute to its design.

The first milestone will be oriented towards package building,
not lintian processing, but I'm happy to include this in the roadmap
at some point.

Cheers,
-- 
  ⢀⣴⠾⠻⢶⣦⠀   Raphaël Hertzog 
  ⣾⠁⢠⠒⠀⣿⡁
  ⢿⡄⠘⠷⠚⠋    The Debian Handbook: https://debian-handbook.info/get/
  ⠈⠳⣄       Debian Long Term Support: https://deb.li/LTS



Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-19 Thread Felix Lechner
Hi,

On Wed, Apr 14, 2021 at 1:49 AM Lucas Nussbaum wrote:
>
> C/ infrastructure that gathers the current status from all instances of
> (B) and exposes it per-package, per-maintainer, per-team, etc.

For some data, such as Lintian packaging hints, there may be a
powerful combination of AMQP and PostgreSQL. UDD could even provide a
RabbitMQ instance as the primary interface for dynamic data
collection.

Very soon, UDD will collect Lintian's packaging hints (formerly known
as tags) in real time. Instead of grouping data by Lintian run, our
runners could already produce rows suitable for the 'lintian' table.
(Alternatively, RabbitMQ could take apart the Lintian runs and
re-broadcast the data hint by hint on an adjacent channel.) In a super
simple design, UDD could collect those hints and true them up with its
more stable data sources like packages in the archive.

That design would weigh relevance over completeness. UDD data would
always be current even though occasionally a packaging hint might be
lost. No sweat—the missing hint will be captured next week.

The point behind this email is a hope that a conceptual insight might
emerge: UDD could become an event collector. The result would be an
up-to-date Lintian table that also ties to UDD's static data—which I
do not believe it does currently.

Kind regards
Felix Lechner



Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-14 Thread Felix Lechner
Hi Lucas,

TL;DR please find one idea to solve your issue below

> provide the current
> status of the archive against the current version of lintian as
> something parsable

Just for lintian.d.n (which is about to be transferred to
lintian.d.o), that is exactly what we provide. It just won't be one
file like it used to be. We plan instead to produce packaging hints
based on heuristics designed to provide the best service to
*maintainers*. I am sorry about the inconvenience, but as a service
facing the public—a distinction you likewise recognized in your
previous message—the change makes sense for us. We hope to prioritize
based on:

- packages for which no or no recent runs are available
- frequency of uploads (more uploads, better data)
- team requirements (for their statistics)
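
The three criteria above could be folded into a sort key. What follows
is only a minimal sketch: the field names (last_run, recent_uploads,
team_requested), the seven-day staleness cutoff and the order in which
the criteria are applied are all assumptions made for illustration and
are not taken from the Lintian code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Candidate:
    source: str
    last_run: Optional[datetime]  # None: no run is available at all
    recent_uploads: int           # hypothetical upload count for some window
    team_requested: bool          # hypothetical flag: a team needs the data


def priority_key(c, now, max_age=timedelta(days=7)):
    """Smaller keys are scheduled first."""
    stale = c.last_run is None or now - c.last_run > max_age
    # False sorts before True, so stale packages come first, then packages
    # with more uploads, then packages a team asked for.
    return (not stale, -c.recent_uploads, not c.team_requested)


def schedule(candidates, now):
    """Return source names in the order they would be processed."""
    return [c.source for c in sorted(candidates, key=lambda c: priority_key(c, now))]
```

A tuple key keeps the ordering strict and easy to audit; a weighted
score would be the alternative if the criteria should trade off against
each other.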

UDD can subscribe to the AMQP "results" queue and decide
independently, i.e. based on other input, when "a run across the
archive" is substantially complete. We previously used DAKweb for that
purpose, but our services are now available in real time.
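
One way a subscriber might decide that a run is "substantially
complete" is to track coverage of the archive for a given Lintian
version. This is a sketch under assumptions: the JSON keys "source" and
"lintian_version", the 95% threshold and the RunTracker class are
hypothetical, and the AMQP plumbing is left out, so consume() simply
takes a message body as a string.

```python
import json


class RunTracker:
    """Track which archive sources have a result for one Lintian version."""

    def __init__(self, archive_sources, lintian_version, threshold=0.95):
        self.archive_sources = set(archive_sources)  # e.g. taken from other UDD data
        self.lintian_version = lintian_version
        self.threshold = threshold                   # assumed cutoff, tune as needed
        self.seen = set()

    def consume(self, body):
        """Record one JSON message body; skip other versions and unknown sources."""
        result = json.loads(body)
        if result.get("lintian_version") != self.lintian_version:
            return
        source = result.get("source")
        if source in self.archive_sources:
            self.seen.add(source)

    def coverage(self):
        """Fraction of archive sources that have a result for this version."""
        if not self.archive_sources:
            return 1.0
        return len(self.seen) / len(self.archive_sources)

    def substantially_complete(self):
        return self.coverage() >= self.threshold
```

A queue callback would call consume() for every delivery and publish
the table once substantially_complete() returns True.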

But why wait? Why not just add a "lintian_version" column to your
table [1] and update the table at regular intervals, when you have
collected a sufficient number of runs? The Lintian version is in our
JSON results. Next, cut from your table those sources no longer known
to the archive.

For an example of how to do that, please see here for a solution via
DAKweb. [2] That is the script we use now to DROP, via ON DELETE
CASCADE, website data that is obsolete due to changes in the archive.
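
To illustrate the idea (this is not the script referenced in [2]), here
is a small self-contained sketch using SQLite from Python. The two-table
schema is a simplified assumption, not UDD's or the Lintian website's
real schema, and the package names and version strings are made up. The
PRAGMA is SQLite-specific; PostgreSQL always enforces foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys on request
conn.executescript("""
    CREATE TABLE sources (
        source  TEXT NOT NULL,
        version TEXT NOT NULL,
        PRIMARY KEY (source, version)
    );
    CREATE TABLE lintian (
        source          TEXT NOT NULL,
        version         TEXT NOT NULL,
        tag             TEXT NOT NULL,
        lintian_version TEXT NOT NULL,  -- the extra column suggested above
        FOREIGN KEY (source, version)
            REFERENCES sources (source, version) ON DELETE CASCADE
    );
""")


def prune(conn, archive_state):
    """Delete sources absent from archive_state; their hints go via the cascade."""
    known = set(archive_state)
    rows = conn.execute("SELECT source, version FROM sources").fetchall()
    stale = [row for row in rows if row not in known]
    with conn:  # one transaction for the whole cleanup
        conn.executemany(
            "DELETE FROM sources WHERE source = ? AND version = ?", stale)
    return len(stale)


with conn:
    conn.executemany("INSERT INTO sources VALUES (?, ?)",
                     [("hello", "2.10-2"), ("gone", "1.0-1")])
    conn.executemany("INSERT INTO lintian VALUES (?, ?, ?, ?)",
                     [("hello", "2.10-2", "example-hint", "2.104.0"),
                      ("gone", "1.0-1", "example-hint", "2.104.0")])

# Pretend the archive now only knows about hello 2.10-2.
removed = prune(conn, {("hello", "2.10-2")})
remaining = conn.execute("SELECT source, lintian_version FROM lintian").fetchall()
```

Only the parent rows are deleted explicitly; the rows in 'lintian' that
refer to them disappear through the foreign key.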

HTH

Kind regards
Felix Lechner

[1] https://udd.debian.org/schema/udd.html#public.table.lintian
[2] https://salsa.debian.org/lintian/taxiv/-/blob/master/get-archive-state#L149-150



Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-14 Thread Lucas Nussbaum
(Adding debian-qa@ to Cc to broaden the discussion a bit)

Hi,

On the issue of lintian.d.n/lintian.d.o/UDD/tracker.d.o, I wonder if the
separation of concerns is the right one.

I think that in Debian, we would aim for a better separation between:

A/ QA tools development, focused on getting the good tools to analyze
packages (output: tools, as Debian packages)

B/ infrastructure that mass-processes the archive using QA tools. (output:
current status of each package in Debian, analyzed with the latest
version of a given tool, as a parsable file)

C/ infrastructure that gathers the current status from all instances of
(B) and exposes it per-package, per-maintainer, per-team, etc.

(C) could even be split into:
  (C.1) infrastructure that gathers the status and stores it into a
  common DB;
  (C.2) infrastructure that uses (C.1) to provide useful
  per-package/per-maintainer frontends (views).

lintian.d.n is again an attempt at solving (B) and (C) at the same time.
While I don't want to prevent anyone from working on their projects of
choice, I wonder if someone else shouldn't work on a 'lintian archive
runner' service whose sole mission would be to provide the current
status of the archive against the current version of lintian as
something parsable, just to feed UDD/tracker/others.

Lucas



Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-13 Thread Lucas Nussbaum
On 13/04/21 at 11:49 -0700, Felix Lechner wrote:
> Hi,
> 
> On Tue, Apr 13, 2021 at 11:27 AM Lucas Nussbaum wrote:
> >
> > From the UDD point of view, I would very much prefer to get a full dump,
> > something I can import every few hours, rather than having to deal with a
> > stream of updates or with querying a per-package API.
> 
> Since few users ever need *all* data, would it make sense to
> re-conceive UDD as a "query broker" to help people get the data they
> actually need?

Well, if you adopt it, feel free to reimplement it the way you want :)

> > Currently the full import (that runs twice a day) takes about 10 minutes
> 
> The power of COPY. The Lintian website currently takes 12 hours to
> import a single run across the archive in 42 bulk UPSERTS via JSON
> (but will eventually cease to generate data that way).

No, it's just PREPARE/EXECUTE (inside a single big transaction -- I
think)
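
For readers unfamiliar with the pattern: the importer itself is not
shown in this thread, so the following is only a sketch of "one
parametrized statement, many rows, one transaction", using SQLite from
Python as a stand-in for PostgreSQL's PREPARE/EXECUTE. The table layout
and the function name full_import are assumptions.

```python
import sqlite3


def full_import(conn, rows):
    """Replace the table contents with `rows` in one transaction."""
    with conn:  # commits on success, rolls back if any row fails
        conn.execute("DELETE FROM lintian")
        # One parametrized INSERT, with each row bound to it in turn.
        conn.executemany(
            "INSERT INTO lintian (source, version, tag) VALUES (?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM lintian").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lintian ("
    "source TEXT NOT NULL, version TEXT NOT NULL, tag TEXT NOT NULL)")
```

Because the DELETE and the INSERTs share a transaction, a failed import
leaves the previous contents in place instead of an empty table.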

Lucas



Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-13 Thread Felix Lechner
Hi,

On Tue, Apr 13, 2021 at 11:27 AM Lucas Nussbaum wrote:
>
> From the UDD point of view, I would very much prefer to get a full dump,
> something I can import every few hours, rather than having to deal with a
> stream of updates or with querying a per-package API.

Since few users ever need *all* data, would it make sense to
re-conceive UDD as a "query broker" to help people get the data they
actually need?

> Currently the full import (that runs twice a day) takes about 10 minutes

The power of COPY. The Lintian website currently takes 12 hours to
import a single run across the archive in 42 bulk UPSERTS via JSON
(but will eventually cease to generate data that way).

Kind regards
Felix Lechner



Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-13 Thread Lucas Nussbaum
On 13/04/21 at 18:45 +0200, Mattia Rizzolo wrote:
> [ Adding lucas@ to CC since he is the main person behind UDD after all ]
> 
> On Sun, Apr 11, 2021 at 12:45:14PM -0700, Felix Lechner wrote:
> > On Sat, May 9, 2020 at 5:33 PM Mattia Rizzolo wrote:
> > > have lintian decide on a nice machine-parsable (text!) format
> > > then udd will adapt its importer.
> > 
> > As you know, both of these already happened several months ago.
> 
> Indeed, I consider that done by now.
> 
> > I have
> > not commented here because I am still chewing on a related, but much
> > harder problem:
> 
> I'd have probably used a different bug, but guess we'll cope.
> 
> > Lintian will soon cease to run blindly across the archive and instead
> > produce packaging hints on demand, as uploads are received by the
> > archive. There is no batch process anymore that will produce files for
> > the entire archive the way you expect. Instead, Lintian's new website
> > https://lintian.debian.*net* offers a JSON interface [1] to get up to
> > date information similar to DAKweb. [2]
> 
> So, if we really go down this route, I think we need to:
> 
> * Have the importer able to run a full import of everything, which means
>   looping through all sources (which means running some ~30k HTTP GETs)
>   and storing them.
> * Figure out a way for UDD to know it needs to check the status of a
>   package.  This likely means a job that compares the set of known
>   (package, version, suite) (is the tuple right?) with what is available
>   in the lintian table: if something is missing query the lintian
>   website for new data.
> * perhaps have the lintian website *push* new data to udd.d.o.  I'm
>   conflicted if this should be just a trigger ("hey I've just processed
>   this, check it out yourself") or if it should carry the actual data as
>   well.  I'm sure you'd like a HTTP post or such, but I can tell you
>   that we'd likely prefer something through SSH.
> 
> 
> Since after all you did look at udd several times, I believe you should
> already be able to implement the first 2?
> 
> 
> 
> All this said, I still don't understand why you wouldn't be able to
> provide a view of everything.  Since you set up that API, couldn't you
> have an endpoint with *all* packages and everything, like the current
> dump?  That sounds much more trivial than what you are proposing…

From the UDD point of view, I would very much prefer to get a full dump,
something I can import every few hours, rather than having to deal with a
stream of updates or with querying a per-package API.

Currently the full import (that runs twice a day) takes about 10 minutes
(and I don't remember if it has been optimized, so there might be room
for improvement).

Lucas



Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-13 Thread Mattia Rizzolo
On Tue, Apr 13, 2021 at 11:03:12AM -0700, Felix Lechner wrote:
> On Tue, Apr 13, 2021 at 9:46 AM Mattia Rizzolo wrote:
> > full import of everything
> 
> I do not believe that is practicable. There are other ideas below.
> 
> > * Figure out a way for UDD to know it needs to check the status of a
> >   package.
> 
> Such a polling technique seems likewise like a so-so solution.

These two points are still needed (note that the second also takes care
of the first): for whenever UDD misses a notification or similar, and
for bootstrapping the tables. Otherwise we'd need a complete re-run of
lintian over everything, which, as I understand it, is going to be
somewhat rarer with the new setup than it used to be, so…

> > * perhaps have the lintian website *push* new data to udd.d.o.
> 
> I love this idea (from Jelmer), if you can make it work. We will
> publish the files you consume now in real time. You can subscribe via
> RabbitMQ and collect them, if that is helpful to you.

Mh, I have never used RabbitMQ myself, but I suppose it's one way, and
probably more "contemporary" than you providing SSH triggers or so.
However, I'd have no clue how to incorporate a long-running process in
the current UDD setup; I'll have to leave that to Lucas.

-- 
regards,
Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18  4D18 4B04 3FCD B944 4540      .''`.
More about me:  https://mapreri.org                             : :'  :
Launchpad user: https://launchpad.net/~mapreri                  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-




Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-13 Thread Felix Lechner
Hi,

On Tue, Apr 13, 2021 at 9:46 AM Mattia Rizzolo wrote:
>
> I'd have probably used a different bug, but guess we'll cope.

I thought you might get upset the other way around. This bug already
blocks a UDD counterpart (#960156). The solutions I offered to date
were stopgaps.

> full import of everything

I do not believe that is practicable. There are other ideas below.

> * Figure out a way for UDD to know it needs to check the status of a
>   package.

Such a polling technique likewise seems like a so-so solution.

> * perhaps have the lintian website *push* new data to udd.d.o.

I love this idea (from Jelmer), if you can make it work. We will
publish the files you consume now in real time. You can subscribe via
RabbitMQ and collect them, if that is helpful to you.

> you did look at udd several times

As a UDD user, I believe the data may be better off being curated in
real time, if the effort can be justified. The tables don't always
match up.

> Couldn't you
> have an endpoint with *all* packages and everything, like the current
> dump?

It is a speed issue. We are in the process of moving to DSA-operated
equipment. Maybe they have faster disks.

Kind regards
Felix Lechner



Bug#960154: Feed UDD with just-in-time packaging hints from Lintian

2021-04-13 Thread Mattia Rizzolo
[ Adding lucas@ to CC since he is the main person behind UDD after all ]

On Sun, Apr 11, 2021 at 12:45:14PM -0700, Felix Lechner wrote:
> On Sat, May 9, 2020 at 5:33 PM Mattia Rizzolo wrote:
> > have lintian decide on a nice machine-parsable (text!) format
> > then udd will adapt its importer.
> 
> As you know, both of these already happened several months ago.

Indeed, I consider that done by now.

> I have
> not commented here because I am still chewing on a related, but much
> harder problem:

I'd have probably used a different bug, but guess we'll cope.

> Lintian will soon cease to run blindly across the archive and instead
> produce packaging hints on demand, as uploads are received by the
> archive. There is no batch process anymore that will produce files for
> the entire archive the way you expect. Instead, Lintian's new website
> https://lintian.debian.*net* offers a JSON interface [1] to get up to
> date information similar to DAKweb. [2]

So, if we really go down this route, I think we need to:

* Have the importer able to run a full import of everything, which means
  looping through all sources (which means running some ~30k HTTP GETs)
  and storing them.
* Figure out a way for UDD to know it needs to check the status of a
  package.  This likely means a job that compares the set of known
  (package, version, suite) (is the tuple right?) with what is available
  in the lintian table: if something is missing query the lintian
  website for new data.
* Perhaps have the lintian website *push* new data to udd.d.o.  I'm
  conflicted about whether this should be just a trigger ("hey I've just
  processed this, check it out yourself") or whether it should carry the
  actual data as well.  I'm sure you'd like an HTTP POST or such, but I
  can tell you that we'd likely prefer something through SSH.
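
The comparison job in the second point boils down to a set difference.
A minimal sketch, assuming (package, version, suite) is indeed the right
tuple; the fetch callable is a hypothetical stand-in for whatever query
the lintian website ends up offering, so no URL or response format is
implied here.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, str, str]  # (package, version, suite)


def missing_results(known: Iterable[Key], have: Iterable[Key]) -> List[Key]:
    """Return the archive tuples that have no entry in the lintian table."""
    return sorted(set(known) - set(have))


def refresh(known: Iterable[Key],
            table: Dict[Key, list],
            fetch: Callable[[Key], Optional[list]]) -> Dict[Key, list]:
    """Query only for what is missing; keep going if a result is not ready yet."""
    for key in missing_results(known, table):
        result = fetch(key)  # hypothetical per-package query
        if result is not None:
            table[key] = result
    return table
```

Run against an empty table, the same job doubles as the full import in
the first point, which is why the second point subsumes the first.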


Since after all you did look at udd several times, I believe you should
already be able to implement the first 2?



All this said, I still don't understand why you wouldn't be able to
provide a view of everything.  Since you set up that API, couldn't you
have an endpoint with *all* packages and everything, like the current
dump?  That sounds much more trivial than what you are proposing…

-- 
regards,
Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18  4D18 4B04 3FCD B944 4540      .''`.
More about me:  https://mapreri.org                             : :'  :
Launchpad user: https://launchpad.net/~mapreri                  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-

