Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
Hi, On Wed, 14 Apr 2021, Lucas Nussbaum wrote: > I think that in Debian, we would aim for a better separation between: > > A/ QA tools development, focused on getting the good tools to analyze > packages (output: tools, as Debian packages) > > B/ infrastructure that mass-process the archive using QA tools. (output: > current status of each package in Debian, analyzed with the latest > version of a given tool, as a parsable file) > > C/ infrastructure that gathers the current status from all instances of > (B) and exposes it per-package, per-maintainer, per-team, etc. > > (C) could even be split into: > (C.1) infrastructure that gathers the status and stores it into a > common DB; > (C.2) infrastructure that uses (C.1) to provide useful > per-package/per-maintainer frontends (views). Fully agreed on this. tracker.debian.org is clearly in the scope of (C) but had started to move into (B). Once I realized this, I decided that it would be better to have a separate project, which is how I ended up designing "debusine". See https://salsa.debian.org/freexian-team/debusine/-/blob/master/docs/devel/why.rst As I announced a few days ago, I will invest Freexian's money in this project, so you're welcome to watch the project (in GitLab speak, i.e. enable notifications) so that you can contribute to its design. The first milestone will be oriented towards package building, not lintian processing, but I'm happy to include this in the roadmap at some point. Cheers, -- Raphaël Hertzog The Debian Handbook: https://debian-handbook.info/get/ Debian Long Term Support: https://deb.li/LTS
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
Hi, On Wed, Apr 14, 2021 at 1:49 AM Lucas Nussbaum wrote: > > C/ infrastructure that gathers the current status from all instances of > (B) and exposes it per-package, per-maintainer, per-team, etc. For some data, such as Lintian packaging hints, there may be a powerful combination of AMQP and PostgreSQL. UDD could even provide a RabbitMQ instance as the primary interface for dynamic data collection. Very soon, UDD will collect Lintian's packaging hints (formerly known as tags) in real time. Instead of grouping data as Lintian's run, our runners could already produce rows suitable for the 'lintian' table. (Alternatively, RabbitMQ could take apart the Lintian runs and re-broadcast the data hint by hint on an adjacent channel.) In a super simple design, UDD could collect those hints and true them up with its more stable data sources like packages in the archive. That design would weigh relevance over completeness. UDD data would always be current even though occasionally a packaging hint might be lost. No sweat—the missing hint will be captured next week. The point behind this email is a hope that a conceptual insight might emerge: UDD could become an event collector. The result would be an up-to-date Lintian table that also ties to UDD's static data—which I do not believe it does currently. Kind regards Felix Lechner
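The idea above of taking a Lintian run apart and producing one row per hint could be sketched roughly as follows. This is only an illustration under stated assumptions: the JSON layout, the field names and the HintRow columns are all made up for the example and are not the real Lintian output or the real UDD 'lintian' table. The message body is treated as a plain string, however it arrives from the queue.

```python
import json
from typing import Iterator, NamedTuple


class HintRow(NamedTuple):
    """One row for a 'lintian'-style table (hypothetical column set)."""
    source: str
    version: str
    lintian_version: str
    hint: str
    note: str


def rows_from_run(message_body: str) -> Iterator[HintRow]:
    """Take apart one run result (assumed JSON layout) hint by hint."""
    run = json.loads(message_body)
    for hint in run.get("hints", []):
        yield HintRow(
            source=run["source"],
            version=run["version"],
            lintian_version=run["lintian_version"],
            hint=hint["name"],
            note=hint.get("note", ""),  # a hint without a note gets an empty string
        )


# Made-up sample message; none of these values come from a real run.
example = json.dumps({
    "source": "hello",
    "version": "2.10-2",
    "lintian_version": "2.104.0",
    "hints": [
        {"name": "example-hint-one", "note": "debian/control"},
        {"name": "example-hint-two"},
    ],
})
rows = list(rows_from_run(example))
```

A collector built this way would lose nothing but the affected rows if a single message went missing, which matches the "relevance over completeness" trade-off described above.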
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
Hi Lucas, TL;DR: please find one idea to solve your issue below. > provide the current > status of the archive against the current version of lintian as > something parsable Just for lintian.d.n (which is about to be transferred to lintian.d.o), that is exactly what we provide. It just won't be one file like it used to be. We plan instead to produce packaging hints based on heuristics designed to provide the best service to *maintainers*. I am sorry about the inconvenience, but as a service facing the public (a distinction you likewise recognized in your previous message), the change makes sense for us. We hope to prioritize based on: - packages for which no or no recent runs are available - frequency of uploads (more uploads, better data) - team requirements (for their statistics) UDD can subscribe to the AMQP "results" queue and decide independently, i.e. based on other input, when "a run across the archive" is substantially complete. We previously used DAKweb for that purpose, but our services are now available in real time. But why wait? Why not just add a "lintian_version" column to your table [1] and update the table at regular intervals, when you have collected a sufficient number of runs? The Lintian version is in our JSON results. Next, cut from your table those sources no longer known to the archive. For an example of how to do that, please see here for a solution via DAKweb. [2] That is the script we use now to DROP, via ON DELETE CASCADE, website data that is obsolete due to changes in the archive. HTH Kind regards Felix Lechner [1] https://udd.debian.org/schema/udd.html#public.table.lintian [2] https://salsa.debian.org/lintian/taxiv/-/blob/master/get-archive-state#L149-150
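The two suggestions above (a "lintian_version" column, and cutting sources the archive no longer knows) might look like the sketch below. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; UDD itself is PostgreSQL. The table layout, the prune() helper and the sample data are assumptions made for illustration and are not the actual UDD schema or the script referenced in [2].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
    CREATE TABLE sources (
        source TEXT PRIMARY KEY
    );
    CREATE TABLE lintian (
        source TEXT NOT NULL REFERENCES sources(source) ON DELETE CASCADE,
        hint TEXT NOT NULL,
        lintian_version TEXT NOT NULL  -- which Lintian produced this row
    );
""")

# Made-up sample data: one source still in the archive, one that has left it.
conn.executemany("INSERT INTO sources VALUES (?)", [("hello",), ("gone-pkg",)])
conn.executemany(
    "INSERT INTO lintian VALUES (?, ?, ?)",
    [("hello", "example-hint", "2.104.0"), ("gone-pkg", "example-hint", "2.103.0")],
)
conn.commit()


def prune(conn, archive_sources):
    """Delete sources the archive no longer knows; their hints go via the cascade."""
    known = {row[0] for row in conn.execute("SELECT source FROM sources")}
    stale = known - set(archive_sources)
    conn.executemany(
        "DELETE FROM sources WHERE source = ?", [(s,) for s in sorted(stale)]
    )
    conn.commit()
    return stale


removed = prune(conn, ["hello"])
remaining = conn.execute("SELECT source, lintian_version FROM lintian").fetchall()
```

With the version stored per row, a consumer can decide for itself when enough rows carry the newest Lintian version to call a "run across the archive" substantially complete.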
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
(Adding debian-qa@ to Cc to broaden the discussion a bit) Hi, On the issue of lintian.d.n/lintian.d.o/UDD/tracker.d.o, I wonder if the separation of concerns is the right one. I think that in Debian, we would aim for a better separation between: A/ QA tools development, focused on getting the good tools to analyze packages (output: tools, as Debian packages) B/ infrastructure that mass-process the archive using QA tools. (output: current status of each package in Debian, analyzed with the latest version of a given tool, as a parsable file) C/ infrastructure that gathers the current status from all instances of (B) and exposes it per-package, per-maintainer, per-team, etc. (C) could even be split into: (C.1) infrastructure that gathers the status and stores it into a common DB; (C.2) infrastructure that uses (C.1) to provide useful per-package/per-maintainer frontends (views). lintian.d.n is again an attempt at solving (B) and (C) at the same time. While I don't want to prevent anyone from working on their projects of choice, I wonder if someone else shouldn't work on a 'lintian archive runner' service whose sole mission would be to provide the current status of the archive against the current version of lintian as something parsable, just to feed UDD/tracker/others. Lucas
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
On 13/04/21 at 11:49 -0700, Felix Lechner wrote: > Hi, > > On Tue, Apr 13, 2021 at 11:27 AM Lucas Nussbaum wrote: > > > > From the UDD point of view, I would very much prefer to get a full dump > > something I can import every few hours, than having to deal with a > > stream of updates or with querying a per-package API. > > Since few users ever need *all* data, would it make sense to > re-conceive UDD as a "query broker" to help people get the data they > actually need? Well if you adopt it, feel free to reimplement it the way you want :) > > Currently the full import (that runs twice a day) takes about 10 minutes > > The power of COPY. The Lintian website currently takes 12 hours to > import a single run across the archive in 42 bulk UPSERTS via JSON > (but will eventually cease to generate data that way). No, it's just PREPARE/EXECUTE (inside a single big transaction -- I think) Lucas
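For readers unfamiliar with the pattern mentioned above (one parameterized statement executed many times inside a single big transaction), here is a rough sketch. It is not the UDD importer: it uses SQLite from the Python standard library as a stand-in for PostgreSQL's PREPARE/EXECUTE, and the table layout and the full_import() helper are invented for the example.

```python
import sqlite3


def full_import(conn, rows):
    """Replace the whole table from a full dump inside one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM lintian")
        # One parameterized statement, executed once per row.
        conn.executemany("INSERT INTO lintian (source, hint) VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lintian (source TEXT NOT NULL, hint TEXT NOT NULL)")

# Two successive full dumps with made-up contents.
full_import(conn, [("hello", "old-hint")])
full_import(conn, [("hello", "new-hint"), ("world", "other-hint")])
count = conn.execute("SELECT COUNT(*) FROM lintian").fetchone()[0]
```

Because the delete and the inserts share one transaction, a failed import leaves the previous dump in place rather than an empty or half-filled table.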
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
Hi, On Tue, Apr 13, 2021 at 11:27 AM Lucas Nussbaum wrote: > > From the UDD point of view, I would very much prefer to get a full dump > something I can import every few hours, than having to deal with a > stream of updates or with querying a per-package API. Since few users ever need *all* data, would it make sense to re-conceive UDD as a "query broker" to help people get the data they actually need? > Currently the full import (that runs twice a day) takes about 10 minutes The power of COPY. The Lintian website currently takes 12 hours to import a single run across the archive in 42 bulk UPSERTS via JSON (but will eventually cease to generate data that way). Kind regards Felix Lechner
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
On 13/04/21 at 18:45 +0200, Mattia Rizzolo wrote: > [ Adding lucas@ to CC since he is the main person behind UDD after all ] > > On Sun, Apr 11, 2021 at 12:45:14PM -0700, Felix Lechner wrote: > > On Sat, May 9, 2020 at 5:33 PM Mattia Rizzolo wrote: > > > have lintian decide on a nice machine-parsable (text!) format > > > then udd will adapt its importer. > > > > As you know, both of these already happened several months ago. > > Indeed, I consider that done by now. > > > I have > > not commented here because I am still chewing on a related, but much > > harder problem: > > I'd have probably used a different bug, but guess we'll cope. > > > Lintian will soon cease to run blindly across the archive and instead > > produce packaging hints on demand, as uploads are received by the > > archive. There is no batch process anymore that will produce files for > > the entire archive the way you expect. Instead, Lintian's new website > > https://lintian.debian.*net* offers a JSON interface [1] to get up to > > date information similar to DAKweb. [2] > > So, if we really go down this route, I think we need to: > > * Have the importer able to run a full import of everything, which means > looping through all sources (which means running some ~30k HTTP GETs) > and storing them. > * Figure out a way for UDD to know it needs to check the status of a > package. This likely means a job that compares the set of known > (package, version, suite) (is the tuple right?) with what is available > in the lintian table: if something is missing query the lintian > website for new data. > * perhaps have the lintian website *push* new data to udd.d.o. I'm > conflicted if this should be just a trigger ("hey I've just processed > this, check it out yourself") or if it should carry the actual data as > well. I'm sure you'd like a HTTP post or such, but I can tell you > that we'd likely prefer something through SSH. 
> > > Since after all you did look at udd several times, I believe you should > already be able to implement the first 2? > > > > All this said, I still don't understand why you wouldn't be able to > provide a view of everything. Since you set up that API, couldn't you > have a endpoint with *all* packages and everything, like the current > dump? That sounds much more trivial than what you are proposing… From the UDD point of view, I would very much prefer to get a full dump, something I can import every few hours, rather than having to deal with a stream of updates or with querying a per-package API. Currently the full import (that runs twice a day) takes about 10 minutes (and I don't remember if it has been optimized, so there might be room for improvement). Lucas
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
On Tue, Apr 13, 2021 at 11:03:12AM -0700, Felix Lechner wrote: > On Tue, Apr 13, 2021 at 9:46 AM Mattia Rizzolo wrote: > > full import of everything > > I do not believe that is practicable. There are other ideas below. > > > * Figure out a way for UDD to know it needs to check the status of a > > package. > > Such a polling technique seems likewise like a so-so solution. These two points (noting that the second also takes care of the first) are still needed, for whenever UDD misses a notification or similar, or for bootstrapping the tables (else we'd need a complete re-run of Lintian over everything, which I understand is going to be somewhat rarer with the new setup than it used to be, so…). > > * perhaps have the lintian website *push* new data to udd.d.o. > > I love this idea (from Jelmer), if you can make it work. We will > publish the files you consume now in real time. You can subscribe via > RabbitMQ and collect them, if that is helpful to you. Mh, I have never used RabbitMQ myself, but I suppose it's one way, and probably more "contemporary" than you providing SSH triggers or so. However, I'd have no clue how to incorporate a long-running process in the current UDD setup; I'll have to leave that to Lucas. -- regards, Mattia Rizzolo GPG Key: 66AE 2B4A FCCF 3F52 DA18 4D18 4B04 3FCD B944 4540 More about me: https://mapreri.org Launchpad user: https://launchpad.net/~mapreri Debian QA page: https://qa.debian.org/developer.php?login=mattia
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
Hi, On Tue, Apr 13, 2021 at 9:46 AM Mattia Rizzolo wrote: > > I'd have probably used a different bug, but guess we'll cope. I thought you might get upset the other way around. This bug already blocks a UDD counterpart (#960156). The solutions I offered to date were stop gaps. > full import of everything I do not believe that is practicable. There are other ideas below. > * Figure out a way for UDD to know it needs to check the status of a > package. Such a polling technique seems likewise like a so-so solution. > * perhaps have the lintian website *push* new data to udd.d.o. I love this idea (from Jelmer), if you can make it work. We will publish the files you consume now in real time. You can subscribe via RabbitMQ and collect them, if that is helpful to you. > you did look at udd several times As a UDD user, I believe the data may be better off being curated in real time, if the effort can be justified. The tables don't always match up. > Couldn't you > have a endpoint with *all* packages and everything, like the current > dump? It is a speed issue. We are in the process of moving to DSA-operated equipment. Maybe they have faster disks. Kind regards Felix Lechner
Bug#960154: Feed UDD with just-in-time packaging hints from Lintian
[ Adding lucas@ to CC since he is the main person behind UDD after all ] On Sun, Apr 11, 2021 at 12:45:14PM -0700, Felix Lechner wrote: > On Sat, May 9, 2020 at 5:33 PM Mattia Rizzolo wrote: > > have lintian decide on a nice machine-parsable (text!) format > > then udd will adapt its importer. > > As you know, both of these already happened several months ago. Indeed, I consider that done by now. > I have > not commented here because I am still chewing on a related, but much > harder problem: I'd have probably used a different bug, but guess we'll cope. > Lintian will soon cease to run blindly across the archive and instead > produce packaging hints on demand, as uploads are received by the > archive. There is no batch process anymore that will produce files for > the entire archive the way you expect. Instead, Lintian's new website > https://lintian.debian.*net* offers a JSON interface [1] to get up to > date information similar to DAKweb. [2] So, if we really go down this route, I think we need to: * Have the importer able to run a full import of everything, which means looping through all sources (which means running some ~30k HTTP GETs) and storing them. * Figure out a way for UDD to know it needs to check the status of a package. This likely means a job that compares the set of known (package, version, suite) (is the tuple right?) with what is available in the lintian table: if something is missing query the lintian website for new data. * perhaps have the lintian website *push* new data to udd.d.o. I'm conflicted if this should be just a trigger ("hey I've just processed this, check it out yourself") or if it should carry the actual data as well. I'm sure you'd like a HTTP post or such, but I can tell you that we'd likely prefer something through SSH. Since after all you did look at udd several times, I believe you should already be able to implement the first 2? All this said, I still don't understand why you wouldn't be able to provide a view of everything. 
Since you set up that API, couldn't you have an endpoint with *all* packages and everything, like the current dump? That sounds much more trivial than what you are proposing… -- regards, Mattia Rizzolo GPG Key: 66AE 2B4A FCCF 3F52 DA18 4D18 4B04 3FCD B944 4540 More about me: https://mapreri.org Launchpad user: https://launchpad.net/~mapreri Debian QA page: https://qa.debian.org/developer.php?login=mattia
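The second point in the list above, a job that compares the set of known (package, version, suite) tuples with what is in the lintian table, comes down to a set difference. A minimal sketch follows, assuming that tuple really is the right key (the message itself leaves that open). The Key type, the missing_from_lintian() helper and the sample data are hypothetical.

```python
from typing import Iterable, NamedTuple, Set


class Key(NamedTuple):
    package: str
    version: str
    suite: str


def missing_from_lintian(archive: Iterable[Key], lintian: Iterable[Key]) -> Set[Key]:
    """Tuples the archive knows about that have no rows in the lintian table yet."""
    return set(archive) - set(lintian)


# Made-up state: three tuples in the archive, one already covered by Lintian data.
archive_state = [
    Key("hello", "2.10-2", "unstable"),
    Key("hello", "2.10-1", "stable"),
    Key("world", "1.0-1", "unstable"),
]
lintian_state = [Key("hello", "2.10-1", "stable")]

# Sorted only to make the order of follow-up queries deterministic.
to_fetch = sorted(missing_from_lintian(archive_state, lintian_state))
```

Each entry in to_fetch would then be queried from the Lintian website. The same comparison also serves as a catch-up mechanism for any notification that was missed.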