Bug#963887: UDD: 'duck' importer broken since 2020-05-25
Hi Lucas,

On Thu, 2023-08-03 at 10:38 +0200, Lucas Nussbaum wrote:
> I submitted #1042947 to discuss re-creating a UDD duck importer,
> using the same model as the lintian importer.
>
> @Baptiste: could you take a look? There would be a few changes on the
> duck side that would make it much easier.

Sure, I'll have a look before next week.

Best,
--
Baptiste Beauplat
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
Hi all,

I submitted #1042947 to discuss re-creating a UDD duck importer, using
the same model as the lintian importer.

@Baptiste: could you take a look? There would be a few changes on the
duck side that would make it much easier.

Lucas
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
Re: Baptiste BEAUPLAT
> >> Maybe you could add that to vcswatch?
>
> The main differences between vcswatch and duck.d.n are:
>
> - history: duck used to keep 6 runs for each package, reporting only
>   after 3 failures. vcswatch only keeps the last run.

vcswatch could be improved by not notifying the users for each error.
At the moment the data model is very simple, but adding that would be
possible I'd think.

> - d/control: duck processed the Homepage as well as the
>   Vcs-{Git,SVN,Hg,Darcs} fields. vcswatch has a wider support for all
>   Vcs-*.

There's not much that vcswatch supports on top of that. CVS, but I
don't think anyone is still actively using it, the remaining entries
are just bitrot.

> - d/upstream/metadata: duck processed any urls found here.
> - worker: parallel worker for duck, single instance for vcswatch.

vcswatch can start several workers in parallel, the current config
starts up to 5 workers.

> I'm not convinced that adding those features would result in an
> improvement for vcswatch (Cc'ing Christoph to have his input on that).
>
> Creating a new sub-project, like vcswatch, to qa.debian.org would be
> more sensible IMHO. The new duck could only take care of the http urls
> and leave Vcs stuff to vcswatch.

You could reuse the vcsimport machinery. It's not very pretty, but
works.

Christoph
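For illustration, the history policy described above (keep 6 runs, report only after 3 failures) could be sketched roughly as follows. This is not duck's or vcswatch's actual code: the class name `UrlHistory` and its methods are hypothetical, and "3 failures" is read here as "the 3 most recent runs all failed", which is an assumption.

```python
from collections import deque


class UrlHistory:
    """Bounded per-URL run history (hypothetical helper, not duck's code)."""

    def __init__(self, keep=6, threshold=3):
        # Only the last `keep` results are retained; older ones fall off.
        self.runs = deque(maxlen=keep)
        self.threshold = threshold

    def record(self, ok):
        """Store the outcome of one run (True = the URL was reachable)."""
        self.runs.append(bool(ok))

    def should_report(self):
        """Report only when the last `threshold` runs all failed (assumed reading)."""
        recent = list(self.runs)[-self.threshold:]
        return len(recent) == self.threshold and not any(recent)
```

With such a model a single transient error stays silent, and one successful run resets the reporting state.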
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
On 30/06/20 at 09:19 +0200, Baptiste BEAUPLAT wrote:
> On 6/29/20 11:34 PM, Raphael Hertzog wrote:
> > On Mon, 29 Jun 2020, Baptiste BEAUPLAT wrote:
> >>> Indeed, creating a dedicated service for this does not seem a good idea.
> >>
> >> I would love to have this feature integrated directly with
> >> distro-tracker. However, I'm wondering about the load that would cause
> >> for the service.
> >
> > Network requests do not generate much "load", such processes spend the bulk
> > of their time waiting on the network.
>
> True that.
>
> >> The duck worker has to process around 460k urls (only counting
> >> Homepage) in less than 24h.
> >
> > How do you get to that figure? We don't have that many source packages
> > and even if you consider multiple URLs for each source package due to
> > changes over time (in multiple releases), that makes way too many URLs
> > per source package.
>
> Err, sorry about that. That figure is the result of:
>
>   $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
>       zgrep -v Homepage: | sort -u | wc -l
>   458804
>
> Which is obviously wrong. Here is the real number:
>
>   $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
>       zgrep Homepage: | sort -u | wc -l
>   26250
>
> >> I'm not sure that can be done properly using
> >> the distro-tracker tasks (parallel workers are needed to work around
> >> timeout). Obviously that can be optimized (different check delay for
> >> different results) but that's still bulk network related tasks.
> >
> > Nothing forbids parallel workers and in any case, I welcome any
> > improvement to the task mechanism to make that kind of parallelism easier
> > to handle.
> >
> > There are other tasks that could benefit from this (and in general I want
> > to merge more of such features in distro-tracker to make them available to
> > derivatives too).
>
> Then, let's add this to distro-tracker :)
>
> I've created an issue on the project on salsa so we can discuss
> technical details:
>
> https://salsa.debian.org/qa/distro-tracker/-/issues/51
>
> As I've said before, I would like to finish up on a couple of other
> projects (namely mentors.d.n and snapshot.d.o) and I will be available
> right after that.

Hi,

I don't really want to push for it (doing it in distro-tracker and then
importing the data into UDD is fine), but another alternative would be
to include this directly into UDD, similarly to what is done for the
'upstream' importer that checks debian/watch using uscan.

It would boil down to:

1) identify the URLs that need to be checked:

     select distinct homepage from
       (select homepage from sources
        union
        select homepage from packages) t;

   Or maybe better:

     select distinct homepage from (
       select homepage from sources
        where release in ('sid', 'experimental')
       union
       select homepage from packages
        where release in ('sid', 'experimental')
     ) t;

2) populate/update a table with:
   (url, last_check_timestamp, status, detailed_status)
   (obviously, with whatever policy is needed about retries/refreshes)

3) export the data (for example as a JSON file) so that it can be used
   by other services

Lucas
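The three steps above can be sketched end to end as follows. This is only a rough illustration under stated assumptions: an in-memory SQLite database stands in for UDD's PostgreSQL, the demo rows and the `url_checks` table name are made up, and the `check` callable is a placeholder for whatever URL checker ends up being used. Only the `sources`/`packages` tables, the `homepage` and `release` columns, and the four result columns come from the proposal itself.

```python
import json
import sqlite3
from datetime import datetime, timezone


def make_demo_db():
    """Build a tiny stand-in for UDD (the schema here is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE sources (source TEXT, "release" TEXT, homepage TEXT)')
    conn.execute('CREATE TABLE packages (package TEXT, "release" TEXT, homepage TEXT)')
    conn.executemany("INSERT INTO sources VALUES (?, ?, ?)", [
        ("a", "sid", "https://example.org/a"),
        ("a", "buster", "https://example.org/old"),
    ])
    conn.executemany("INSERT INTO packages VALUES (?, ?, ?)", [
        ("a-bin", "sid", "https://example.org/a"),
        ("b-bin", "experimental", "https://example.org/b"),
    ])
    return conn


def distinct_homepages(conn):
    """Step 1: distinct homepages from sid and experimental."""
    rows = conn.execute("""
        SELECT DISTINCT homepage FROM (
            SELECT homepage FROM sources
             WHERE "release" IN ('sid', 'experimental')
            UNION
            SELECT homepage FROM packages
             WHERE "release" IN ('sid', 'experimental')
        ) t
        WHERE homepage IS NOT NULL AND homepage <> ''
        ORDER BY homepage
    """)
    return [row[0] for row in rows]


def record_checks(conn, urls, check, now=None):
    """Step 2: store (url, last_check_timestamp, status, detailed_status)."""
    now = now or datetime.now(timezone.utc).isoformat()
    conn.execute("""
        CREATE TABLE IF NOT EXISTS url_checks (
            url TEXT PRIMARY KEY,
            last_check_timestamp TEXT,
            status TEXT,
            detailed_status TEXT
        )
    """)
    for url in urls:
        status, detail = check(url)  # `check` is supplied by the caller
        conn.execute(
            "INSERT OR REPLACE INTO url_checks VALUES (?, ?, ?, ?)",
            (url, now, status, detail),
        )
    conn.commit()


def export_json(conn):
    """Step 3: dump the table as JSON for other services."""
    keys = ("url", "last_check_timestamp", "status", "detailed_status")
    rows = conn.execute(
        "SELECT url, last_check_timestamp, status, detailed_status "
        "FROM url_checks ORDER BY url"
    )
    return json.dumps([dict(zip(keys, row)) for row in rows], indent=2)
```

Any retry or refresh policy would slot into step 2, for example by skipping URLs whose `last_check_timestamp` is recent enough.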
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
On 6/30/20 9:29 AM, Mattia Rizzolo wrote:
> Just a note before you head toward implementing that: the Homepage field
> is similar to Section, in the way that it can also be specified in the
> binary paragraphs, not just the source paragraphs.
> You can see that as the Homepage field is present in the DEBIAN binary
> control field of the .debs, and clearly that value might be different
> than the one in Homepage of the .dsc.
>
> So please, look harder for Homepage, not just in the first paragraph of
> d/control ;)

A good list of places to look can be found in:

https://salsa.debian.org/debian/duck/-/tree/master/lib/checks

--
Baptiste BEAUPLAT - lyknode
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
On Tue, Jun 30, 2020 at 09:19:31AM +0200, Baptiste BEAUPLAT wrote:
> On 6/29/20 11:34 PM, Raphael Hertzog wrote:
> >> The duck worker has to process around 460k urls (only counting
> >> Homepage) in less than 24h.
> >
> > How do you get to that figure? We don't have that many source packages
> > and even if you consider multiple URLs for each source package due to
> > changes over time (in multiple releases), that makes way too many URLs
> > per source package.
>
> Err, sorry about that. That figure is the result of:
>
>   $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
>       zgrep -v Homepage: | sort -u | wc -l
>   458804
>
> Which is obviously wrong. Here is the real number:
>
>   $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
>       zgrep Homepage: | sort -u | wc -l
>   26250

Just a note before you head toward implementing that: the Homepage field
is similar to Section, in the way that it can also be specified in the
binary paragraphs, not just the source paragraphs.
You can see that as the Homepage field is present in the DEBIAN binary
control field of the .debs, and clearly that value might be different
than the one in Homepage of the .dsc.

So please, look harder for Homepage, not just in the first paragraph of
d/control ;)

--
regards,
                        Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18 4D18 4B04 3FCD B944 4540      .''`.
More about me: https://mapreri.org                              : :'  :
Launchpad user: https://launchpad.net/~mapreri                  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-
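To make the point about binary paragraphs concrete, here is a minimal sketch that collects Homepage from every paragraph of a d/control file rather than only the first. It is a simplified stand-alone parser written for this example (the function name `homepages_by_paragraph` is hypothetical), not duck's implementation; real code would more likely rely on an existing deb822 parser.

```python
import re


def homepages_by_paragraph(control_text):
    """Return (paragraph name, homepage) for each paragraph that sets Homepage.

    Simplified deb822 handling: paragraphs are separated by blank lines,
    continuation lines (leading whitespace) and comments are skipped.
    """
    results = []
    for block in re.split(r"\n\s*\n", control_text.strip()):
        fields = {}
        for line in block.splitlines():
            if not line or line[0] in " \t#":
                continue
            name, sep, value = line.partition(":")
            if sep:
                fields[name.strip().lower()] = value.strip()
        # The source paragraph is named by Source:, binary ones by Package:.
        paragraph = fields.get("source") or fields.get("package")
        if "homepage" in fields:
            results.append((paragraph, fields["homepage"]))
    return results


SAMPLE_CONTROL = """\
Source: foo
Homepage: https://example.org/foo

Package: foo-bin
Architecture: any
Description: main binary
 Long description.

Package: foo-doc
Homepage: https://example.org/foo-doc
Description: documentation
"""
```

On the sample above, both the source paragraph and `foo-doc` contribute a URL, while `foo-bin` contributes none of its own.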
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
On 6/29/20 11:34 PM, Raphael Hertzog wrote:
> On Mon, 29 Jun 2020, Baptiste BEAUPLAT wrote:
>>> Indeed, creating a dedicated service for this does not seem a good idea.
>>
>> I would love to have this feature integrated directly with
>> distro-tracker. However, I'm wondering about the load that would cause
>> for the service.
>
> Network requests do not generate much "load", such processes spend the bulk
> of their time waiting on the network.

True that.

>> The duck worker has to process around 460k urls (only counting
>> Homepage) in less than 24h.
>
> How do you get to that figure? We don't have that many source packages
> and even if you consider multiple URLs for each source package due to
> changes over time (in multiple releases), that makes way too many URLs
> per source package.

Err, sorry about that. That figure is the result of:

  $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
      zgrep -v Homepage: | sort -u | wc -l
  458804

Which is obviously wrong. Here is the real number:

  $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz |
      zgrep Homepage: | sort -u | wc -l
  26250

>> I'm not sure that can be done properly using
>> the distro-tracker tasks (parallel workers are needed to work around
>> timeout). Obviously that can be optimized (different check delay for
>> different results) but that's still bulk network related tasks.
>
> Nothing forbids parallel workers and in any case, I welcome any
> improvement to the task mechanism to make that kind of parallelism easier
> to handle.
>
> There are other tasks that could benefit from this (and in general I want
> to merge more of such features in distro-tracker to make them available to
> derivatives too).

Then, let's add this to distro-tracker :)

I've created an issue on the project on salsa so we can discuss
technical details:

https://salsa.debian.org/qa/distro-tracker/-/issues/51

As I've said before, I would like to finish up on a couple of other
projects (namely mentors.d.n and snapshot.d.o) and I will be available
right after that.

Best,
--
Baptiste BEAUPLAT - lyknode
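The difference between the two pipelines above is only the `-v` flag, which inverts the match: the first one counts every distinct line that does not mention Homepage, the second counts distinct Homepage lines. A small sketch of the same logic, with a made-up function name and made-up sample data:

```python
def distinct_matching(lines, needle="Homepage:", invert=False):
    """Mimic `grep [-v] needle | sort -u | wc -l` on an iterable of lines."""
    return len({line for line in lines if (needle in line) != invert})


# Invented sample in the style of a Sources index.
SAMPLE_LINES = [
    "Package: a",
    "Homepage: https://example.org/a",
    "",
    "Package: b",
    "Homepage: https://example.org/a",
    "",
    "Package: c",
    "Homepage: https://example.org/c",
]
```

Here `distinct_matching(SAMPLE_LINES)` counts the two distinct Homepage lines, while `invert=True` counts the four other distinct lines (the three Package lines plus the empty one), which is why the inverted figure grows with the size of the whole file.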
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
On Mon, 29 Jun 2020, Baptiste BEAUPLAT wrote:
> > Indeed, creating a dedicated service for this does not seem a good idea.
>
> I would love to have this feature integrated directly with
> distro-tracker. However, I'm wondering about the load that would cause
> for the service.

Network requests do not generate much "load", such processes spend the
bulk of their time waiting on the network.

> The duck worker has to process around 460k urls (only counting
> Homepage) in less than 24h.

How do you get to that figure? We don't have that many source packages
and even if you consider multiple URLs for each source package due to
changes over time (in multiple releases), that makes way too many URLs
per source package.

> I'm not sure that can be done properly using
> the distro-tracker tasks (parallel workers are needed to work around
> timeout). Obviously that can be optimized (different check delay for
> different results) but that's still bulk network related tasks.

Nothing forbids parallel workers and in any case, I welcome any
improvement to the task mechanism to make that kind of parallelism
easier to handle.

There are other tasks that could benefit from this (and in general I
want to merge more of such features in distro-tracker to make them
available to derivatives too).

Cheers,
--
  ⢀⣴⠾⠻⢶⣦⠀   Raphaël Hertzog
  ⣾⠁⢠⠒⠀⣿⡁
  ⢿⡄⠘⠷⠚⠋    The Debian Handbook: https://debian-handbook.info/get/
  ⠈⠳⣄       Debian Long Term Support: https://deb.li/LTS
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
Hi Bastian, Raphael,

On 6/29/20 3:55 PM, Raphael Hertzog wrote:
> On Sun, 28 Jun 2020, Bastian Blank wrote:
>>> Baptiste (CCed) volunteered to write it over again, but for now there is
>>> no clear timeline as for when the new project will be started.
>>
>> Maybe you could add that to vcswatch?

The main differences between vcswatch and duck.d.n are:

- history: duck used to keep 6 runs for each package, reporting only
  after 3 failures. vcswatch only keeps the last run.
- d/control: duck processed the Homepage as well as the
  Vcs-{Git,SVN,Hg,Darcs} fields. vcswatch has a wider support for all
  Vcs-*.
- d/upstream/metadata: duck processed any urls found here.
- worker: parallel worker for duck, single instance for vcswatch.

(sorry if I got anything wrong here. Please correct me!)

I'm not convinced that adding those features would result in an
improvement for vcswatch (Cc'ing Christoph to have his input on that).

Creating a new sub-project, like vcswatch, to qa.debian.org would be
more sensible IMHO. The new duck could only take care of the http urls
and leave Vcs stuff to vcswatch.

> or distro-tracker?
>
> Indeed, creating a dedicated service for this does not seem a good idea.

I would love to have this feature integrated directly with
distro-tracker. However, I'm wondering about the load that would cause
for the service.

The duck worker has to process around 460k urls (only counting
Homepage) in less than 24h. I'm not sure that can be done properly using
the distro-tracker tasks (parallel workers are needed to work around
timeout). Obviously that can be optimized (different check delay for
different results) but that's still bulk network related tasks.

Another thing is that duck.d.n was delegating the actual checks to the
`duck` perl library. To work with distro-tracker I would need to drop
that and implement something similar in Python. Not a huge task per se,
but something to keep in mind.

I'm not sure what is best here and I'm looking forward to your
suggestions and remarks.

--
Baptiste BEAUPLAT - lyknode
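The "different check delay for different results" optimisation mentioned above could look roughly like this. The delays, status names and function name are invented for the example; nothing in the thread specifies them.

```python
from datetime import datetime, timedelta

# Hypothetical policy: healthy URLs are rechecked rarely, failing ones sooner.
RECHECK_DELAY = {
    "ok": timedelta(days=7),
    "error": timedelta(days=1),
}
DEFAULT_DELAY = timedelta(days=1)


def next_check(last_check, status):
    """Return when a URL is due again, based on its last result."""
    return last_check + RECHECK_DELAY.get(status, DEFAULT_DELAY)
```

Spreading rechecks this way would cut the number of URLs due on any given day, which directly reduces the bulk network work discussed above.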
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
Hi,

On Sun, 28 Jun 2020, Bastian Blank wrote:
> > Baptiste (CCed) volunteered to write it over again, but for now there is
> > no clear timeline as for when the new project will be started.
>
> Maybe you could add that to vcswatch?

or distro-tracker?

Indeed, creating a dedicated service for this does not seem a good idea.

https://qa.pages.debian.net/distro-tracker/contributing.html
https://qa.pages.debian.net/distro-tracker/devel/design.html#tasks-framework

Cheers,
--
  ⢀⣴⠾⠻⢶⣦⠀   Raphaël Hertzog
  ⣾⠁⢠⠒⠀⣿⡁
  ⢿⡄⠘⠷⠚⠋    The Debian Handbook: https://debian-handbook.info/get/
  ⠈⠳⣄       Debian Long Term Support: https://deb.li/LTS
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
On Sun, Jun 28, 2020 at 09:40:12PM +0200, Mattia Rizzolo wrote:
> On Sun, Jun 28, 2020 at 05:32:05PM +0200, Lucas Nussbaum wrote:
> > The importer uses http://duck.debian.net/ which doesn't resolve anymore.

duck.d.n in the past pulled git repositories from salsa.d.o, not sure
what exactly it did with them. However it stopped pulling at least
three months ago.

>  * it turns out said code was not freely licensed and as such not easily
>    usable by a new maintainer in a new deployment

Nice, not really.

> Baptiste (CCed) volunteered to write it over again, but for now there is
> no clear timeline as for when the new project will be started.

Maybe you could add that to vcswatch?

Regards,
Bastian

--
Military secrets are the most fleeting of all.
    -- Spock, "The Enterprise Incident", stardate 5027.4
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
On Sun, Jun 28, 2020 at 05:32:05PM +0200, Lucas Nussbaum wrote:
> The importer uses http://duck.debian.net/ which doesn't resolve anymore.

Some context:
 * the previous maintainer of duck.d.n retired, and as such the .d.n
   domain was removed
 * the previous maintainer was contacted to have at least access to the
   previously running code
 * it turns out said code was not freely licensed and as such not easily
   usable by a new maintainer in a new deployment

Baptiste (CCed) volunteered to write it over again, but for now there is
no clear timeline as for when the new project will be started.

--
regards,
                        Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18 4D18 4B04 3FCD B944 4540      .''`.
More about me: https://mapreri.org                              : :'  :
Launchpad user: https://launchpad.net/~mapreri                  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-
Bug#963887: UDD: 'duck' importer broken since 2020-05-25
Package: qa.debian.org
User: qa.debian@packages.debian.org
Usertags: udd

The importer uses http://duck.debian.net/ which doesn't resolve anymore.