Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-07-03 Thread Christoph Berg
Re: Baptiste BEAUPLAT
> >> Maybe you could add that to vcswatch?
> 
> The main differences between vcswatch and duck.d.n are:
> 
> - history: duck used to keep 6 runs for each package, reporting only
> after 3 failures. vcswatch only keeps the last run.

vcswatch could be improved by not notifying the users for each error.
At the moment the data model is very simple, but adding that would be
possible I'd think.
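
For illustration, the history policy described for duck above (6 runs kept
per package, report only after 3 failures) could be sketched roughly as
follows. This is not duck's or vcswatch's actual code: the class name is
hypothetical, and reading "3 failures" as "the 3 most recent runs all
failed" is an assumption.

```python
from collections import deque

HISTORY_SIZE = 6       # runs kept per package, as described for duck
FAILURE_THRESHOLD = 3  # failures needed before reporting


class CheckHistory:
    """Keep the most recent runs for one package (hypothetical helper)."""

    def __init__(self):
        # A bounded deque drops the oldest run automatically once full.
        self.runs = deque(maxlen=HISTORY_SIZE)

    def record(self, ok):
        self.runs.append(bool(ok))

    def should_report(self):
        # Assumption: report only when the 3 most recent runs all failed.
        recent = list(self.runs)[-FAILURE_THRESHOLD:]
        return len(recent) == FAILURE_THRESHOLD and not any(recent)


history = CheckHistory()
for ok in (True, False, False):
    history.record(ok)
print(history.should_report())  # only two failures so far
history.record(False)
print(history.should_report())  # third failure in a row
```

A single transient error is thus never reported on its own, which is the
behaviour missing from a model that keeps only the last run.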

> - d/control: duck processed the Homepage as well as the
> Vcs-{Git,SVN,Hg,Darcs} fields. vcswatch has a wider support for all Vcs-*.

There's not much that vcswatch supports on top of that: only CVS, and I
don't think anyone is still actively using it; the remaining entries
are just bitrot.

> - d/upstream/metadata: duck processed any urls found here.
> - worker: parallel worker for duck, single instance for vcswatch.

vcswatch can start several workers in parallel, the current config
starts up to 5 workers.

> I'm not convinced that adding those features would result in an
> improvement for vcswatch (Cc'ing Christoph to have his input on that).
> 
> Creating a new sub-project, like vcswatch, to qa.debian.org would be
> more sensible IMHO. The new duck could only take care of the http urls
> and leave Vcs stuff to vcswatch.

You could reuse the vcsimport machinery. It's not very pretty, but
works.

Christoph



Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-30 Thread Lucas Nussbaum
On 30/06/20 at 09:19 +0200, Baptiste BEAUPLAT wrote:
> On 6/29/20 11:34 PM, Raphael Hertzog wrote:
> > On Mon, 29 Jun 2020, Baptiste BEAUPLAT wrote:
> >>> Indeed, creating a dedicated service for this does not seem a good idea.
> >>
> >> I would love to have this feature integrated directly with
> >> distro-tracker. However, I'm wondering about the load that would cause
> >> for the service.
> > 
> > Network requests do not generate much "load"; such processes spend the bulk
> > of their time waiting on the network.
> 
> True that.
> 
> >> The duck worker has to process around 460,000 urls (only counting
> >> Homepage) in less than 24h.
> > 
> > How do you get to that figure? We don't have that many source packages
> > and even if you consider multiple URLs for each source package due to
> > changes over time (in multiple releases), that makes way too many URLs
> > per source package.
> 
> Err, sorry about that. That figure is the result of:
> 
> $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz \
>     | zgrep -v Homepage: | sort -u | wc -l
> 458804
> 
> Which is obviously wrong. Here is the real number:
> 
> $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz \
>     | zgrep Homepage: | sort -u | wc -l
> 26250
> 
> >> I'm not sure that can be done properly using
> >> the distro-tracker tasks (parallel workers are needed to work around
> >> timeout). Obviously that can be optimized (different check delay for
> >> different results) but that's still bulk network related tasks.
> > 
> > Nothing forbids parallel workers and in any case, I welcome any
> > improvement to the task mechanism to make that kind of parallelism easier
> > to handle.
> > 
> > There are other tasks that could benefit from this (and in general I want
> > to merge more of such features in distro-tracker to make them available to
> > derivatives too).
> 
> Then, let's add this to distro-tracker :)
> 
> I've created an issue on the project on salsa so we can discuss
> technical details:
> 
> https://salsa.debian.org/qa/distro-tracker/-/issues/51
> 
> As I've said before, I would like to finish up on a couple of other
> projects (namely mentors.d.n and snapshot.d.o) and I will be available
> right after that.

Hi,

I don't really want to push for it (doing it in distro-tracker and
then importing the data into UDD is fine), but another alternative would
be to include this directly in UDD, similarly to what is done for the
'upstream' importer that checks debian/watch using uscan.

It would boil down to:

1) identify the URLs that need to be checked:

select distinct homepage
from (select homepage from sources union select homepage from packages) t;

Or maybe better:
select distinct homepage
from (
   select homepage from sources where release in ('sid', 'experimental')
   union select homepage from packages where release in ('sid','experimental')
) t;

2) populate/update a table with:
(url, last_check_timestamp, status, detailed_status)
(obviously, with whatever policy is needed about retries/refreshes)

3) export the data (for example as a JSON file) so that it can be used
by other services
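
Steps 2 and 3 above could be sketched along these lines. This is only a
minimal illustration: UDD is a PostgreSQL database, and sqlite3 is used
here purely as a self-contained stand-in. The table name `url_checks`, the
function names, and the sample URLs and statuses are all hypothetical.

```python
import json
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # stand-in for the UDD database
conn.execute(
    """CREATE TABLE url_checks (
           url TEXT PRIMARY KEY,
           last_check_timestamp INTEGER,
           status TEXT,
           detailed_status TEXT
       )"""
)


def record_check(url, status, detailed_status, now=None):
    """Insert or refresh the row for one URL (step 2)."""
    now = int(time.time()) if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO url_checks VALUES (?, ?, ?, ?)",
        (url, now, status, detailed_status),
    )


def export_json():
    """Dump the whole table as a JSON string (step 3)."""
    rows = conn.execute(
        "SELECT url, last_check_timestamp, status, detailed_status "
        "FROM url_checks ORDER BY url"
    )
    keys = ("url", "last_check_timestamp", "status", "detailed_status")
    return json.dumps([dict(zip(keys, row)) for row in rows], indent=2)


record_check("https://example.org/", "ok", "HTTP 200", now=1593500000)
record_check("https://example.invalid/", "error", "DNS failure", now=1593500000)
print(export_json())
```

Keying the table on the URL rather than on the package means each distinct
homepage is checked once, however many packages share it.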

Lucas




Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-30 Thread Baptiste BEAUPLAT
On 6/30/20 9:29 AM, Mattia Rizzolo wrote:
> Just a note before you head toward implementing that: the Homepage field
> is similar to Section, in the way that it can also be specified in the
> binary paragraphs, not just the source paragraphs.
> You can see that as the Homepage field is present in the binary control
> file (DEBIAN/control) of the .debs, and clearly that value might be different
> than the one in Homepage of the .dsc.
> 
> So please, look harder for Homepage, not just in the first paragraph of
> d/control ;)

A good list of places to look can be found in:

https://salsa.debian.org/debian/duck/-/tree/master/lib/checks

-- 
Baptiste BEAUPLAT - lyknode





Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-30 Thread Mattia Rizzolo
On Tue, Jun 30, 2020 at 09:19:31AM +0200, Baptiste BEAUPLAT wrote:
> On 6/29/20 11:34 PM, Raphael Hertzog wrote:
> >> The duck worker has to process around 460,000 urls (only counting
> >> Homepage) in less than 24h.
> > 
> > How do you get to that figure? We don't have that many source packages
> > and even if you consider multiple URLs for each source package due to
> > changes over time (in multiple releases), that makes way too many URLs
> > per source package.
> 
> Err, sorry about that. That figure is the result of:
> 
> $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz \
>     | zgrep -v Homepage: | sort -u | wc -l
> 458804
> 
> Which is obviously wrong. Here is the real number:
> 
> $ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz \
>     | zgrep Homepage: | sort -u | wc -l
> 26250

Just a note before you head toward implementing that: the Homepage field
is similar to Section, in the way that it can also be specified in the
binary paragraphs, not just the source paragraphs.
You can see that as the Homepage field is present in the binary control
file (DEBIAN/control) of the .debs, and clearly that value might be different
than the one in Homepage of the .dsc.

So please, look harder for Homepage, not just in the first paragraph of
d/control ;)
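
As a rough illustration of the point above, the following sketch collects
Homepage values from every paragraph of a d/control file rather than only
the first one. The sample control file and the function name are made up,
and the parsing is deliberately simplistic (a real implementation would
more likely use a proper deb822 parser).

```python
SAMPLE_CONTROL = """\
Source: example
Section: utils
Homepage: https://example.org/

Package: example
Architecture: any
Description: main tool

Package: example-doc
Architecture: all
Homepage: https://docs.example.org/
Description: documentation
"""


def homepages(control_text):
    """Return the distinct Homepage values from every paragraph."""
    found = set()
    for line in control_text.splitlines():
        # Field names are case-insensitive; continuation lines start
        # with whitespace, so their "name" part never equals "homepage".
        name, sep, value = line.partition(":")
        if sep and name.lower() == "homepage":
            found.add(value.strip())
    return sorted(found)


print(homepages(SAMPLE_CONTROL))
```

Here the binary paragraph for `example-doc` overrides the source-level
value, so both URLs would need to be checked.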

-- 
regards,
Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18  4D18 4B04 3FCD B944 4540  .''`.
More about me:  https://mapreri.org : :'  :
Launchpad user: https://launchpad.net/~mapreri  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-




Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-30 Thread Baptiste BEAUPLAT
On 6/29/20 11:34 PM, Raphael Hertzog wrote:
> On Mon, 29 Jun 2020, Baptiste BEAUPLAT wrote:
>>> Indeed, creating a dedicated service for this does not seem a good idea.
>>
>> I would love to have this feature integrated directly with
>> distro-tracker. However, I'm wondering about the load that would cause
>> for the service.
> 
> Network requests do not generate much "load"; such processes spend the bulk
> of their time waiting on the network.

True that.

>> The duck worker has to process around 460,000 urls (only counting
>> Homepage) in less than 24h.
> 
> How do you get to that figure? We don't have that many source packages
> and even if you consider multiple URLs for each source package due to
> changes over time (in multiple releases), that makes way too many URLs
> per source package.

Err, sorry about that. That figure is the result of:

$ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz \
    | zgrep -v Homepage: | sort -u | wc -l
458804

Which is obviously wrong. Here is the real number:

$ curl -s http://deb.debian.org/debian/dists/unstable/main/source/Sources.gz \
    | zgrep Homepage: | sort -u | wc -l
26250
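
The difference between the two pipelines is the `-v` flag, which inverts
the match: the first command counted every distinct line that does not
contain "Homepage:". A small sketch of the same two counts, on a made-up
Sources excerpt (package names and URLs are illustrative only):

```python
SAMPLE_SOURCES = """\
Package: foo
Version: 1.0-1
Homepage: https://foo.example.org/

Package: bar
Version: 2.3-1
Homepage: https://bar.example.org/

Package: foo-extras
Version: 1.0-1
Homepage: https://foo.example.org/
"""

lines = SAMPLE_SOURCES.splitlines()

# Equivalent of `grep Homepage: | sort -u | wc -l`
matching = {line for line in lines if "Homepage:" in line}

# Equivalent of `grep -v Homepage: | sort -u | wc -l`: every other
# distinct line (package names, versions, blank lines, ...).
non_matching = {line for line in lines if "Homepage:" not in line}

print(len(matching), len(non_matching))
```

At 26250 URLs per 24 hours, a single sequential worker would have a budget
of roughly 3.3 seconds per URL.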

>> I'm not sure that can be done properly using
>> the distro-tracker tasks (parallel workers are needed to work around
>> timeout). Obviously that can be optimized (different check delay for
>> different results) but that's still bulk network related tasks.
> 
> Nothing forbids parallel workers and in any case, I welcome any
> improvement to the task mechanism to make that kind of parallelism easier
> to handle.
> 
> There are other tasks that could benefit from this (and in general I want
> to merge more of such features in distro-tracker to make them available to
> derivatives too).

Then, let's add this to distro-tracker :)

I've created an issue on the project on salsa so we can discuss
technical details:

https://salsa.debian.org/qa/distro-tracker/-/issues/51

As I've said before, I would like to finish up on a couple of other
projects (namely mentors.d.n and snapshot.d.o) and I will be available
right after that.

Best,
-- 
Baptiste BEAUPLAT - lyknode





Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-29 Thread Raphael Hertzog
On Mon, 29 Jun 2020, Baptiste BEAUPLAT wrote:
> > Indeed, creating a dedicated service for this does not seem a good idea.
> 
> I would love to have this feature integrated directly with
> distro-tracker. However, I'm wondering about the load that would cause
> for the service.

Network requests do not generate much "load"; such processes spend the bulk
of their time waiting on the network.

> The duck worker has to process around 460,000 urls (only counting
> Homepage) in less than 24h.

How do you get to that figure? We don't have that many source packages
and even if you consider multiple URLs for each source package due to
changes over time (in multiple releases), that makes way too many URLs
per source package.

> I'm not sure that can be done properly using
> the distro-tracker tasks (parallel workers are needed to work around
> timeout). Obviously that can be optimized (different check delay for
> different results) but that's still bulk network related tasks.

Nothing forbids parallel workers and in any case, I welcome any
improvement to the task mechanism to make that kind of parallelism easier
to handle.
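
To illustrate why parallel workers suit this kind of task, here is a
minimal sketch using a thread pool. It does not use the distro-tracker
task framework; the function names, the pool size of 5 and the simulated
0.1 s delay are all assumptions, and `fake_check` merely sleeps instead of
making a real HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 5  # arbitrary pool size for this sketch


def fake_check(url):
    """Stand-in for a real HTTP check: just waits, then reports success."""
    time.sleep(0.1)  # simulates time spent waiting on the network
    return url, "ok"


def check_all(urls, check=fake_check, max_workers=MAX_WORKERS):
    """Run `check` over all URLs using a pool of threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(check, urls))


urls = [f"https://example.org/{i}" for i in range(10)]
start = time.monotonic()
results = check_all(urls)
elapsed = time.monotonic() - start
print(len(results), f"{elapsed:.2f}s")
```

Because each worker spends nearly all its time blocked on I/O, threads are
enough here; the wall-clock time shrinks with the pool size while CPU use
stays negligible.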

There are other tasks that could benefit from this (and in general I want
to merge more of such features in distro-tracker to make them available to
derivatives too).

Cheers,
-- 
  ⢀⣴⠾⠻⢶⣦⠀   Raphaël Hertzog 
  ⣾⠁⢠⠒⠀⣿⡁
  ⢿⡄⠘⠷⠚⠋The Debian Handbook: https://debian-handbook.info/get/
  ⠈⠳⣄   Debian Long Term Support: https://deb.li/LTS




Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-29 Thread Baptiste BEAUPLAT
Hi Bastian, Raphael,

On 6/29/20 3:55 PM, Raphael Hertzog wrote:
> On Sun, 28 Jun 2020, Bastian Blank wrote:
>>> Baptiste (CCed) volunteered to write it over again, but for now there is
>>> no clear timeline as for when the new project will be started.
>>
>> Maybe you could add that to vcswatch?

The main differences between vcswatch and duck.d.n are:

- history: duck used to keep 6 runs for each package, reporting only
after 3 failures. vcswatch only keeps the last run.
- d/control: duck processed the Homepage as well as the
Vcs-{Git,SVN,Hg,Darcs} fields. vcswatch has a wider support for all Vcs-*.
- d/upstream/metadata: duck processed any urls found here.
- worker: parallel worker for duck, single instance for vcswatch.

(sorry if I got anything wrong here. Please correct me!)

I'm not convinced that adding those features would result in an
improvement for vcswatch (Cc'ing Christoph to have his input on that).

Creating a new sub-project, like vcswatch, to qa.debian.org would be
more sensible IMHO. The new duck could only take care of the http urls
and leave Vcs stuff to vcswatch.

> or distro-tracker?
> 
> Indeed, creating a dedicated service for this does not seem a good idea.

I would love to have this feature integrated directly with
distro-tracker. However, I'm wondering about the load that would cause
for the service.

The duck worker has to process around 460,000 urls (only counting
Homepage) in less than 24h. I'm not sure that can be done properly using
the distro-tracker tasks (parallel workers are needed to work around
timeout). Obviously that can be optimized (different check delay for
different results) but that's still bulk network related tasks.
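
The "different check delay for different results" idea could look roughly
like the sketch below. The status names and the intervals are purely
illustrative assumptions, not values used by duck.

```python
from datetime import datetime, timedelta

# Hypothetical policy: all statuses and intervals are illustrative.
RECHECK_DELAY = {
    "ok": timedelta(days=7),
    "redirect": timedelta(days=3),
    "error": timedelta(days=1),
}
DEFAULT_DELAY = timedelta(days=1)


def next_check(last_check, status):
    """Return when a URL with the given last status is due again."""
    return last_check + RECHECK_DELAY.get(status, DEFAULT_DELAY)


def is_due(last_check, status, now):
    """Tell whether the URL should be checked again at time `now`."""
    return now >= next_check(last_check, status)


last = datetime(2020, 6, 29, 12, 0)
now = datetime(2020, 7, 1, 12, 0)
print(is_due(last, "ok", now), is_due(last, "error", now))
```

Checking healthy URLs less often than failing ones would reduce the number
of requests per day without delaying the detection of fixes.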

Another thing is that duck.d.n was delegating the actual checks to the
`duck` perl library. To work with distro-tracker I would need to drop
that and implement something similar in Python. Not a huge task per se,
but something to keep in mind.

I'm not sure what is best here and I'm looking forward to your
suggestions and remarks.
-- 
Baptiste BEAUPLAT - lyknode





Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-29 Thread Raphael Hertzog
Hi,

On Sun, 28 Jun 2020, Bastian Blank wrote:
> > Baptiste (CCed) volunteered to write it over again, but for now there is
> > no clear timeline as for when the new project will be started.
> 
> Maybe you could add that to vcswatch?

or distro-tracker?

Indeed, creating a dedicated service for this does not seem a good idea.

https://qa.pages.debian.net/distro-tracker/contributing.html
https://qa.pages.debian.net/distro-tracker/devel/design.html#tasks-framework

Cheers,
-- 
  ⢀⣴⠾⠻⢶⣦⠀   Raphaël Hertzog 
  ⣾⠁⢠⠒⠀⣿⡁
  ⢿⡄⠘⠷⠚⠋The Debian Handbook: https://debian-handbook.info/get/
  ⠈⠳⣄   Debian Long Term Support: https://deb.li/LTS



Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-28 Thread Bastian Blank
On Sun, Jun 28, 2020 at 09:40:12PM +0200, Mattia Rizzolo wrote:
> On Sun, Jun 28, 2020 at 05:32:05PM +0200, Lucas Nussbaum wrote:
> > The importer uses http://duck.debian.net/ which doesn't resolve anymore.

duck.d.n in the past pulled git repositories from salsa.d.o, not sure
what exactly it did with them.  However it stopped pulling at least
three months ago.

>  * it turns out said code was not freely licensed and as such not easily
>    usable by a new maintainer in a new deployment

Nice, not really.

> Baptiste (CCed) volunteered to write it over again, but for now there is
> no clear timeline as for when the new project will be started.

Maybe you could add that to vcswatch?

Regards,
Bastian

-- 
Military secrets are the most fleeting of all.
-- Spock, "The Enterprise Incident", stardate 5027.4



Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-28 Thread Mattia Rizzolo
On Sun, Jun 28, 2020 at 05:32:05PM +0200, Lucas Nussbaum wrote:
> The importer uses http://duck.debian.net/ which doesn't resolve anymore.

Some context:
 * the previous maintainer of duck.d.n retired, and as such the .d.n
   domain was removed
 * the previous maintainer was contacted to have at least access to the
   previously running code
 * it turns out said code was not freely licensed and as such not easily
   usable by a new maintainer in a new deployment

Baptiste (CCed) volunteered to write it over again, but for now there is
no clear timeline as for when the new project will be started.

-- 
regards,
Mattia Rizzolo

GPG Key: 66AE 2B4A FCCF 3F52 DA18  4D18 4B04 3FCD B944 4540  .''`.
More about me:  https://mapreri.org : :'  :
Launchpad user: https://launchpad.net/~mapreri  `. `'`
Debian QA page: https://qa.debian.org/developer.php?login=mattia  `-




Bug#963887: UDD: 'duck' importer broken since 2020-05-25

2020-06-28 Thread Lucas Nussbaum
Package: qa.debian.org
User: qa.debian@packages.debian.org
Usertags: udd

The importer uses http://duck.debian.net/ which doesn't resolve anymore.