Re: Planned MirrorManager changes

2018-04-16 Thread Adrian Reber
On Sat, Apr 14, 2018 at 04:28:37PM -0700, Kevin Fenzi wrote:
> > I would like to change the setup of our mirror crawler and just wanted
> > to mention my planned changes here before working on them.
> > 
> > Currently we have two VMs which are crawling our mirrors. Each of the
> > machines is responsible for one half of the active mirrors. The crawl
> > starts every 12 hours on the first crawler and 6 hours later on the
> > second crawler. So every 6 hours one crawler is accessing the database.
> > 
> > Currently most of the crawling time is not spent crawling but updating
> > the database about which host has which directory up to date. With a
> > timeout of 4 hours per host we are hitting that timeout on some hosts
> > regularly and most of the time the database access is the problem.
> > 
> > What I would like to change is to crawl each category (Fedora Linux,
> > Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> > separately and at different times and intervals.
> > 
> > We would not hit the timeout as often as now as only the information for
> > a single category has to be updated. We could scan 'Fedora Archive' only
> > once per day or every second day. We can scan 'Fedora EPEL' much more
> > often as it is usually really fast and get better data about the
> > available mirrors.
> > 
> > My goal would be to distribute the scanning in such a way to decrease
> > the load on the database and to decrease the cases of mirror
> > auto-deactivation due to slow database accesses. 
> > 
> > Let me know if you think that these planned changes are the wrong
> > direction or if you have other ideas how to improve the mirror crawling.
> 
> Sounds like all great ideas to me. ;)

Thanks.

> I wonder if we could also find some way to note which mirrors have
> iso/image files, and could communicate this to the
> download.fedoraproject.org redirect to only redirect people to mirrors
> that have that specific file if they are pointing to an iso/qcow2, etc.

This is one of the cases where MirrorManager, in theory, should almost
handle it correctly. The important part of this sentence is 'in theory'.
MirrorManager should know about the 3 most recent files in a directory
and if we are crawling via rsync we even download the complete listing
for a mirror. So besides the theory it would help to see a wrong
redirect live to understand why it is happening.
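
To make the idea concrete, here is a minimal sketch of the kind of check
the redirector would need. This is not MirrorManager code; the function
name, the suffix list and the shape of the per-mirror file listing are
assumptions made up for illustration.

```python
# Hypothetical sketch, not MirrorManager's implementation.
# Assumption: for every mirror we have the set of file paths seen during
# the last crawl (for example from a complete rsync listing).
IMAGE_SUFFIXES = (".iso", ".qcow2")


def candidate_mirrors(path, mirror_files):
    """Return the mirrors a request for `path` may be redirected to.

    For image files only mirrors known to carry that exact file are
    returned; for everything else all known mirrors are candidates.
    """
    if not path.endswith(IMAGE_SUFFIXES):
        return sorted(mirror_files)
    return sorted(m for m, files in mirror_files.items() if path in files)
```

If the listing for a mirror is incomplete, a check like this would
wrongly exclude that mirror, which is one reason a live example of a
wrong redirect would help.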

Adrian


___
infrastructure mailing list -- infrastructure@lists.fedoraproject.org
To unsubscribe send an email to infrastructure-le...@lists.fedoraproject.org


Re: Planned MirrorManager changes

2018-04-16 Thread Adrian Reber
On Sat, Apr 14, 2018 at 12:37:24AM +, Stephen John Smoogen wrote:
> On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber wrote:
> 
> > I would like to change the setup of our mirror crawler and just wanted
> > to mention my planned changes here before working on them.
> >
> > Currently we have two VMs which are crawling our mirrors. Each of the
> > machines is responsible for one half of the active mirrors. The crawl
> > starts every 12 hours on the first crawler and 6 hours later on the
> > second crawler. So every 6 hours one crawler is accessing the database.
> >
> > Currently most of the crawling time is not spent crawling but updating
> > the database about which host has which directory up to date. With a
> > timeout of 4 hours per host we are hitting that timeout on some hosts
> > regularly and most of the time the database access is the problem.
> >
> > What I would like to change is to crawl each category (Fedora Linux,
> > Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> > separately and at different times and intervals.
> >
> > We would not hit the timeout as often as now as only the information for
> > a single category has to be updated. We could scan 'Fedora Archive' only
> > once per day or every second day. We can scan 'Fedora EPEL' much more
> > often as it is usually really fast and get better data about the
> > available mirrors.
> >
> > My goal would be to distribute the scanning in such a way to decrease
> > the load on the database and to decrease the cases of mirror
> > auto-deactivation due to slow database accesses.
> >
> > Let me know if you think that these planned changes are the wrong
> > direction or if you have other ideas how to improve the mirror crawling.
> 
> These look like a good way to deal with the fact that we have a lot of data
> and files and mirrors and users get confused about how up to date they are.
> Would more VMs help spread this out also?

From my point of view the main problem is the load MirrorManager creates
on the database. Currently I do not think that more VMs would help the
crawling. Someone once mentioned a dedicated database VM for
MirrorManager. That is something which could make a difference, but
first I would like to see if crawling per category can improve the
situation.
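
A minimal sketch of what crawling per category could look like from the
scheduling side. The intervals below are assumptions picked for
illustration (only "once per day or every second day" for 'Fedora
Archive' and "much more often" for 'Fedora EPEL' come from the proposal),
and due_categories() is a hypothetical helper, not part of MirrorManager.

```python
from datetime import datetime, timedelta

# Hypothetical intervals per category; the real values would have to be
# tuned against the observed database load.
CRAWL_INTERVALS = {
    "Fedora Linux": timedelta(hours=12),
    "Fedora Other": timedelta(hours=12),
    "Fedora EPEL": timedelta(hours=6),
    "Fedora Secondary Arches": timedelta(hours=12),
    "Fedora Archive": timedelta(hours=48),
}


def due_categories(last_crawl, now):
    """Return the categories whose interval has elapsed since the last crawl.

    `last_crawl` maps a category name to the datetime of its last crawl;
    categories that were never crawled are always due.
    """
    due = []
    for category, interval in CRAWL_INTERVALS.items():
        last = last_crawl.get(category)
        if last is None or now - last >= interval:
            due.append(category)
    return due
```

Each crawler run would then only update the database for the categories
returned here instead of for everything at once.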

Adrian




Re: Planned MirrorManager changes

2018-04-14 Thread Kevin Fenzi
On 04/13/2018 08:14 AM, Adrian Reber wrote:
> 
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
> 
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
> 
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
> 
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
> 
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
> 
> My goal would be to distribute the scanning in such a way to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses. 
> 
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas how to improve the mirror crawling.

Sounds like all great ideas to me. ;)

I wonder if we could also find some way to note which mirrors have
iso/image files, and could communicate this to the
download.fedoraproject.org redirect to only redirect people to mirrors
that have that specific file if they are pointing to an iso/qcow2, etc.

Anyhow, the crawler changes sound good to me and thanks again for
working on it.

kevin






Re: Planned MirrorManager changes

2018-04-13 Thread Stephen John Smoogen
On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber wrote:

>
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
>
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
>
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
>
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
>
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
>
> My goal would be to distribute the scanning in such a way to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses.
>
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas how to improve the mirror crawling.
>

These look like a good way to deal with the fact that we have a lot of data
and files and mirrors and users get confused about how up to date they are.
Would more VMs help spread this out also?




> Adrian
-- 
Stephen J Smoogen.