Re: Planned MirrorManager changes

2018-04-16 Thread Adrian Reber
On Sat, Apr 14, 2018 at 04:28:37PM -0700, Kevin Fenzi wrote:
> > I would like to change the setup of our mirror crawler and just wanted
> > to mention my planned changes here before working on them.
> > 
> > Currently we have two VMs which are crawling our mirrors. Each of the
> > machines is responsible for one half of the active mirrors. The crawl
> > starts every 12 hours on the first crawler and 6 hours later on the
> > second crawler. So every 6 hours one crawler is accessing the database.
> > 
> > Currently most of the crawling time is not spent crawling but updating
> > the database about which host has which directory up to date. With a
> > timeout of 4 hours per host we are hitting that timeout on some hosts
> > regularly and most of the time the database access is the problem.
> > 
> > What I would like to change is to crawl each category (Fedora Linux,
> > Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> > separately and at different times and intervals.
> > 
> > We would not hit the timeout as often as now as only the information for
> > a single category has to be updated. We could scan 'Fedora Archive' only
> > once per day or every second day. We can scan 'Fedora EPEL' much more
> > often as it is usually really fast and get better data about the
> > available mirrors.
> > 
> > My goal would be to distribute the scanning in such a way to decrease
> > the load on the database and to decrease the cases of mirror
> > auto-deactivation due to slow database accesses. 
> > 
> > Let me know if you think that these planned changes are the wrong
> > direction or if you have other ideas how to improve the mirror crawling.
> 
> Sounds like all great ideas to me. ;)

Thanks.

> I wonder if we could also find some way to note which mirrors have
> iso/image files, and could communicate this to the
> download.fedoraproject.org redirect to only redirect people to mirrors
> that have that specific file if they are pointing to an iso/qcow2, etc.

This is one of the cases where MirrorManager should, in theory, already
handle this almost correctly. The important part of that sentence is 'in
theory'. MirrorManager should know about the 3 most recent files in a
directory, and if we are crawling via rsync we even download the complete
listing for a mirror. So beyond the theory, it would help to see a wrong
redirect live to understand why it is happening.
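
To make the idea concrete, here is a minimal sketch of how a redirect could
be limited to mirrors known to carry a specific file. This is not
MirrorManager code: the `known_files` structure, the function name, and the
example paths are all assumptions made for illustration.

```python
def mirrors_with_file(known_files, directory, filename):
    """Return the mirrors known to have `filename` in `directory`.

    `known_files` is a hypothetical mapping:
        mirror name -> {directory path -> set of file names seen there}
    """
    return sorted(
        mirror
        for mirror, directories in known_files.items()
        if filename in directories.get(directory, set())
    )


# Hypothetical crawl results for three mirrors.
known_files = {
    "mirror-a": {"releases/images": {"image-1.iso", "image-2.iso"}},
    "mirror-b": {"releases/images": {"image-1.iso"}},
    "mirror-c": {},
}

candidates = mirrors_with_file(known_files, "releases/images", "image-2.iso")
```

With only the 3 most recent files recorded per directory, such a lookup can
miss older files; a complete rsync listing would not have that limitation.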

Adrian


___
infrastructure mailing list -- infrastructure@lists.fedoraproject.org
To unsubscribe send an email to infrastructure-le...@lists.fedoraproject.org


Re: Planned MirrorManager changes

2018-04-16 Thread Adrian Reber
On Sat, Apr 14, 2018 at 12:37:24AM +, Stephen John Smoogen wrote:
> On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber wrote:
> 
> > I would like to change the setup of our mirror crawler and just wanted
> > to mention my planned changes here before working on them.
> >
> > Currently we have two VMs which are crawling our mirrors. Each of the
> > machines is responsible for one half of the active mirrors. The crawl
> > starts every 12 hours on the first crawler and 6 hours later on the
> > second crawler. So every 6 hours one crawler is accessing the database.
> >
> > Currently most of the crawling time is not spent crawling but updating
> > the database about which host has which directory up to date. With a
> > timeout of 4 hours per host we are hitting that timeout on some hosts
> > regularly and most of the time the database access is the problem.
> >
> > What I would like to change is to crawl each category (Fedora Linux,
> > Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> > separately and at different times and intervals.
> >
> > We would not hit the timeout as often as now as only the information for
> > a single category has to be updated. We could scan 'Fedora Archive' only
> > once per day or every second day. We can scan 'Fedora EPEL' much more
> > often as it is usually really fast and get better data about the
> > available mirrors.
> >
> > My goal would be to distribute the scanning in such a way to decrease
> > the load on the database and to decrease the cases of mirror
> > auto-deactivation due to slow database accesses.
> >
> > Let me know if you think that these planned changes are the wrong
> > direction or if you have other ideas how to improve the mirror crawling.
> 
> These look like a good way to deal with the fact that we have a lot of data
> and files and mirrors, and users get confused about how up to date they are.
> Would more VMs also help spread this out?

From my point of view the main problem is the load MirrorManager creates
on the database. Currently I do not think that more VMs would help the
crawling. Someone once mentioned a dedicated database VM for
MirrorManager. That is something which could make a difference, but
first I would like to see if crawling per category can improve the
situation.

Adrian




Re: Planned MirrorManager changes

2018-04-14 Thread Kevin Fenzi
On 04/13/2018 08:14 AM, Adrian Reber wrote:
> 
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
> 
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
> 
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
> 
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
> 
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
> 
> My goal would be to distribute the scanning in such a way to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses. 
> 
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas how to improve the mirror crawling.

Sounds like all great ideas to me. ;)

I wonder if we could also find some way to note which mirrors have
iso/image files, and could communicate this to the
download.fedoraproject.org redirect to only redirect people to mirrors
that have that specific file if they are pointing to an iso/qcow2, etc.

Anyhow, the crawler changes sound good to me and thanks again for
working on it.

kevin






Re: Planned MirrorManager changes

2018-04-13 Thread Stephen John Smoogen
On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber wrote:

>
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
>
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
>
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
>
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
>
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
>
> My goal would be to distribute the scanning in such a way to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses.
>
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas how to improve the mirror crawling.
>

These look like a good way to deal with the fact that we have a lot of data
and files and mirrors, and users get confused about how up to date they are.
Would more VMs also help spread this out?




> Adrian
-- 
Stephen J Smoogen.


Planned MirrorManager changes

2018-04-13 Thread Adrian Reber

I would like to change the setup of our mirror crawler and just wanted
to mention my planned changes here before working on them.

Currently we have two VMs which are crawling our mirrors. Each of the
machines is responsible for one half of the active mirrors. The crawl
starts every 12 hours on the first crawler and 6 hours later on the
second crawler. So every 6 hours one crawler is accessing the database.
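
The arithmetic of the current setup can be sketched as follows. This is only
an illustration: the start hour of 0, the modulo split of hosts, and both
function names are assumptions, not how the crawlers are actually configured.

```python
def crawler_for_host(host_id, num_crawlers=2):
    """Assign a host to a crawler (hypothetical split by host id)."""
    return host_id % num_crawlers


def crawl_start_hours(crawler_index, period_hours=12, offset_hours=6):
    """Hours of the day at which a crawler starts a run.

    Each crawler runs every `period_hours`; each further crawler starts
    `offset_hours` after the previous one.
    """
    first = (crawler_index * offset_hours) % period_hours
    return list(range(first, 24, period_hours))


# Merge both schedules to see how often the database is accessed.
all_starts = sorted(crawl_start_hours(0) + crawl_start_hours(1))
```

Under these assumptions the first crawler starts at hours 0 and 12, the
second at 6 and 18, so some crawler starts a run every 6 hours.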

Currently most of the crawling time is not spent crawling but updating
the database about which host has which directory up to date. With a
timeout of 4 hours per host we are hitting that timeout on some hosts
regularly and most of the time the database access is the problem.

What I would like to change is to crawl each category (Fedora Linux,
Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
separately and at different times and intervals.

We would not hit the timeout as often as now as only the information for
a single category has to be updated. We could scan 'Fedora Archive' only
once per day or every second day. We can scan 'Fedora EPEL' much more
often as it is usually really fast and get better data about the
available mirrors.
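
A minimal sketch of what per-category intervals could look like. The category
names are the ones listed above; the concrete intervals (apart from 'Fedora
Archive' once per day) and the function name are assumptions for
illustration, not a finished design.

```python
from datetime import datetime, timedelta

# Hypothetical intervals: only "Archive less often, EPEL more often"
# comes from the proposal; the numbers are placeholders.
CRAWL_INTERVALS = {
    "Fedora Linux": timedelta(hours=12),
    "Fedora Other": timedelta(hours=12),
    "Fedora EPEL": timedelta(hours=4),
    "Fedora Secondary Arches": timedelta(hours=12),
    "Fedora Archive": timedelta(hours=24),
}


def categories_due(last_crawl, now):
    """Return the categories whose interval has elapsed since the last crawl.

    `last_crawl` maps category name -> datetime of the last finished crawl;
    a category that was never crawled is always due.
    """
    due = []
    for category, interval in CRAWL_INTERVALS.items():
        last = last_crawl.get(category)
        if last is None or now - last >= interval:
            due.append(category)
    return due


now = datetime(2018, 4, 13, 12, 0)
last_crawl = {
    "Fedora Linux": now - timedelta(hours=1),
    "Fedora Other": now - timedelta(hours=13),
    "Fedora EPEL": now - timedelta(hours=5),
    "Fedora Archive": now - timedelta(hours=23),
}
due_now = categories_due(last_crawl, now)
```

Each run then only has to update the database for the categories that are
due, instead of for every directory on every host.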

My goal would be to distribute the scanning in such a way to decrease
the load on the database and to decrease the cases of mirror
auto-deactivation due to slow database accesses. 

Let me know if you think that these planned changes are the wrong
direction or if you have other ideas how to improve the mirror crawling.

Adrian

