Re: Planned MirrorManager changes
On Sat, Apr 14, 2018 at 04:28:37PM -0700, Kevin Fenzi wrote:
> > I would like to change the setup of our mirror crawler and just wanted
> > to mention my planned changes here before working on them.
> >
> > Currently we have two VMs which are crawling our mirrors. Each of the
> > machines is responsible for one half of the active mirrors. The crawl
> > starts every 12 hours on the first crawler and 6 hours later on the
> > second crawler. So every 6 hours one crawler is accessing the database.
> >
> > Currently most of the crawling time is not spent crawling but updating
> > the database about which host has which directory up to date. With a
> > timeout of 4 hours per host we are hitting that timeout on some hosts
> > regularly and most of the time the database access is the problem.
> >
> > What I would like to change is to crawl each category (Fedora Linux,
> > Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> > separately and at different times and intervals.
> >
> > We would not hit the timeout as often as now as only the information for
> > a single category has to be updated. We could scan 'Fedora Archive' only
> > once per day or every second day. We can scan 'Fedora EPEL' much more
> > often as it is usually really fast and get better data about the
> > available mirrors.
> >
> > My goal would be to distribute the scanning in such a way to decrease
> > the load on the database and to decrease the cases of mirror
> > auto-deactivation due to slow database accesses.
> >
> > Let me know if you think that these planned changes are the wrong
> > direction or if you have other ideas how to improve the mirror crawling.
>
> Sounds like all great ideas to me. ;)

Thanks.

> I wonder if we could also find some way to note which mirrors have
> iso/image files, and could communicate this to the
> download.fedoraproject.org redirect to only redirect people to mirrors
> that have that specific file if they are pointing to an iso/qcow2, etc.

This is one of the cases where MirrorManager, in theory, should handle
it almost correctly. The important part of this sentence is 'in theory'.
MirrorManager should know about the 3 most recent files in a directory,
and if we are crawling via rsync we even download the complete listing
for a mirror. So, theory aside, it would help to see a wrong redirect
live to understand why it is happening.

		Adrian

___
infrastructure mailing list -- infrastructure@lists.fedoraproject.org
To unsubscribe send an email to infrastructure-le...@lists.fedoraproject.org
Re: Planned MirrorManager changes
On Sat, Apr 14, 2018 at 12:37:24AM +, Stephen John Smoogen wrote:
> On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber wrote:
>
> > I would like to change the setup of our mirror crawler and just wanted
> > to mention my planned changes here before working on them.
> >
> > Currently we have two VMs which are crawling our mirrors. Each of the
> > machines is responsible for one half of the active mirrors. The crawl
> > starts every 12 hours on the first crawler and 6 hours later on the
> > second crawler. So every 6 hours one crawler is accessing the database.
> >
> > Currently most of the crawling time is not spent crawling but updating
> > the database about which host has which directory up to date. With a
> > timeout of 4 hours per host we are hitting that timeout on some hosts
> > regularly and most of the time the database access is the problem.
> >
> > What I would like to change is to crawl each category (Fedora Linux,
> > Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> > separately and at different times and intervals.
> >
> > We would not hit the timeout as often as now as only the information for
> > a single category has to be updated. We could scan 'Fedora Archive' only
> > once per day or every second day. We can scan 'Fedora EPEL' much more
> > often as it is usually really fast and get better data about the
> > available mirrors.
> >
> > My goal would be to distribute the scanning in such a way to decrease
> > the load on the database and to decrease the cases of mirror
> > auto-deactivation due to slow database accesses.
> >
> > Let me know if you think that these planned changes are the wrong
> > direction or if you have other ideas how to improve the mirror crawling.
>
> These look like a good way to deal with the fact that we have a lot of data
> and files and mirrors, and users get confused about how up to date they are.
> Would more VMs help spread this out also?

From my point of view the main problem is the load MirrorManager creates
on the database. Currently I do not think that more VMs would help the
crawling. Someone once mentioned a dedicated database VM for
MirrorManager. That is something which could make a difference, but
first I would like to see if crawling per category can improve the
situation.

		Adrian
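
To make the per-category idea concrete, here is a minimal sketch of how a scheduler could decide which category to crawl next. It is not MirrorManager code. The category names come from the proposal, but the interval values, the `CRAWL_INTERVALS` table and the `categories_due` helper are assumptions chosen for illustration; the thread only says that 'Fedora Archive' could be scanned once per day or every second day and 'Fedora EPEL' much more often.

```python
from datetime import datetime, timedelta

# Hypothetical intervals. Only the relative ordering (EPEL often,
# Archive rarely) follows the proposal; the numbers are assumptions.
CRAWL_INTERVALS = {
    "Fedora Linux": timedelta(hours=12),
    "Fedora Other": timedelta(hours=12),
    "Fedora EPEL": timedelta(hours=4),
    "Fedora Secondary Arches": timedelta(hours=12),
    "Fedora Archive": timedelta(hours=48),
}


def categories_due(last_crawled, now):
    """Return the categories whose interval has elapsed, most overdue first.

    last_crawled maps a category name to the datetime of its last crawl;
    a category missing from the mapping counts as never crawled.
    """
    due = []
    for category, interval in CRAWL_INTERVALS.items():
        last = last_crawled.get(category)
        if last is None:
            # Never crawled: treat as maximally overdue.
            due.append((timedelta.max, category))
        elif now - last >= interval:
            due.append((now - last - interval, category))
    due.sort(reverse=True)
    return [category for _, category in due]
```

Starting only the first entry of the returned list on each run would keep a single category writing to the database at a time, which is the effect the proposal is after.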
Re: Planned MirrorManager changes
On 04/13/2018 08:14 AM, Adrian Reber wrote:
>
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
>
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
>
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
>
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
>
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
>
> My goal would be to distribute the scanning in such a way to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses.
>
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas how to improve the mirror crawling.

Sounds like all great ideas to me. ;)

I wonder if we could also find some way to note which mirrors have
iso/image files, and could communicate this to the
download.fedoraproject.org redirect to only redirect people to mirrors
that have that specific file if they are pointing to an iso/qcow2, etc.

Anyhow, the crawler changes sound good to me and thanks again for
working on it.

kevin
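
The redirect idea above can be sketched roughly as follows. This is an illustration only and not how MirrorManager or download.fedoraproject.org is implemented: the `candidate_mirrors` function, the `IMAGE_SUFFIXES` tuple, the shape of the `listings` mapping and the host names in the usage note are all hypothetical. The assumption is that a crawl leaves behind, per mirror, a set of file paths known to exist there.

```python
# Assumed set of extensions that count as "large image" files.
IMAGE_SUFFIXES = (".iso", ".qcow2")


def candidate_mirrors(path, up_to_date, listings):
    """Pick the mirrors eligible for a redirect to `path`.

    up_to_date: hosts considered current for the directory holding `path`
    listings:   host -> set of file paths seen on that host during a crawl
    """
    if not path.endswith(IMAGE_SUFFIXES):
        # Ordinary files: directory-level freshness is enough.
        return sorted(up_to_date)
    # Image files: additionally require that the exact file was seen.
    return sorted(
        host for host in up_to_date if path in listings.get(host, set())
    )
```

With `listings = {"a.example.org": {"releases/28/x.iso"}, "b.example.org": set()}` and both hosts up to date, a request for `releases/28/x.iso` would only consider `a.example.org`, while a request for a repodata file would still consider both.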
Re: Planned MirrorManager changes
On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber wrote:
>
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
>
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
>
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
>
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
>
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
>
> My goal would be to distribute the scanning in such a way to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses.
>
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas how to improve the mirror crawling.
>

These look like a good way to deal with the fact that we have a lot of
data and files and mirrors, and users get confused about how up to date
they are. Would more VMs help spread this out also?

> Adrian

-- 
Stephen J Smoogen.