Re: Planned MirrorManager changes
On Sat, Apr 14, 2018 at 04:28:37PM -0700, Kevin Fenzi wrote:
> > I would like to change the setup of our mirror crawler and just wanted
> > to mention my planned changes here before working on them.
> >
> > Currently we have two VMs which are crawling our mirrors. Each of the
> > machines is responsible for one half of the active mirrors. The crawl
> > starts every 12 hours on the first crawler and 6 hours later on the
> > second crawler. So every 6 hours one crawler is accessing the database.
> >
> > Currently most of the crawling time is not spent crawling but updating
> > the database about which host has which directory up to date. With a
> > timeout of 4 hours per host we are hitting that timeout on some hosts
> > regularly and most of the time the database access is the problem.
> >
> > What I would like to change is to crawl each category (Fedora Linux,
> > Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> > separately and at different times and intervals.
> >
> > We would not hit the timeout as often as now as only the information for
> > a single category has to be updated. We could scan 'Fedora Archive' only
> > once per day or every second day. We can scan 'Fedora EPEL' much more
> > often as it is usually really fast and get better data about the
> > available mirrors.
> >
> > My goal would be to distribute the scanning in such a way to decrease
> > the load on the database and to decrease the cases of mirror
> > auto-deactivation due to slow database accesses.
> >
> > Let me know if you think that these planned changes are the wrong
> > direction or if you have other ideas how to improve the mirror crawling.
>
> Sounds like all great ideas to me. ;)

Thanks.

> I wonder if we could also find some way to note which mirrors have
> iso/image files, and could communicate this to the
> download.fedoraproject.org redirect to only redirect people to mirrors
> that have that specific file if they are pointing to an iso/qcow2, etc.
This is one of the cases where MirrorManager, in theory, should almost
handle it correctly. The important part of this sentence is 'in theory'.

MirrorManager should know about the 3 most recent files in a directory,
and if we are crawling via rsync we even download the complete listing
for a mirror. So besides the theory it would help to see a wrong
redirect live to understand why it is happening.

		Adrian

signature.asc
Description: PGP signature
___
infrastructure mailing list -- infrastructure@lists.fedoraproject.org
To unsubscribe send an email to infrastructure-le...@lists.fedoraproject.org
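A minimal sketch of the idea discussed above, redirecting only to mirrors
that are known to carry a specific file. This is not MirrorManager code:
the function name, the mirror names and the `known_files` mapping are all
hypothetical, and the sketch simply assumes that a per-mirror file listing
(for example from an rsync crawl) is available as a set of paths.

```python
def mirrors_with_file(candidates, known_files, path):
    """Keep only the candidate mirrors whose recorded listing contains path.

    candidates  -- list of mirror names (hypothetical identifiers)
    known_files -- dict mapping mirror name to a set of file paths
                   (assumed to come from a crawl; not a real MirrorManager API)
    path        -- the requested file, e.g. an iso or qcow2 image
    """
    # A mirror with no recorded listing is treated as not having the file.
    return [m for m in candidates if path in known_files.get(m, set())]


if __name__ == "__main__":
    known = {
        "mirror-a": {"pub/example/image.iso"},
        "mirror-b": set(),
    }
    print(mirrors_with_file(["mirror-a", "mirror-b", "mirror-c"],
                            known, "pub/example/image.iso"))
```

Treating an unknown mirror as "does not have the file" is a design choice of
this sketch; it trades fewer redirect targets for fewer broken downloads.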
Re: Planned MirrorManager changes
On Sat, Apr 14, 2018 at 12:37:24AM +, Stephen John Smoogen wrote:
> On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber wrote:
>
> > I would like to change the setup of our mirror crawler and just wanted
> > to mention my planned changes here before working on them.
> >
> > Currently we have two VMs which are crawling our mirrors. Each of the
> > machines is responsible for one half of the active mirrors. The crawl
> > starts every 12 hours on the first crawler and 6 hours later on the
> > second crawler. So every 6 hours one crawler is accessing the database.
> >
> > Currently most of the crawling time is not spent crawling but updating
> > the database about which host has which directory up to date. With a
> > timeout of 4 hours per host we are hitting that timeout on some hosts
> > regularly and most of the time the database access is the problem.
> >
> > What I would like to change is to crawl each category (Fedora Linux,
> > Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> > separately and at different times and intervals.
> >
> > We would not hit the timeout as often as now as only the information for
> > a single category has to be updated. We could scan 'Fedora Archive' only
> > once per day or every second day. We can scan 'Fedora EPEL' much more
> > often as it is usually really fast and get better data about the
> > available mirrors.
> >
> > My goal would be to distribute the scanning in such a way to decrease
> > the load on the database and to decrease the cases of mirror
> > auto-deactivation due to slow database accesses.
> >
> > Let me know if you think that these planned changes are the wrong
> > direction or if you have other ideas how to improve the mirror crawling.
>
> These look like a good way to deal with the fact that we have a lot of data
> and files and mirrors and users get confused about how up to date they are.
> Would more VMs help spread this out also?
From my point of view the main problem is the load MirrorManager creates
on the database. Currently I do not think that more VMs would help the
crawling.

Someone once mentioned a dedicated database VM for MirrorManager. That
is something which could make a difference, but first I would like to
see if crawling per category can improve the situation.

		Adrian
Re: Planned MirrorManager changes
On 04/13/2018 08:14 AM, Adrian Reber wrote:
>
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
>
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
>
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
>
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
>
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
>
> My goal would be to distribute the scanning in such a way to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses.
>
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas how to improve the mirror crawling.

Sounds like all great ideas to me. ;)

I wonder if we could also find some way to note which mirrors have
iso/image files, and could communicate this to the
download.fedoraproject.org redirect to only redirect people to mirrors
that have that specific file if they are pointing to an iso/qcow2, etc.

Anyhow, the crawler changes sound good to me and thanks again for
working on it.
kevin
Re: Planned MirrorManager changes
On Fri, Apr 13, 2018 at 11:14 AM Adrian Reber wrote:
>
> I would like to change the setup of our mirror crawler and just wanted
> to mention my planned changes here before working on them.
>
> Currently we have two VMs which are crawling our mirrors. Each of the
> machines is responsible for one half of the active mirrors. The crawl
> starts every 12 hours on the first crawler and 6 hours later on the
> second crawler. So every 6 hours one crawler is accessing the database.
>
> Currently most of the crawling time is not spent crawling but updating
> the database about which host has which directory up to date. With a
> timeout of 4 hours per host we are hitting that timeout on some hosts
> regularly and most of the time the database access is the problem.
>
> What I would like to change is to crawl each category (Fedora Linux,
> Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
> separately and at different times and intervals.
>
> We would not hit the timeout as often as now as only the information for
> a single category has to be updated. We could scan 'Fedora Archive' only
> once per day or every second day. We can scan 'Fedora EPEL' much more
> often as it is usually really fast and get better data about the
> available mirrors.
>
> My goal would be to distribute the scanning in such a way to decrease
> the load on the database and to decrease the cases of mirror
> auto-deactivation due to slow database accesses.
>
> Let me know if you think that these planned changes are the wrong
> direction or if you have other ideas how to improve the mirror crawling.
>

These look like a good way to deal with the fact that we have a lot of
data and files and mirrors and users get confused about how up to date
they are. Would more VMs help spread this out also?

>                 Adrian

-- 
Stephen J Smoogen.
Planned MirrorManager changes
I would like to change the setup of our mirror crawler and just wanted
to mention my planned changes here before working on them.

Currently we have two VMs which are crawling our mirrors. Each of the
machines is responsible for one half of the active mirrors. The crawl
starts every 12 hours on the first crawler and 6 hours later on the
second crawler. So every 6 hours one crawler is accessing the database.

Currently most of the crawling time is not spent crawling but updating
the database about which host has which directory up to date. With a
timeout of 4 hours per host we are hitting that timeout on some hosts
regularly and most of the time the database access is the problem.

What I would like to change is to crawl each category (Fedora Linux,
Fedora Other, Fedora EPEL, Fedora Secondary Arches, Fedora Archive)
separately and at different times and intervals.

We would not hit the timeout as often as now as only the information for
a single category has to be updated. We could scan 'Fedora Archive' only
once per day or every second day. We can scan 'Fedora EPEL' much more
often as it is usually really fast and get better data about the
available mirrors.

My goal would be to distribute the scanning in such a way to decrease
the load on the database and to decrease the cases of mirror
auto-deactivation due to slow database accesses.

Let me know if you think that these planned changes are the wrong
direction or if you have other ideas how to improve the mirror crawling.

		Adrian
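To make the per-category proposal concrete, here is a rough sketch of how
"different times and intervals" per category could be expressed. It is only
an illustration, not the actual crawler setup: the function name is made up,
and the interval values are assumptions except for the two hints in the
message (Fedora Archive once per day or every second day, Fedora EPEL much
more often).

```python
from datetime import datetime, timedelta

# Hypothetical crawl intervals per category. Only the relative ordering for
# 'Fedora Archive' (rarely) and 'Fedora EPEL' (often) is taken from the
# proposal; the concrete numbers are placeholders.
CRAWL_INTERVALS = {
    "Fedora Linux": timedelta(hours=12),
    "Fedora Other": timedelta(hours=12),
    "Fedora EPEL": timedelta(hours=4),
    "Fedora Secondary Arches": timedelta(hours=12),
    "Fedora Archive": timedelta(hours=24),
}


def categories_due(last_crawled, now):
    """Return the categories whose interval has elapsed since the last crawl.

    last_crawled -- dict mapping category name to the datetime of its last
                    crawl; a missing category counts as never crawled
    now          -- the current datetime
    """
    due = []
    for category, interval in CRAWL_INTERVALS.items():
        last = last_crawled.get(category)
        if last is None or now - last >= interval:
            due.append(category)
    return due


if __name__ == "__main__":
    now = datetime(2018, 4, 13, 12, 0)
    # Every category was last crawled 6 hours ago.
    last = {name: now - timedelta(hours=6) for name in CRAWL_INTERVALS}
    print(categories_due(last, now))
```

With these placeholder values, a run 6 hours after the last crawl only picks
up 'Fedora EPEL', so each run has to update the database for a single
category instead of for everything at once.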